Collocation by Symmetric Conditional Probability

Roughly speaking, a "collocation" is an n-gram (subsequence) that appears more frequently in some sequence than would be expected if the constituent parts of the n-gram were drawn at random. One way of making this definition more precise is to use the notion of "symmetric conditional probability". For a bigram (i.e., an n-gram with two parts), the symmetric conditional probability is the product of (1) the conditional probability that the second half of the bigram appears given the first half and (2) the conditional probability that the first half of the bigram appears given the second half. Bigrams with high symmetric conditional probability may be thought of as collocations. This concept can be extended to longer n-grams by cutting the n-gram at each of its n-1 positions (where n is the length of the n-gram), finding the symmetric conditional probability of each resulting "pseudo-bigram", and then taking the mean of the results.
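The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the Demonstration's own code: the function names count_ngrams and scp are hypothetical, and the conditional probabilities are estimated as simple ratios of raw occurrence counts, which is only one possible choice.

```python
from collections import Counter

def count_ngrams(sentences, max_len):
    """Count every contiguous n-gram of length 1..max_len in each sentence."""
    counts = Counter()
    for sentence in sentences:
        for n in range(1, max_len + 1):
            for i in range(len(sentence) - n + 1):
                counts[tuple(sentence[i:i + n])] += 1
    return counts

def scp(ngram, counts):
    """Mean, over all cut positions, of P(second | first) * P(first | second).

    Assumption: each conditional probability is estimated as
    count(whole n-gram) / count(conditioning part).
    """
    ngram = tuple(ngram)
    if len(ngram) < 2:
        raise ValueError("need an n-gram with at least two parts")
    whole = counts[ngram]
    products = []
    for cut in range(1, len(ngram)):  # n - 1 possible cut positions
        first, second = ngram[:cut], ngram[cut:]
        if counts[first] == 0 or counts[second] == 0:
            products.append(0.0)  # a part never observed: treat as probability 0
            continue
        p_second_given_first = whole / counts[first]
        p_first_given_second = whole / counts[second]
        products.append(p_second_given_first * p_first_given_second)
    return sum(products) / len(products)

# Two short made-up "sentences" over the symbols 0 and 1.
sentences = [[0, 1, 0, 1], [0, 1, 1]]
counts = count_ngrams(sentences, max_len=3)
print(scp((0, 1), counts))     # bigram: a single cut
print(scp((0, 1, 0), counts))  # trigram: mean over two cuts
```

For a bigram there is only one cut, so the mean reduces to the single product described above.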

This Demonstration shows how this concept can be built into an algorithm when the sequence is stored as a "trie", that is, a data structure that converts "sentences" of symbols into a tree by requiring that every n-gram up to a certain length that appears in a sentence has as its parent "most" of that n-gram, that is, all but its rightmost element. You select the elementary cellular automaton that generates the sentences. You select whether 3 or 4 is the maximum length of n-grams examined. You further select which n-gram of length 2 or more you wish to consider. And you select where you want to slice the n-gram to convert it into a pseudo-bigram. The system responds with a figure showing the conditional probability of the second part of the pseudo-bigram given the first part and a figure showing the conditional probability of the first part of the pseudo-bigram given the second part. The green circle shows the n-gram you selected. The purple dashed n-gram shows the relevant parent of that n-gram and the blue dashed n-grams show all appropriate children of the parent. The bottom row of the output shows how the symmetric conditional probability (the mean of the products of the conditional probabilities over all possible slicings of the n-gram) is computed.
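The trie described above can be sketched as follows. This is a hedged illustration under stated assumptions rather than the Demonstration's implementation: the names TrieNode, build_trie, find and forward_conditional are hypothetical, each node stores a raw occurrence count, and only the "second part given first part" probability is shown, since it follows directly from a node and its ancestor.

```python
class TrieNode:
    """One n-gram; its parent is the same n-gram minus the rightmost symbol."""
    def __init__(self):
        self.count = 0
        self.children = {}

def build_trie(sentences, max_len):
    """Insert every n-gram of length 1..max_len found in the sentences."""
    root = TrieNode()
    for sentence in sentences:
        for start in range(len(sentence)):
            node = root
            # Walk down one level per symbol, counting each prefix once.
            for symbol in sentence[start:start + max_len]:
                node = node.children.setdefault(symbol, TrieNode())
                node.count += 1
    return root

def find(root, ngram):
    """Return the node for ngram, or None if it never occurred."""
    node = root
    for symbol in ngram:
        node = node.children.get(symbol)
        if node is None:
            return None
    return node

def forward_conditional(root, ngram, cut):
    """Estimate P(ngram[cut:] | ngram[:cut]) as count(ngram) / count(ngram[:cut])."""
    if not 1 <= cut < len(ngram):
        raise ValueError("cut must fall strictly inside the n-gram")
    whole = find(root, ngram)
    first = find(root, ngram[:cut])
    if whole is None or first is None:
        return 0.0
    return whole.count / first.count

# Two short made-up "sentences" over the symbols 0 and 1.
sentences = [[0, 1, 0, 1], [0, 1, 1]]
root = build_trie(sentences, max_len=3)
print(forward_conditional(root, (0, 1, 0), cut=2))
```

Because the first part of a pseudo-bigram is always an ancestor of the full n-gram in this layout, the forward probability needs only two lookups along one path. The reverse probability (first part given second part) conditions on a suffix, which does not lie on that path, so it needs counts gathered from elsewhere in the trie.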

Contributed by: Seth J. Chandler (March 2011)
Open content licensed under CC BY-NC-SA


Details

The Demonstration executes faster when the maximum length of n-grams examined is set to 3.

An application of the concept behind this Demonstration will be presented at the 2008 International Mathematica Symposium, where it will be used as part of an effort to find critical concepts in the leading international treaty governing sales of goods.

Snapshot 1: Examines the n-gram {0,0,0} in sentences generated by Rule 153. Slice position for creating the pseudo-bigram is 1.

Snapshot 2: Examines the n-gram {0,0,0} in sentences generated by Rule 153. Slice position for creating the pseudo-bigram is 2.

Snapshot 3: Examines the n-gram {0,0,0} in sentences generated by Rule 130. Slice position for creating the pseudo-bigram is 1. It has a high symmetric conditional probability and could be a collocation.


