Collocation by Chi Square
Roughly speaking, a "collocation" is an $n$-gram (subsequence) that appears more frequently in some sequence than would be expected if the constituent parts of the $n$-gram were drawn at random. One way of making this definition more precise is to create a $\chi^2$ matrix. This Demonstration shows how this concept can be built into an algorithm when the sequence is stored as a "trie", that is, a data structure that converts "sentences" of symbols into a tree in which every $n$-gram up to a certain length that appears in a sentence has as its parent "most" of that $n$-gram: all but its rightmost element.
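The trie described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Demonstration's own code: the names `TrieNode`, `build_trie` and `ngram_count` are hypothetical, and each node is assumed to store how many times its n-gram occurs.

```python
from collections import defaultdict


class TrieNode:
    """One node per distinct n-gram; its parent is the n-gram minus its last symbol."""

    def __init__(self):
        self.count = 0                          # occurrences of this n-gram
        self.children = defaultdict(TrieNode)   # next symbol -> longer n-gram


def build_trie(sentences, max_len):
    """Insert every n-gram of length 1..max_len found in each sentence."""
    root = TrieNode()
    for sentence in sentences:
        for start in range(len(sentence)):
            node = root
            # Walk down one level per symbol, counting each prefix of the window.
            for symbol in sentence[start:start + max_len]:
                node = node.children[symbol]
                node.count += 1
    return root


def ngram_count(root, ngram):
    """Return how often `ngram` was inserted; 0 if it never occurred."""
    node = root
    for symbol in ngram:
        if symbol not in node.children:
            return 0
        node = node.children[symbol]
    return node.count


trie = build_trie([list("abab"), list("abc")], max_len=2)
print(ngram_count(trie, "ab"))
```

Because every n-gram hangs off the node for its own prefix, the counts needed to test a candidate collocation can be read from a single path through the tree.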
Contributed by: Seth J. Chandler (June 2008)
Open content licensed under CC BY-NC-SA
Details
The $\chi^2$ matrix of a bigram with parts $a$ and $b$ is

$$\begin{pmatrix} \operatorname{count}(a\,b) & \operatorname{count}(a\,\neg b) \\ \operatorname{count}(\neg a\,b) & \operatorname{count}(\neg a\,\neg b) \end{pmatrix}$$

where count represents the number of times an $n$-gram appears in the sequence and $\neg x$ represents any $n$-gram except $x$. Thus, $a\,\neg b$ represents any length-two $n$-gram whose first part is $a$ but whose second part is not $b$. One can then calculate the $\chi^2$ value for this matrix and the probability ($p$-value) that the first and second parts of the bigram would appear together as frequently as they do by chance. This concept can be extended to longer $n$-grams by cutting the $n$-gram at each of its $n-1$ interior positions (where $n$ is the length of the $n$-gram), finding the $\chi^2$ matrix for each resulting "pseudo-bigram", and then taking the mean or maximum of the results.
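As a rough sketch of the calculation just described, the Python below builds the 2×2 table for one cut of an n-gram and scores it with Pearson's chi-square statistic. Several things here are assumptions rather than details from the Demonstration: the function names are hypothetical, counts are taken by scanning fixed-length windows of a sequence rather than by reading a trie, no continuity correction is applied, and the p-value uses the one-degree-of-freedom identity P(X > x) = erfc(sqrt(x/2)).

```python
import math


def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    n = o11 + o12 + o21 + o22
    denom = (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    if denom == 0:
        return 0.0, 1.0  # degenerate table: no evidence of association
    stat = n * (o11 * o22 - o12 * o21) ** 2 / denom
    p_value = math.erfc(math.sqrt(stat / 2))  # survival function of chi-square, 1 d.f.
    return stat, p_value


def cut_table(sequence, ngram, cut):
    """Counts (a b, a not-b, not-a b, not-a not-b) for the pseudo-bigram split at `cut`."""
    ngram = tuple(ngram)
    left, right = ngram[:cut], ngram[cut:]
    o11 = o12 = o21 = o22 = 0
    for i in range(len(sequence) - len(ngram) + 1):
        window = tuple(sequence[i:i + len(ngram)])
        has_left = window[:cut] == left
        has_right = window[cut:] == right
        if has_left and has_right:
            o11 += 1
        elif has_left:
            o12 += 1
        elif has_right:
            o21 += 1
        else:
            o22 += 1
    return o11, o12, o21, o22


def collocation_score(sequence, ngram, combine=max):
    """Combine the chi-square statistics over all interior cuts of `ngram`."""
    if len(ngram) < 2:
        raise ValueError("an n-gram needs at least two symbols to be cut")
    stats = [chi_square_2x2(*cut_table(sequence, ngram, cut))[0]
             for cut in range(1, len(ngram))]
    return combine(stats)


print(collocation_score(list("abab"), "ab"))
```

Passing `combine=statistics.mean` instead of the default `max` would give the mean variant mentioned in the text. The chi-square statistic measures association in either direction, so a sketch like this would also need to check that the observed joint count exceeds its expected value before calling an n-gram a collocation.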
A discussion of this methodology may be found at:
J. F. da Silva and G. P. Lopes, "A Local Maxima Method and Fair Dispersion Normalization for Extracting Multi-Word Units from Corpora," in Proceedings of the 6th Meeting on Mathematics of Language, pp. 369–381, Orlando, July 1999.