Collocation by Chi Square


Roughly speaking, a "collocation" is an n-gram (subsequence) that appears more frequently in some sequence than would be expected if the constituent parts of the n-gram were drawn at random. One way of making this definition more precise is to create a chi-square matrix. This Demonstration shows how this concept can be built into an algorithm when the sequence is stored as a "trie", that is, a data structure that converts "sentences" of symbols into a tree in which every n-gram up to a certain length that appears in a sentence has as its parent "most" of that n-gram: all but its rightmost element.
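The trie described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Demonstration's own code: the names `TrieNode`, `build_trie`, and `ngram_count` are hypothetical, and it assumes each sentence is a list of hashable symbols and that every node stores how often its n-gram occurs.

```python
class TrieNode:
    """One node per distinct n-gram; its parent is the n-gram minus its last symbol."""

    def __init__(self):
        self.count = 0       # number of occurrences of the n-gram ending at this node
        self.children = {}   # next symbol -> TrieNode


def build_trie(sentences, max_len):
    """Insert every n-gram of length 1..max_len found in each sentence."""
    root = TrieNode()
    for sentence in sentences:
        for start in range(len(sentence)):
            node = root
            # Walk down the trie, extending the n-gram one symbol at a time.
            for symbol in sentence[start:start + max_len]:
                node = node.children.setdefault(symbol, TrieNode())
                node.count += 1
    return root


def ngram_count(root, ngram):
    """Return how many times ngram was inserted, or 0 if it never appeared."""
    node = root
    for symbol in ngram:
        node = node.children.get(symbol)
        if node is None:
            return 0
    return node.count
```

For example, `build_trie([list("abab"), list("abc")], max_len=2)` stores all unigrams and bigrams of the two sentences, and `ngram_count(root, "ab")` then looks up the bigram by following `a` and then `b` from the root.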


Contributed by: Seth J. Chandler (June 2008)
Open content licensed under CC BY-NC-SA


Details

The chi-square ($\chi^2$) matrix of a bigram with parts $a$ and $b$ is

$$\begin{pmatrix} \mathrm{count}(a\,b) & \mathrm{count}(a\,\neg b) \\ \mathrm{count}(\neg a\,b) & \mathrm{count}(\neg a\,\neg b) \end{pmatrix}$$

where count represents the number of times an n-gram appears in the sequence and $\neg x$ represents any n-gram except $x$. Thus, $a\,\neg b$ represents any length-two n-gram the first part of which is $a$ but the second part of which is not $b$. One can then calculate the $\chi^2$ value for this matrix and the probability ($p$-statistic) that the first and second parts of the bigram would appear together as frequently as they do by chance. This concept can be extended to n-grams by cutting the n-gram at all $n-1$ positions (where $n$ is the length of the n-gram), finding the $\chi^2$ matrix for this "pseudo-bigram", and then taking the mean or maximum of the results.
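The calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Demonstration's implementation: the function names and the cell names `n11`, `n12`, `n21`, `n22` are hypothetical, the counts are taken directly from a flat sequence rather than from a trie, the statistic is the uncorrected Pearson chi-square for a 2×2 table, and the probability uses the fact that, with one degree of freedom, the upper tail equals `erfc(sqrt(chi2 / 2))`.

```python
import math


def chi_square_2x2(n11, n12, n21, n22):
    """Pearson chi-square (no continuity correction) and p-value for a 2x2 table."""
    total = n11 + n12 + n21 + n22
    denom = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
    if denom == 0:
        # A zero row or column total: no evidence of association.
        return 0.0, 1.0
    chi2 = total * (n11 * n22 - n12 * n21) ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # valid for 1 degree of freedom
    return chi2, p_value


def contingency(sequence, left, right):
    """Count windows that match both parts, only the left, only the right, or neither."""
    left, right = tuple(left), tuple(right)
    width = len(left) + len(right)
    n11 = n12 = n21 = n22 = 0
    for i in range(len(sequence) - width + 1):
        window = tuple(sequence[i:i + width])
        has_left = window[:len(left)] == left
        has_right = window[len(left):] == right
        if has_left and has_right:
            n11 += 1
        elif has_left:
            n12 += 1
        elif has_right:
            n21 += 1
        else:
            n22 += 1
    return n11, n12, n21, n22


def ngram_score(sequence, ngram, combine=max):
    """Cut the n-gram at each interior position and combine the chi-square values."""
    ngram = tuple(ngram)
    if len(ngram) < 2:
        raise ValueError("an n-gram needs at least two symbols to be cut")
    scores = [
        chi_square_2x2(*contingency(sequence, ngram[:k], ngram[k:]))[0]
        for k in range(1, len(ngram))
    ]
    return combine(scores)
```

Passing `combine=max` (the default here) takes the maximum over the cuts; passing a mean function such as `lambda s: sum(s) / len(s)` takes the mean instead. Which of the two is preferable depends on the application.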

A discussion of this methodology may be found at:

J. F. da Silva and G. P. Lopes, "A Local Maxima Method and Fair Dispersion Normalization for Extracting Multi-Word Units from Corpora," in Proceedings of the 6th Meeting on Mathematics of Language, pp. 369–381, Orlando, July 1999.

Permanent Citation

Seth J. Chandler "Collocation by Chi Square"
http://demonstrations.wolfram.com/CollocationByChiSquare/
Wolfram Demonstrations Project
Published: June 6 2008

