Collocation by Chi Square

Roughly speaking, a "collocation" is an -gram (subsequence) that appears more frequently in some sequence than would be expected if the constituent parts of the -gram were drawn at random. One way of making this definition more precise is to create a matrix. This Demonstration shows how this concept can be built into an algorithm when the sequence is stored as a "trie", that is, a data structure that converts "sentences" of symbols into a tree structure by requiring that all -grams up to a certain length that appear in a sentence have as their parent "most" of that -gram, that is, all but the rightmost element of the original -gram.
You select the elementary cellular automaton that generates the sentences. You select whether three or four is the maximum length of -grams examined. You further select which -gram you wish to consider. And you select where you want to slice the -gram to convert it into a pseudo-bigram. The system responds with two figures: a trie and a reverse trie, both of which label edges according to the number of times the destination -gram follows the source -gram. In the trie, the -gram you selected is circled in green. The first part of your -gram is circled in magenta. -grams that share the first part of your -gram but that do not share the second part are circled in blue. In the reverse trie, the -gram you selected is again circled in green. The second part of your -gram is circled in magenta. -grams that share the second part of your -gram but that do not share the first part are circled in cyan. -grams that share neither the first nor second part of your -gram are circled in orange. The system then produces the corresponding matrix with the elements of the matrix color coded in the same way as the -grams in the figures. Thus, if blue nodes have incoming edge counts of 7, 7, 15, 17, 15, 11, 21, the top right element of the matrix will be the sum of these edge counts (93) and colored blue. The bottom row of the output shows the values associated with the matrix and shows the likelihood that the sequence of values in the -gram would occur as frequently as would be the case if the values were chosen at random.



  • [Snapshot]
  • [Snapshot]
  • [Snapshot]


The matrix of a bigram with parts and is
where count represents the number of times an -gram appears in the sequence and represents any -gram except . Thus, represents any length-two -gram the first part of which is but the second part of which is not . One can then calculate the value for this matrix and the probability (-statistic) that the first and second parts of the bigram would appear together as frequently as they do by chance. This concept can be extended to -grams by cutting the -gram at all positions (where is the length of the -gram), finding the matrix for this "pseudo-bigram", and then taking the mean or maximum of the results.
A discussion of this methodology may be found at:
J. F. da Silva and G. P. Lopes, "A Local Maxima Method and Fair Dispersion Normalization for Extracting Multi-Word Units from Corpora," in Proceedings of the 6th Meeting on Mathematics of Language, pp. 369–381, Orlando, July 1999.
    • Share:

Embed Interactive Demonstration New!

Just copy and paste this snippet of JavaScript code into your website or blog to put the live Demonstration on your site. More details »

Files require Wolfram CDF Player or Mathematica.

Mathematica »
The #1 tool for creating Demonstrations
and anything technical.
Wolfram|Alpha »
Explore anything with the first
computational knowledge engine.
MathWorld »
The web's most extensive
mathematics resource.
Course Assistant Apps »
An app for every course—
right in the palm of your hand.
Wolfram Blog »
Read our views on math,
science, and technology.
Computable Document Format »
The format that makes Demonstrations
(and any information) easy to share and
interact with.
STEM Initiative »
Programs & resources for
educators, schools & students. »
Join the initiative for modernizing
math education.
Step-by-Step Solutions »
Walk through homework problems one step at a time, with hints to help along the way.
Wolfram Problem Generator »
Unlimited random practice problems and answers with built-in step-by-step solutions. Practice online or make a printable study sheet.
Wolfram Language »
Knowledge-based programming for everyone.
Powered by Wolfram Mathematica © 2018 Wolfram Demonstrations Project & Contributors  |  Terms of Use  |  Privacy Policy  |  RSS Give us your feedback
Note: To run this Demonstration you need Mathematica 7+ or the free Mathematica Player 7EX
Download or upgrade to Mathematica Player 7EX
I already have Mathematica Player or Mathematica 7+