Collocation by Chi Square

Roughly speaking, a "collocation" is an -gram (subsequence) that appears more frequently in some sequence than would be expected if the constituent parts of the -gram were drawn at random. One way of making this definition more precise is to create a matrix. This Demonstration shows how this concept can be built into an algorithm when the sequence is stored as a "trie", that is, a data structure that converts "sentences" of symbols into a tree structure by requiring that all -grams up to a certain length that appear in a sentence have as their parent "most" of that -gram, that is, all but the rightmost element of the original -gram.
You select the elementary cellular automaton that generates the sentences. You select whether three or four is the maximum length of -grams examined. You further select which -gram you wish to consider. And you select where you want to slice the -gram to convert it into a pseudo-bigram. The system responds with two figures: a trie and a reverse trie, both of which label edges according to the number of times the destination -gram follows the source -gram. In the trie, the -gram you selected is circled in green. The first part of your -gram is circled in magenta. -grams that share the first part of your -gram but that do not share the second part are circled in blue. In the reverse trie, the -gram you selected is again circled in green. The second part of your -gram is circled in magenta. -grams that share the second part of your -gram but that do not share the first part are circled in cyan. -grams that share neither the first nor second part of your -gram are circled in orange. The system then produces the corresponding matrix with the elements of the matrix color coded in the same way as the -grams in the figures. Thus, if blue nodes have incoming edge counts of 7, 7, 15, 17, 15, 11, 21, the top right element of the matrix will be the sum of these edge counts (93) and colored blue. The bottom row of the output shows the values associated with the matrix and shows the likelihood that the sequence of values in the -gram would occur as frequently as would be the case if the values were chosen at random.

THINGS TO TRY

SNAPSHOTS

  • [Snapshot]
  • [Snapshot]
  • [Snapshot]

DETAILS

The matrix of a bigram with parts and is
where count represents the number of times an -gram appears in the sequence and represents any -gram except . Thus, represents any length-two -gram the first part of which is but the second part of which is not . One can then calculate the value for this matrix and the probability (-statistic) that the first and second parts of the bigram would appear together as frequently as they do by chance. This concept can be extended to -grams by cutting the -gram at all positions (where is the length of the -gram), finding the matrix for this "pseudo-bigram", and then taking the mean or maximum of the results.
A discussion of this methodology may be found at:
J. F. da Silva and G. P. Lopes, "A Local Maxima Method and Fair Dispersion Normalization for Extracting Multi-Word Units from Corpora," in Proceedings of the 6th Meeting on Mathematics of Language, pp. 369–381, Orlando, July 1999.
    • Share:

Embed Interactive Demonstration New!

Just copy and paste this snippet of JavaScript code into your website or blog to put the live Demonstration on your site. More details »

Files require Wolfram CDF Player or Mathematica.