Term Weighting with TF-IDF

[Interactive controls: term weighting (none, term frequency, raw document frequency, inverse document frequency, TF-IDF); stopwords grayed out]

charles bailey was indicted for feloniously stealing on the 29th of december two dressed deer skins value 20 s the property of samuel savage and richard savage richard savage i am a leather seller 63 chiswell street my partner s name is samuel savage a few days previous to the 29th of december i looked out seventy skins for an order these skins being of a bad colour i directed them to be brimstoned to make them of equal colour pale on the 29th in the afternoon i saw them all smooth on a horse a few hours afterwards they appeared very much tumbled and one was thrown into the yard and dirtied i caused them to be brought in the warehouse and counted there was two gone our foreman went to worship street and brought armstrong and vickrey they searched and found this skin in the prisoner s breeches and the other skin was found in the workshop carter i am foreman to samuel and richard savage the seventy skins i was with mr savage looking them out i took them out of the stove and counted them on the horse and on friday i counted them three times over there were no more than sixty eight instead of seventy i went to worship street brought mr armstrong and vickery with me they waited till the men left work and when they came down they were searched and on the prisoner one skin was found john armstrong i went to this gentleman s house after the men came down vickrey and i were searching in one minute vickrey called me i received this skin from him it was taken out of the prisoner s breeches i have had it ever since john vickrey q you were with armstrong a yes while i was searching another man i saw the prisoner very uneasy and his breeches were unbuttoned i put my hand in and took that skin out he said he could not tell how it came there the property produced and identified the prisoner said nothing in his defence called four witnesses who gave him a good character guilty aged 27 confined six months in the house of correction and fined 1 s second middlesex jury before mr recorder

TF-IDF (term frequency-inverse document frequency) is a way of determining which terms in a document should be weighted most heavily when trying to understand what the document is about. The term frequency counts how often a given term appears in the document of interest. The document frequency is measured with respect to a corpus of other documents: it counts how many documents in the corpus contain the term. The terms that are most informative about a particular text have a high term frequency and a low document frequency. The TF-IDF for a term is the product of its term frequency and the scaled inverse of its document frequency; a common scaling is the logarithm of the number of documents in the corpus divided by the document frequency. Stopwords are words that occur so frequently in the language that they rarely convey information about the meaning of a particular document. In this Demonstration, stopwords can be turned on and off, and the font size of each term can be scaled by various weighting factors.
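
The weighting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the Demonstration: the function name tf_idf_weights, the toy sentences, and the choice of log(N/df) as the scaled inverse document frequency (one common convention, see [1]) are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf_weights(doc_tokens, corpus, stopwords=frozenset()):
    """Return {term: tf * log(N / df)} for the non-stopword terms of one document.

    corpus is a list of token lists and is assumed to include the document
    itself, so every term has a document frequency of at least 1.
    """
    n_docs = len(corpus)
    # Document frequency: number of corpus documents that contain each term.
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    # Term frequency: raw count of each term in the document of interest.
    tf = Counter(t for t in doc_tokens if t not in stopwords)
    return {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}


# Hypothetical toy corpus, not taken from the Old Bailey data.
corpus = [
    sentence.split()
    for sentence in (
        "the prisoner stole two deer skins",
        "the prisoner stole a watch",
        "the jury found the prisoner guilty",
    )
]

# Weight the terms of the first document against the whole corpus.
weights = tf_idf_weights(corpus[0], corpus)
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s}{weight:.3f}")
```

In this toy corpus, terms that occur in every document ("the", "prisoner") receive a weight of zero, while terms found only in the first document ("two", "deer", "skins") receive the largest weight. Passing a set of stopwords drops those terms before weighting.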

Contributed by: William J. Turkel (March 2011)
Open content licensed under CC BY-NC-SA

Details

This Demonstration uses Old Bailey Online t18100110-41. The built-in Mathematica function WordData was used to generate a list of stopwords for this Demonstration.

The Old Bailey Online contains digital versions of almost 200,000 criminal trials held at London's central criminal court between 1674 and 1913.

This work is part of the Criminal Intent project, funded by a 2009 Digging into Data challenge grant.

For more about TF-IDF, see [1].

Reference

[1] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, Cambridge: Cambridge University Press, 2008.

Permanent Citation

William J. Turkel "Term Weighting with TF-IDF"
http://demonstrations.wolfram.com/TermWeightingWithTFIDF/
Wolfram Demonstrations Project
Published: March 7 2011

