Comparing Information Retrieval Evaluation Measures

Compare some common information retrieval evaluation measures on the results returned by two systems for the same query. Both systems are assumed to retrieve the same number of results for that query; this number can be regarded as a cutoff value.
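
As a rough illustration of two such measures, the sketch below computes precision and recall at the cutoff for two result lists. The function name, the example lists, and the count of four relevant items are hypothetical; the definitions are the usual textbook ones and are not necessarily identical to those used in the Demonstration.

```python
def precision_recall(ranked_relevance, total_relevant):
    """Precision and recall over a full result list.

    ranked_relevance: list of booleans in rank order (True = relevant).
    total_relevant: number of items known to be relevant to the query.
    The length of the list plays the role of the cutoff value.
    """
    hits = sum(ranked_relevance)
    cutoff = len(ranked_relevance)
    precision = hits / cutoff if cutoff else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall


# Hypothetical results from two systems for the same query (cutoff = 5).
system_a = [True, False, True, False, False]
system_b = [False, True, True, True, False]
total_relevant = 4  # assumed property of the test collection

pr_a = precision_recall(system_a, total_relevant)
pr_b = precision_recall(system_b, total_relevant)
print("System A (precision, recall):", pr_a)
print("System B (precision, recall):", pr_b)
```

Because both lists have the same length, the precision values are directly comparable at the same cutoff.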


Mark a retrieved item as relevant or nonrelevant by clicking a number on either side and watch how the measures change. Mouse over the measures to see their definitions.

To evaluate and compare information retrieval systems, one traditionally uses a "test collection" composed of three entities:
• a set of queries (or "information needs")
• a target dataset (where to look for items that will satisfy the information needs)
• a set of relevance assessments (the items that satisfy the information needs)

In this Demonstration the test collection is generated artificially and is assumed to be available.

Based on this test collection one can compare the results of a query using two different information retrieval systems. The items retrieved by the two systems are represented by the two sequences of circles on the left and right.

There are two ways to interact with this Demonstration:
1. by using the controls at the top
2. by marking items as relevant/nonrelevant in the graphics

Use the controls to set the number of relevant and nonrelevant items for the query being considered. These numbers are characteristics of the test collection for each given query.

The relevant and nonrelevant items do not necessarily add up to all the items in the collection. They are only those items whose relevance to a given query is known because it has been assessed (or inferred); the remaining items are unjudged.
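
One way to picture this distinction is to store the assessments for a query as two sets and treat everything else as unjudged. The sketch below is only illustrative: the document identifiers, the set contents, and the `judgment` helper are all hypothetical.

```python
# Hypothetical collection of ten items and the assessments for one query.
collection = {f"d{i}" for i in range(1, 11)}
relevant = {"d1", "d4", "d7"}      # assessed as relevant
nonrelevant = {"d2", "d3", "d9"}   # assessed as nonrelevant


def judgment(doc_id):
    """Return the assessment status of an item for this query."""
    if doc_id in relevant:
        return "relevant"
    if doc_id in nonrelevant:
        return "nonrelevant"
    return "unjudged"


# Items with no assessment at all.
unjudged = collection - relevant - nonrelevant

# Hypothetical result list returned by one system.
retrieved = ["d4", "d5", "d2", "d7", "d10"]
labels = [judgment(d) for d in retrieved]
print(labels)
print(len(relevant) + len(nonrelevant), "judged out of", len(collection))
```

Here six of the ten items are judged, so the relevant and nonrelevant counts do not sum to the size of the collection.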

The main concern of information retrieval evaluation is to decide which of two retrieval systems is better, based on a test collection. Is it the system that retrieves the most relevant items? Or the one with the higher average precision? Or the one with the higher normalized cumulative gain?
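
These questions can have different answers for the same pair of result lists. The sketch below uses common textbook definitions with binary relevance (average precision as the sum of precision values at the ranks of relevant items divided by the total number of relevant items, and normalized cumulative gain as the number of relevant items retrieved divided by the best achievable number at the cutoff). These definitions, the function names, and the example data are assumptions and may differ from the ones implemented in the Demonstration.

```python
def average_precision(ranked_relevance, total_relevant):
    """Sum of precision at each relevant rank, divided by total_relevant."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0


def normalized_cumulative_gain(ranked_relevance, total_relevant):
    """Binary-gain nCG at the cutoff: retrieved gain over ideal gain."""
    cutoff = len(ranked_relevance)
    ideal = min(cutoff, total_relevant)
    return sum(ranked_relevance) / ideal if ideal else 0.0


# Hypothetical results for one query with 4 known relevant items.
total_relevant = 4
system_a = [True, True, False, False, False]   # 2 relevant, ranked first
system_b = [False, False, True, True, True]    # 3 relevant, ranked last

ap_a = average_precision(system_a, total_relevant)
ap_b = average_precision(system_b, total_relevant)
ncg_a = normalized_cumulative_gain(system_a, total_relevant)
ncg_b = normalized_cumulative_gain(system_b, total_relevant)

print(f"A: AP={ap_a:.3f}  nCG={ncg_a:.3f}")
print(f"B: AP={ap_b:.3f}  nCG={ncg_b:.3f}")
```

In this example system B retrieves more relevant items and has the higher normalized cumulative gain, while system A has the higher average precision because its relevant items appear at the top of the list.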


Contributed by: Giovanna Roda (March 2011)
Open content licensed under CC BY-NC-SA




