INEX 2011 Snippet Retrieval Track
Announcements
Final results available.
Overview
The goal of the snippet retrieval track is to determine how best to generate informative snippets for search results. Such snippets should provide sufficient information to allow the user to determine the relevance of each document without needing to view the document itself. Participating organisations will be able to compare the effectiveness of both their focused retrieval systems and their snippet generation systems against those of other participants.
Data
The snippet retrieval track will use the INEX Wikipedia collection. Topics will be recycled from previous ad hoc tracks.
Task
Participating organisations will submit a ranked list of documents and corresponding snippets. For organisations that do not wish to develop their own focused retrieval system, a reference run will be provided, consisting of a ranked list of documents for which snippets should be created.
Each run will be submitted as an XML file (format described below). Each submission should contain 500 snippets per topic, with a maximum of 300 characters per snippet. The snippets themselves may be created in any way you wish: they may consist of summaries, passages from the document, or any other text at all.
Participants may make more than one submission; however, submissions must be ranked in order of importance, as not all submissions may be evaluated.
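As an illustration only (the track does not prescribe any snippet generation method), the sketch below shows one trivial baseline that stays within the 300-character limit: take the opening text of each document and truncate it at a word boundary. The function and variable names are hypothetical.
import re

MAX_SNIPPET_CHARS = 300  # track limit: at most 300 characters per snippet

def baseline_snippet(document_text, limit=MAX_SNIPPET_CHARS):
    # Collapse whitespace, then cut at the last word boundary before the limit.
    text = re.sub(r"\s+", " ", document_text).strip()
    if len(text) <= limit:
        return text
    cut = text.rfind(" ", 0, limit)
    return text[:cut] if cut > 0 else text[:limit]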
The DTD for the submission format is as follows.
<!ELEMENT inex-snippet-submission (description,topic+)>
<!ATTLIST inex-snippet-submission
participant-id CDATA #REQUIRED
run-id CDATA #REQUIRED
>
<!ELEMENT description (#PCDATA)>
<!ELEMENT topic (snippet+)>
<!ATTLIST topic
topic-id CDATA #REQUIRED
>
<!ELEMENT snippet (#PCDATA)>
<!ATTLIST snippet
doc-id CDATA #REQUIRED
rsv CDATA #REQUIRED
>
Each submission must contain the following:
- participant-id: The participant number of the submitting institution.
- run-id: A run ID, which must be unique across all submissions sent from a single participating organisation.
- description: A description of the approach used.
Every run should contain the results for each topic, conforming to the following:
- topic: Contains a ranked list of snippets, ordered by decreasing relevance of the documents they represent.
- topic-id: The ID number of the topic.
- snippet: A snippet representing a document (defined by doc-id).
- doc-id: The document ID.
- rsv: The document's score.
An example submission in the correct format is given below.
<?xml version="1.0"?>
<!DOCTYPE inex-snippet-submission SYSTEM "inex-snippet-submission.dtd">
<inex-snippet-submission participant-id="20" run-id="Qut_Snippet_Run_01">
<description>A description of the approach used.</description>
<topic topic-id="2011001">
<snippet doc-id="16080300" rsv="0.9999">...</snippet>
<snippet doc-id="16371300" rsv="0.9998">...</snippet>
...
</topic>
<topic topic-id="2011002">
<snippet doc-id="1686300" rsv="0.9999">...</snippet>
<snippet doc-id="1751300" rsv="0.9997">...</snippet>
...
</topic>
...
</inex-snippet-submission>
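For reference, here is a minimal sketch (not an official tool) of writing a run in this format. It assumes a hypothetical ranked_runs structure mapping each topic ID to an already-ranked list of (doc_id, rsv, snippet) tuples.
from xml.sax.saxutils import escape, quoteattr

def write_submission(path, participant_id, run_id, description, ranked_runs):
    # ranked_runs: {topic_id: [(doc_id, rsv, snippet), ...]} in rank order (assumed structure).
    with open(path, "w", encoding="utf-8") as out:
        out.write('<?xml version="1.0"?>\n')
        out.write('<!DOCTYPE inex-snippet-submission SYSTEM "inex-snippet-submission.dtd">\n')
        out.write('<inex-snippet-submission participant-id=%s run-id=%s>\n'
                  % (quoteattr(participant_id), quoteattr(run_id)))
        out.write('  <description>%s</description>\n' % escape(description))
        for topic_id, results in ranked_runs.items():
            out.write('  <topic topic-id=%s>\n' % quoteattr(str(topic_id)))
            for doc_id, rsv, snippet in results[:500]:  # at most 500 snippets per topic
                out.write('    <snippet doc-id=%s rsv=%s>%s</snippet>\n'
                          % (quoteattr(str(doc_id)), quoteattr('%.4f' % rsv),
                             escape(snippet[:300])))  # enforce the 300-character limit
            out.write('  </topic>\n')
        out.write('</inex-snippet-submission>\n')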
Relevance assessments
Documents for each topic will be manually assessed for relevance based on the snippets alone, since the goal is to measure how well a snippet allows the user to judge the relevance of the underlying document. Each topic within a submission will be assigned to an assessor, who will assess each document in the ranked list, based on its snippet alone, until a predetermined number of relevant documents has been found or no documents remain. The assignment of assessors to topics will be shuffled across submissions, both to avoid the bias introduced by an assessor judging the same topic twice and to allow each submission to be judged by multiple assessors.
It is hoped that crowd-sourcing can be used for assessment; however, participants should also expect to perform a number of assessments, depending on the number of runs they submit. Because the assessment is based on short snippets rather than whole documents, the assessment load will be relatively light.
Download the assessment tool (with instructions)
Evaluation
The primary evaluation metric, and the one which determines the ranking, is the geometric mean of recall and negative recall (GM), averaged over all topics: sqrt(TN/(TN+FP) * TP/(TP+FN)).
Other metrics include:
- Mean prediction accuracy (MPA) – the percentage of results the assessor correctly assessed, averaged over all topics: (TP+TN)/(TP+FN+TN+FP)
- Mean normalised prediction accuracy (MNPA) – the mean of the rate at which relevant results are correctly assessed and the rate at which irrelevant results are correctly assessed, averaged over all topics: 0.5*TP/(TP+FN) + 0.5*TN/(TN+FP)
- Recall – the percentage of relevant documents correctly assessed, averaged over all topics: TP/(TP+FN)
- Negative recall (NR) or specificity – the percentage of irrelevant documents correctly assessed, averaged over all topics: TN/(TN+FP)
- Positive agreement (PA) – the conditional probability of agreement between snippet assessor and document assessor (i.e. ground truth), given that one of the two judged relevant. Equivalent to F1 score: 2*TP/(2*TP+FP+FN)
- Negative agreement (NA) – the conditional probability of agreement between snippet assessor and document assessor (i.e. ground truth), given that one of the two judged irrelevant: 2*TN/(2*TN+FP+FN)
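To make the definitions concrete, the sketch below computes these measures for a single topic from raw confusion-matrix counts (TP, FP, TN, FN); the official scores are obtained by averaging the per-topic values over all topics. It is illustrative only and assumes all denominators are non-zero.
from math import sqrt

def snippet_metrics(tp, fp, tn, fn):
    # Per-topic measures; assumes each denominator is non-zero.
    recall = tp / (tp + fn)            # relevant documents correctly assessed
    negative_recall = tn / (tn + fp)   # irrelevant documents correctly assessed
    return {
        "GM": sqrt(recall * negative_recall),   # primary metric, determines the ranking
        "MPA": (tp + tn) / (tp + fn + tn + fp),
        "MNPA": 0.5 * recall + 0.5 * negative_recall,
        "Recall": recall,
        "NR": negative_recall,
        "PA": 2 * tp / (2 * tp + fp + fn),
        "NA": 2 * tn / (2 * tn + fp + fn),
    }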
Schedule
- July 15: Release of training topics.
- Aug 13: Release of final topics.
- Oct 7: Submission deadline.
- Nov 11: Relevance assessments due.
- Nov 18: Release of evaluation results.
Organizers
Shlomo Geva
QUT
s.geva@qut.edu.au
Andrew Trotman
University of Otago
andrew@cs.otago.ac.nz
Falk Scholer
RMIT
falk.scholer@rmit.edu.au
Mark Sanderson
RMIT
mark.sanderson@rmit.edu.au
Matthew Trappett
QUT
matthew.trappett@qut.edu.au