The goal of the snippet retrieval track is to determine how best to generate informative snippets for search results. Such snippets should provide sufficient information to allow the user to determine the relevance of each document to their query, without needing to view the document itself. Participating organisations will be able to compare the effectiveness of their snippet generation approaches against those of other participants.
For the convenience of those who have participated in the Snippet Retrieval track in either 2011 or 2012, there is a list of changes at the end of this document.
A set of topics (or queries) has been provided, and each one has a corresponding set of search results, taken from the document collection (described below). The task is to automatically generate text snippets for each of these search results, the goal being to provide sufficient information for the user to determine the relevance of the underlying document.
Each run will be submitted in the form of an XML file (format described below). Each submission should contain the exact same documents as the provided reference run.
Each snippet may contain a maximum of 180 characters; any snippet longer than this will be truncated. The snippets may be created in any way you wish: they may consist of summaries, passages from the document, or any other text at all. Note that the document title will be shown alongside the snippet in the assessment software, so it is not necessary to include it within the snippet itself (although this is not explicitly disallowed).
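As a minimal, unofficial sketch, a snippet generator might trim its candidate text to the limit at a word boundary rather than rely on the hard truncation described above; the function name and word-boundary heuristic below are purely illustrative.

# Illustrative sketch only: trim a candidate snippet to the 180-character
# limit at a word boundary, so the organisers' hard cut never splits a word.
def trim_snippet(text, limit=180):
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Drop a trailing partial word, if there is one.
    return cut.rsplit(' ', 1)[0] if ' ' in cut else cut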
Participants may make more than one submission; however, submissions must be ranked in order of importance, as it may not be possible to evaluate them all. Please note that participants will be required to assess one additional snippet assessment package for each submission they make.
This year, the Snippet Retrieval track will be using the same document collection as the Tweet Contextualisation track, based on a dump of the English Wikipedia from November 2012. A link to the full dataset, along with a user name and password, is given here.
The set of topics is the same as used in 2012. There are 35 topics in total, and the reference run contains 20 results for each topic. Since the task is to generate snippets for the documents given in the reference run, a link to an archive containing only those 700 documents (as well as the reference run submission file itself) can be found here.
The documents are in a simple XML format. Documents consist of a title (title), an abstract (a) and sections (s). Each section has a sub-title (h). Abstract and sections are made of paragraphs (p) and each paragraph can have entities (t) that refer to Wikipedia pages. The DTD is given below:
<!ELEMENT xml (page)+>
<!ELEMENT page (ID, title, a, s*)>
<!ELEMENT ID (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT a (p+)>
<!ELEMENT s (h, p+)>
<!ATTLIST s o CDATA #REQUIRED>
<!ELEMENT h (#PCDATA)>
<!ELEMENT p (#PCDATA | t)*>
<!ATTLIST p o CDATA #REQUIRED>
<!ELEMENT t (#PCDATA)>
<!ATTLIST t e CDATA #IMPLIED>
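As an illustrative (unofficial) sketch, a document in this format can be read with Python's standard library; here the title and the opening paragraph of the abstract are extracted as raw material for a snippet. The function name and the choice of the first abstract paragraph are assumptions made for illustration only.

import xml.etree.ElementTree as ET

def title_and_abstract(path):
    # The root element is <xml>, containing one or more <page> elements.
    page = ET.parse(path).getroot().find('page')
    title = page.findtext('title', default='')
    # Paragraphs may contain nested <t> entity elements, so gather all text.
    first_p = page.find('./a/p')
    abstract = ''.join(first_p.itertext()).strip() if first_p is not None else ''
    return title, abstract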
Note: the reference run provided is already in the correct format. To generate a valid submission, only the snippet text itself needs to be modified, as well as the metadata describing the run (participant-id, run-id, and description).
The DTD for the submission format is as follows.
<!ELEMENT inex-snippet-submission (description,topic+)>
<!ATTLIST inex-snippet-submission
participant-id CDATA #REQUIRED
run-id CDATA #REQUIRED
>
<!ELEMENT description (#PCDATA)>
<!ELEMENT topic (snippet+)>
<!ATTLIST topic
topic-id CDATA #REQUIRED
>
<!ELEMENT snippet (#PCDATA)>
<!ATTLIST snippet
doc-id CDATA #REQUIRED
rsv CDATA #REQUIRED
>
Each submission must include the run metadata (participant-id, run-id, and a description of the approach), and every run must contain results for each topic, covering the same documents as the reference run.
A correct submission will look similar to this:
<?xml version="1.0"?>
<!DOCTYPE inex-snippet-submission SYSTEM "inex-snippet-submission.dtd">
<inex-snippet-submission participant-id="20" run-id="QUT_Snippet_Run_01">
<description>A description of the approach used.</description>
<topic topic-id="2013001">
<snippet doc-id="7286939" rsv="0.9999">...</snippet>
<snippet doc-id="1760504" rsv="0.9998">...</snippet>
...
</topic>
<topic topic-id="2013002">
<snippet doc-id="11733666" rsv="0.9999">...</snippet>
<snippet doc-id="3659889" rsv="0.9997">...</snippet>
...
</topic>
...
</inex-snippet-submission>
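The following is a minimal sketch (not an official tool) of producing a file in this format with Python's standard library; the function and argument names are illustrative, and the DOCTYPE declaration shown above would still need to be prepended to the output.

import xml.etree.ElementTree as ET

def write_run(path, participant_id, run_id, description, topics):
    """topics maps a topic-id to a list of (doc-id, rsv, snippet text) tuples."""
    root = ET.Element('inex-snippet-submission',
                      {'participant-id': participant_id, 'run-id': run_id})
    ET.SubElement(root, 'description').text = description
    for topic_id, snippets in topics.items():
        topic = ET.SubElement(root, 'topic', {'topic-id': topic_id})
        for doc_id, rsv, text in snippets:
            snippet = ET.SubElement(topic, 'snippet',
                                    {'doc-id': doc_id, 'rsv': str(rsv)})
            # Snippets longer than 180 characters will be truncated anyway.
            snippet.text = text[:180]
    ET.ElementTree(root).write(path, encoding='utf-8', xml_declaration=True)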
Participating organisations will be required to perform assessment on other participants' submissions. Both snippet-based and document-based assessment will be used, with evaluation based on comparing these two sets of assessments.
For each submission received from a participating organisation, that organisation will be given a snippet assessment package (the size of a single submission) to assess. For each topic, the assessor will read through the details of the topic, then read through each snippet and determine whether or not the underlying document is relevant to the topic. This is expected to take around 1-2 hours per package. Ideally, each package should be assessed by a different person, where feasible.
Additionally, each participating organisation will be required to complete one document assessment package. For each of the 35 topics, the assessor is shown the full text of each of the 20 documents, and must read through enough of each document to determine whether or not it is relevant to the topic. This is expected to take around 3-7 hours, depending on the assessor.
Only one set of document assessments needs to be completed by each participating group, although additional assessments are welcome. Please note, however, that if a given assessor is performing both snippet assessment and document assessment, the document assessment must be performed last, to avoid any bias caused by familiarity with the full documents.
Evaluation is based on comparing the snippet assessments with the consensus formed by all of the submitted document assessments, which is treated as the ground truth.
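Purely as an illustration (the official metrics are yet to be announced; see below), an agreement-style comparison for a single topic might look like the sketch below, where both inputs map a document id to a binary relevance judgement and unjudged documents are assumed non-relevant.

def snippet_agreement(snippet_judgements, ground_truth):
    # Fraction of truly relevant documents the snippet assessor also judged
    # relevant, and fraction of truly irrelevant documents the snippet
    # assessor also judged irrelevant.
    relevant = [d for d, rel in ground_truth.items() if rel]
    irrelevant = [d for d, rel in ground_truth.items() if not rel]
    found = sum(1 for d in relevant if snippet_judgements.get(d, False))
    rejected = sum(1 for d in irrelevant if not snippet_judgements.get(d, False))
    return (found / len(relevant) if relevant else 0.0,
            rejected / len(irrelevant) if irrelevant else 0.0)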
The final set of evaluation metrics to be used is still to be decided, but will include at least the following:
For the benefit of those who participated in the Snippet Retrieval track in 2011 or 2012, the following lists outline how this year's track differs from each of the previous years.