INEX 2011 Question Answering Track

Overview

The QA task to be performed by the participating groups of INEX 2011 is tweet contextualization, i.e. answering questions of the form "what is this tweet about?" using a recent cleaned dump of Wikipedia. The general process involves analyzing the tweet, retrieving relevant passages or XML elements from the dump, and aggregating them into an answer.

We regard as relevant those passage segments that both contain information pertinent to the tweet and contain as little non-relevant information as possible.

For evaluation purposes, we require that the answer uses ONLY elements or passages previously extracted from the document collection. The correctness of answers is established by the participants exclusively on the basis of the supporting passages and documents.

Participants are required to submit at least one fully automatic run; manual runs are nevertheless strongly encouraged. Runs that involve human intervention at any stage of the process are considered manual. Such interventions must be clearly stated and documented.

Task description

Motivation for the Task

The underlying scenario is a user who receives a tweet containing a URL on a small terminal such as a phone; the system must provide this user with synthetic contextual information gathered from a local XML dump of Wikipedia. The answer needs to be built by aggregating relevant XML elements or passages.

The aggregated answers will be evaluated according to how they overlap with relevant passages (how many of them, and which vocabulary and bi-grams are included or missing) and according to the "last point of interest" marked by evaluators. By combining these measures, we expect to take into account both the informative content and the readability of the aggregated answers.
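
As a rough illustration of what the overlap-based informativeness measures count, here is a minimal sketch in Python. It is not the official evaluation tool; the whitespace tokenization and the function names are assumptions made for the example.

def tokens(text):
    # Naive lowercase whitespace tokenization (a simplifying assumption).
    return text.lower().split()

def bigrams(toks):
    return set(zip(toks, toks[1:]))

def coverage(reference, summary):
    # Fraction of the reference vocabulary and bi-grams found in the summary.
    ref, summ = tokens(reference), tokens(summary)
    ref_vocab, summ_vocab = set(ref), set(summ)
    ref_bi, summ_bi = bigrams(ref), bigrams(summ)
    vocab_cov = len(ref_vocab & summ_vocab) / len(ref_vocab) if ref_vocab else 0.0
    bigram_cov = len(ref_bi & summ_bi) / len(ref_bi) if ref_bi else 0.0
    return vocab_cov, bigram_cov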

Results to Return

A short summary of fewer than 500 words, made exclusively of aggregated passages extracted from the Wikipedia corpus.
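
For illustration, a minimal sketch of enforcing this constraint, assuming passages have already been ranked and extracted upstream (the names below are illustrative):

def aggregate(passages, max_words=500):
    # Concatenate ranked passages, stopping before the word cap is exceeded.
    summary, count = [], 0
    for p in passages:
        n = len(p.split())
        if count + n > max_words:
            break
        summary.append(p)
        count += n
    return " ".join(summary)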

Extractive automatic summarization systems are strongly encouraged to participate.

Relevance assessments

Each assessor will have to evaluate a pool of answers of at most 500 words each. These answers will be agglomerations of Wikipedia passages.

Evaluators will have to mark:

  1. The "last point of interest", i.e. the first point after which the text becomes out of context because of:
  2. all relevant passages in the text, even if they are redundant.

Systems will be ranked according to both the informativeness of their answers (overlap with the relevant passages) and their readability (position of the last point of interest).

Document collection

The document collection has been rebuilt from a recent dump of the English Wikipedia from April 2011 (we left a copy of this dump here). Since we target a plain XML corpus allowing easy extraction of plain-text answers, we removed all notes and bibliographic references, which are difficult to handle, and kept only the 3,217,015 non-empty Wikipedia pages (pages having at least one section).

The resulting documents consist of a title (title), an abstract (a) and sections (s). Each section has a sub-title (h). Abstracts and sections are made of paragraphs (p), and each paragraph can contain entities (t) that refer to Wikipedia pages. The resulting corpus therefore has this simple DTD:

<!ELEMENT xml (page)+>
<!ELEMENT page (ID, title, a, s*)>
<!ELEMENT ID (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT a (p+)>
<!ELEMENT s (h, p+)>
<!ATTLIST s o CDATA #REQUIRED>
<!ELEMENT h (#PCDATA)>
<!ELEMENT p (#PCDATA | t)*>
<!ATTLIST p o CDATA #REQUIRED>
<!ELEMENT t (#PCDATA)>
<!ATTLIST t e CDATA #IMPLIED>

This corpus is available in two file formats (2.7 GB each).
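
To make this structure concrete, here is a hedged sketch of iterating over the corpus with Python's standard library, following the DTD above (the file name is an assumption):

import xml.etree.ElementTree as ET

def iter_pages(path="corpus.xml"):  # file name is illustrative
    # Stream the corpus page by page; it is too large to load at once.
    for _, elem in ET.iterparse(path):
        if elem.tag != "page":
            continue
        pid = elem.findtext("ID")
        title = elem.findtext("title")
        # The abstract <a> holds <p> paragraphs; itertext() flattens nested <t> entities.
        abstract = ["".join(p.itertext()) for p in elem.find("a").findall("p")]
        sections = [(s.findtext("h"),
                     ["".join(p.itertext()) for p in s.findall("p")])
                    for s in elem.findall("s")]
        yield pid, title, abstract, sections
        elem.clear()  # free memory as we go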

A complementary list of non-Wikipedia entities extracted from the pages using LIMSI tools is also available here.

Baseline system

A baseline XML-element retrieval system powered by Indri is available online with a standard CGI interface. The index covers all words (no stop list, no stemming) and all XML tags. Participants who do not wish to build their own index can use this one, either by downloading it or by querying it online (more information here, or contact eric.sanjuan@univ-avignon.fr).

You can also query this baseline system in batch mode using this Perl program. It takes input files such as this one; see its synopsis for more details.

Topics

2011 and on ...

The 132 topics selected for 2011 are available here. Each topic consists of the title and the first sentence of a New York Times article that was tweeted at least two months after the Wikipedia dump we use. For each topic, we manually checked that related information exists in the document collection. We can provide the content of the articles to participants on an individual basis, but the objective of the task remains to contextualize only the tweeted information.

Past topics

The 2009-2010 topics are also available here, and anonymized best runs from the 2010 participants are available here. These runs can be used to tune new systems.

Result Submission

Fact sheet

Format for results

It is a variant of the familiar TREC format with additional fields:
<qid> Q0 <file> <rank> <rsv> <run_id> <column_7> <column_8> <column_9>

Here, the first six fields follow the usual TREC conventions: <qid> is the topic identifier, Q0 is a literal constant, <file> is the identifier of the Wikipedia article the passage comes from, <rank> is the passage rank, <rsv> is the retrieval status value (score), and <run_id> identifies the run. The remaining columns depend on the chosen format (text passage or offset).

Textual content:

Raw text is given without XML tags and without formatting characters (avoid "\n", "\r", "\l"). The resulting word sequence has to appear in the file indicated in the third field. Here is an example of such output:

1 Q0 3005204 1 0.9999 I10UniXRun1 The Alfred Noble Prize is an award presented by the combined engineering societies of the United States, given each year to a person not over thirty-five for a paper published in one of the journals of the participating societies. 
1 Q0 3005204 2 0.9998 I10UniXRun1 The prize was established in 1929 in honor of Alfred Noble, Past President of the American Society of Civil Engineers.
1 Q0 3005204 3 0.9997 I10UniXRun1 It has no connection to the Nobel Prize , although the two are often confused due to their similar spellings.
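
A minimal sketch of emitting such lines, assuming the passages have already been extracted (the helper name and the field values below are illustrative):

def run_line(qid, file_id, rank, rsv, run_id, passage):
    # Collapse whitespace so the passage stays on a single line,
    # free of "\n" and "\r" formatting characters.
    clean = " ".join(passage.split())
    return f"{qid} Q0 {file_id} {rank} {rsv:.4f} {run_id} {clean}"

print(run_line(1, 3005204, 1, 0.9999, "I10UniXRun1",
               "The Alfred Noble Prize is an award presented by "
               "the combined engineering societies of the United States."))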

File Offset Length format (FOL)

In this format, passages are given as an offset and a length, counted in characters with respect to the textual content (ignoring all tags) of the XML file. File offsets start counting at 0 (zero). The previous example would read as follows in FOL format:

1 Q0 3005204 1 0.9999 I10UniXRun1 256 230
1 Q0 3005204 2 0.9998 I10UniXRun1 488 118
1 Q0 3005204 3 0.9997 I10UniXRun1 609 109

The results are from article 3005204. The first passage starts at character offset 256 (that is, 256 characters beyond the first character) and has a length of 230 characters.
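
A hedged sketch of computing these two fields for a passage, assuming one article per XML file; exact counting conventions (e.g., whitespace between tags) should be checked against the official definition:

import xml.etree.ElementTree as ET

def fol_fields(xml_path, passage):
    # Textual content of the file with all tags ignored, offsets counted from 0.
    text = "".join(ET.parse(xml_path).getroot().itertext())
    offset = text.find(passage)
    if offset < 0:
        raise ValueError("passage does not occur verbatim in the file")
    return offset, len(passage)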

Schedule

18/Oct/2011: Release of the final set of questions, available here.
18/Nov/2011 28/Nov/2011 (extended): Submission deadline for results.
21/Nov/2011 10/Dec/2011: Release of QA semi-automatic evaluation results by the organizers.
10/Dec/2011: Release of manual evaluation by participants.

Results

The evaluation software is here.

Organizers

Patrice Bellot
LSIS - Aix-Marseille University

Josiane Mothe
IRIT, University of Toulouse

Véronique Moriceau
LIMSI-CNRS, University Paris-Sud 11

Eric SanJuan
LIA, University of Avignon
eric.sanjuan@univ-avignon.fr

Xavier Tannier
LIMSI-CNRS, University Paris-Sud 11
