INEX 2013 Tweet Contextualization Track
2014/05/23: Release of informativeness evaluation postponed to Monday 25 May
2014/05/15: Late runs will be accepted until 20 May
2014/05/09: Run submission deadline extended to 16 May
About 340 million tweets are written every day. However, messages of 140 characters are rarely self-contained. The Tweet Contextualization task aims at automatically providing information, in the form of a summary, that explains the tweet. This requires combining multiple types of processing, from information retrieval to multi-document summarization, including entity linking.
The track has been running since 2010; results show that the best systems combine passage retrieval, sentence segmentation and scoring, named entity recognition, and POS analysis. Anaphora detection, content diversity measures and sentence reordering can also help.
Evaluation considers both informativeness and readability. Informativeness is measured as a variant of the absolute log-difference between term frequencies in the reference text and in the proposed summary.
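To make this concrete, here is a minimal Python sketch of such a log-difference score. It is not the official track implementation (which also considers bi-grams and normalisation choices not detailed here); it only illustrates the idea of weighting each reference term by its relative frequency and penalising the absolute difference of log frequencies between reference and summary.

import math
from collections import Counter

def log_diff_dissimilarity(reference_tokens, summary_tokens):
    # Illustrative only: lower values mean the summary better matches
    # the reference term distribution (i.e. is more informative).
    ref = Counter(reference_tokens)
    summ = Counter(summary_tokens)
    total_ref = sum(ref.values())
    score = 0.0
    for term, f_ref in ref.items():
        f_sum = summ.get(term, 0)
        # weight by relative frequency in the reference, then take the
        # absolute difference of (smoothed) log term frequencies
        score += (f_ref / total_ref) * abs(math.log(1 + f_ref) - math.log(1 + f_sum))
    return score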
Use case in 2014
The task in 2014 is a slight variant of previous ones and is complementary to CLEF RepLab. Using the same cleaned dump of Wikipedia as in 2013, the new use case is the following: given a tweet AND a related entity, the system must provide some context about the subject of the tweet from the perspective of the entity, in order to help the reader understand it, i.e. answering questions of the form
"why does this tweet concern the entity? Should it be an alert?". As in previous editions, the general process involves:
- tweet analysis,
- passage and/or XML element retrieval,
- construction of the answer.
This context should take the form of a readable summary, not exceeding 500 words, composed of passages from a provided Wikipedia corpus.
We regard as relevant summaries those extracts from Wikipedia that both
- contain relevant information, and
- contain as little non-relevant information as possible (the result is specific to the relationship between the tweet and the entity).
Topics for 2014
A small set of 240 tweets in English has been selected by the organizers from CLEF RepLab 2013, together with their related entity. These tweets have at least 80 characters and do not contain URLs, in order to focus on content analysis.
RepLab provides several annotations for tweets; we selected three of them: the category (4 distinct values), an entity name from Wikipedia (64 distinct) and a manual topic label (235 distinct). The entity name should be used as an entry point into Wikipedia or DBpedia and gives the contextual perspective. The usefulness of topic labels for this automatic task is still an open question because of their variety.
2014 topics are now available on the document repository, in XML or tab-separated text format.
(Login information available here).
Document collection for 2013 and 2014
Since the tweets are from 2013, the document collection is the same as in 2013 and has been released at http://qa.termwatch.es/data. The password is available here for all INEX participants.
This document collection was rebuilt in 2013 based on a dump of the English Wikipedia from November 2012. Since we target a plain XML corpus for easy extraction of plain-text answers, we removed all notes and bibliographic references, which are difficult to handle, and kept only non-empty Wikipedia pages (pages having at least one section).
Resulting documents are made of a title (title), an abstract (a) and sections (s). Each section has a sub-title (h). Abstracts and sections are made of paragraphs (p), and each paragraph can contain entities (t) that refer to Wikipedia pages. The resulting corpus therefore has the following simple DTD (a minimal parsing sketch is given after it):
<!ELEMENT xml (page)+>
<!ELEMENT page (ID, title, a, s*)>
<!ELEMENT ID (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT a (p+)>
<!ELEMENT s (h, p+)>
<!ATTLIST s o CDATA #REQUIRED>
<!ELEMENT h (#PCDATA)>
<!ELEMENT p (#PCDATA | t)*>
<!ATTLIST p o CDATA #REQUIRED>
<!ELEMENT t (#PCDATA)>
<!ATTLIST t e CDATA #IMPLIED>
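Assuming files that follow this DTD, the snippet below shows one possible way to read a corpus file with Python's standard library and recover, for each page, its ID, title, abstract paragraphs and sections. The file layout (how many pages per file, file naming) is an assumption made here for illustration; only the element names come from the DTD above.

import xml.etree.ElementTree as ET

def load_pages(path):
    # Yield (page_id, title, abstract_paragraphs, sections) for each <page>.
    # Entity markup (<t>) is flattened into plain text with itertext().
    tree = ET.parse(path)
    for page in tree.getroot().iter("page"):
        page_id = page.findtext("ID")
        title = page.findtext("title")
        abstract = ["".join(p.itertext()) for p in page.find("a").findall("p")]
        sections = []
        for s in page.findall("s"):
            heading = s.findtext("h")
            paragraphs = ["".join(p.itertext()) for p in s.findall("p")]
            sections.append((heading, paragraphs))
        yield page_id, title, abstract, sections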
A baseline XML-element retrieval system powered by Indri is available online to participants through a standard CGI interface. The index covers all words (no stop list, no stemming) and all XML tags. Participants who do not wish to build their own index can use this one, either by downloading it or by querying it online. A Perl API is also available.
More information here or contact email@example.com.
The summaries will be evaluated according to:
- informativeness: the extent to which they overlap with relevant passages (number of passages, vocabulary and bi-grams included or missing). For each tweet, all passages from all participants will be merged and displayed to the assessor in alphabetical order. Therefore, the informativeness of each passage will be evaluated independently of the others, even within the same summary. Assessors will only have to provide a binary judgment on whether the passage is worth appearing in a summary on the topic or not.
- readability: assessed by evaluators and participants. Each participant will have to evaluate the readability of a pool of summaries on an online web interface. Each summary consists of a set of passages and, for each passage, assessors will have to tick four kinds of check boxes (a sketch of a possible aggregation of these annotations follows the list):
- Syntax (S): tick the box if the passage contains a syntactic problem (bad segmentation for example),
- Anaphora (A): tick the box if the passage contains an unsolved anaphora,
- Redundancy (R): tick the box if the passage contains redundant information, i.e. information that has already been given in a previous passage,
- Trash (T): tick the box if the passage does not make any sense in its context (i.e. after reading the previous passages). Such passages must then be considered as trashed, and the readability of the following passages must be assessed as if the trashed passages were not present.
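As a purely illustrative sketch, the function below aggregates such per-passage annotations into simple counts. The dictionary representation of the tick boxes and the counting itself are assumptions made here for clarity; the official track readability score may be computed differently (for instance, weighting passages by their length).

def readability_counts(annotations):
    # annotations: one dict per passage, in summary order, e.g.
    # {"S": False, "A": True, "R": False, "T": False}
    # Returns per-problem counts and the number of problem-free passages.
    counts = {"S": 0, "A": 0, "R": 0, "T": 0}
    clean = 0
    for passage in annotations:
        flagged = False
        for kind in counts:
            if passage.get(kind, False):
                counts[kind] += 1
                flagged = True
        if not flagged:
            clean += 1
    return counts, clean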
Participants can submit up to 3 runs. One of the 3 runs should be completely automatic. Manual runs are welcome as long as any human intervention is clearly documented.
A submitted summary must contain only passages from the document collection (November 2012 Wikipedia dump of articles in English) and will have the following format:
167999582578552 Q0 3005204 1 0.9999 I10UniXRun1 The Alfred Noble Prize is an award presented by the combined engineering societies of the United States, given each year to a person not over thirty-five for a paper published in one of the journals of the participating societies.
167999582578552 Q0 3005204 3 0.9997 I10UniXRun1 It has no connection to the Nobel Prize, although the two are often confused due to their similar spellings.
167999582578552 Q0 3005204 2 0.9998 I10UniXRun1 The prize was established in 1929 in honor of Alfred Noble, Past President of the American Society of Civil Engineers.
- the first column is the tweet id (id field of the JSON format).
- the second column is currently unused and should always be Q0.
- the third column is the name of the file (without .xml) from which the passage is retrieved, which is identical to the ID of the Wikipedia document. Alternatively, the Wikipedia page title can also be used.
- the fourth column is the position of the passage in the summary, regardless of its informativeness. This is important for the readability evaluation, where summary structure is considered.
- the fifth column shows the score (integer or floating point) that should reflect the estimated informativeness of the passage. This score is used in the pooling process to build informativeness q-rels.
- the sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used.
- the seventh column is the raw text of the Wikipedia passage. The text is given without XML tags and without formatting characters (avoid "\n", "\r", "\l"). The resulting word sequence has to appear in the file indicated in the third field. An example of such output is shown above; a small formatting sketch follows this list.
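Below is a hedged sketch of how a run line in this format could be produced. The function name and the whitespace handling are illustrative choices made here; the column order and the example values come from the description and the example above.

def format_run_line(tweet_id, doc_id, rank, score, run_tag, passage):
    # Columns: tweet id, Q0, source file name (or page title), position of
    # the passage in the summary, informativeness score, run tag, raw text.
    # The passage must stay on a single line and must occur verbatim in the
    # referenced Wikipedia file.
    passage = " ".join(passage.split())  # collapse any newlines or tabs
    return f"{tweet_id} Q0 {doc_id} {rank} {score:.4f} {run_tag} {passage}"

print(format_run_line(167999582578552, 3005204, 1, 0.9999, "I10UniXRun1",
                      "The Alfred Noble Prize is an award presented by ..."))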
- Topics and task guidelines released: 1 April
- Run submission deadline (English topics): extended from 10 May (initial) to 20 May
- Informativeness evaluation results sent out: postponed to 25 May (initially 21 May)
- Readability Evaluation results sent out: 1 June
- Participant papers (CLEF proceedings) due: 7 June.
- Overview paper due: 30 June
Patrice Bellot, LSIS - Aix-Marseille University
Josiane Mothe, IRIT, University of Toulouse
Véronique Moriceau, LIMSI-CNRS, University Paris-Sud
Eric SanJuan, LIA, University of Avignon
Xavier Tannier, LIMSI-CNRS, University Paris-Sud