INEX 2012 Tweet Contextualization Track
Latest News
July, 27 2012: All results released at http://qa.termwatch.es/data/ (login information here).
Overview
Participants are solicited for the tweet contextualization INEX task at CLEF 2012. The use case of this new task is the following: given a new tweet, the system must provide some context about the subject of the tweet, in order to help the reader understand it. This context should take the form of a readable summary, not exceeding 500 words, composed of passages from a provided Wikipedia corpus.
The results of the evaluation campaign will be disseminated at the final workshop which will be organized in conjunction with the CLEF 2012 conference, 17–20 September in Rome, Italy.
As in QA@INEX 2011, the task to be performed by the participating groups is contextualizing tweets, i.e. answering questions of the form "what is this tweet about?" using a recent cleaned dump of Wikipedia. The general process involves:
- tweet analysis,
- passage and/or XML element retrieval,
- construction of the answer.
We regard as relevant those passages that both contain relevant information and contain as little non-relevant information as possible (i.e. the result is specific to the question). A toy sketch of these three steps is given below.
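Purely as an illustration of these three steps, and not as the track's baseline or any prescribed architecture, an end-to-end sketch might look as follows; the stop-word list and the scoring are deliberately naive, and a real system would retrieve from the Wikipedia corpus rather than from an in-memory list of passages:

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "this", "about", "what", "of", "and", "to"}

def analyze_tweet(tweet_text):
    """Step 1: reduce the tweet to content-bearing query terms."""
    tokens = re.findall(r"[a-z0-9]+", tweet_text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def retrieve_passages(query_terms, passages):
    """Step 2: rank candidate passages by naive term overlap."""
    query = Counter(query_terms)
    return sorted(passages,
                  key=lambda p: sum(query[w] for w in re.findall(r"[a-z0-9]+", p.lower())),
                  reverse=True)

def build_summary(ranked_passages, max_words=500):
    """Step 3: concatenate top passages without exceeding the word limit."""
    summary, used = [], 0
    for p in ranked_passages:
        n = len(p.split())
        if used + n > max_words:
            break
        summary.append(p)
        used += n
    return " ".join(summary)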
Test data
About 1000 tweets in English were collected by the organizers from Twitter®. They were selected from informative accounts (for example, @CNN, @TennisTweets, @PeopleMag, @science...), in order to avoid purely personal tweets that could not be contextualized. Information such as the user name, tags or URLs will be provided. These tweets are now available in two formats:
- a full JSON format with all tweet metadata:
{
  "created_at":"Fri, 03 Feb 2012 09:10:20 +0000",
  "from_user":"XXX",
  "from_user_id":XXX,
  "from_user_id_str":"XXX",
  "from_user_name":"XXX",
  "geo":null,
  "id":XXX,
  "id_str":"XXX",
  "iso_language_code":"en",
  "metadata":{"result_type":"recent"},
  "profile_image_url":"http://XXX",
  "profile_image_url_https":"https://XXX",
  "source":"",
  "text":"blahblahblah",
  "to_user":null,
  "to_user_id":null,
  "to_user_id_str":null,
  "to_user_name":null
}
- a two-column text format with only the tweet id and the tweet text, for participants who do not want to deal with the JSON format:
170167036520038400 "What links human rights, biodiversity and habitat loss, deforestation, pollution, pesticides, Rio +20 and and a sustainable future for all?"
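For instance, the two-column file can be read with a few lines of Python. This assumes, as in the sample above, one tweet per line with the id and the quoted text separated by a single space:

def read_two_column(path):
    """Parse the two-column tweet file: id, space, quoted text."""
    tweets = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            tweet_id, _, text = line.partition(" ")
            tweets[tweet_id] = text.strip('"')
    return tweets

# e.g. read_two_column("tweets.txt")["170167036520038400"]
# -> 'What links human rights, biodiversity and habitat loss, ...'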
A sample set of tweets is available here.
The complete test set is available from here for all INEX 2012 active participants.
Document collection
The document collection has been rebuilt based on a recent dump of the English Wikipedia from November 2011. Since we target a plain XML corpus allowing easy extraction of plain-text answers, we removed all notes and bibliographic references, which are difficult to handle, and kept only non-empty Wikipedia pages (pages having at least one section).
Resulting documents are made of a title (title), an abstract (a) and sections (s). Each section has a sub-title (h). The abstract and the sections are made of paragraphs (p), and each paragraph can contain entities (t) that refer to other Wikipedia pages. The resulting corpus therefore has a simple DTD.
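The DTD itself did not survive in this version of the page; the sketch below is inferred solely from the element description above, so the exact content models (e.g. s* versus s+) are assumptions:

<!ELEMENT page  (title, a, s*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT a     (p+)>
<!ELEMENT s     (h, p+)>
<!ELEMENT h     (#PCDATA)>
<!ELEMENT p     (#PCDATA | t)*>
<!ELEMENT t     (#PCDATA)>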
This corpus is available from here for all INEX 2012 active participants in two file formats:
- a single XML file (gzip compression)
- the same content split into 1000 folders, one file per page (tgz archive)
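Assuming the per-page files are well-formed XML following the description above, the plain text of a page can be recovered with the Python standard library alone; the element names (title, a, s, h, p, t) come from the description, everything else here is illustrative:

import xml.etree.ElementTree as ET

def page_text(xml_path):
    """Flatten one corpus page into plain text, one paragraph per line."""
    root = ET.parse(xml_path).getroot()
    paragraphs = []
    for p in root.iter("p"):
        # itertext() also flattens entity markup (<t>...</t>) into plain text
        paragraphs.append("".join(p.itertext()).strip())
    return "\n".join(paragraphs)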
Baseline system
A baseline XML-element retrieval system powered by Indri is available online with a standard CGI interface. The index covers all words (no stop list, no stemming) and all XML tags. Participants who do not wish to build their own index can use this one, either by downloading it or by querying it online (more information here, or contact eric.sanjuan@univ-avignon.fr).
You can also query this baseline system in batch mode using the Perl APIs here for the 2011 and 2012 document collections; see their synopses for more details.
Evaluation
The summaries will be evaluated according to:
- informativeness: the extent to which they overlap with relevant passages (number of passages, vocabulary and bi-grams included or missing). For each tweet, all passages from all participants will be merged and displayed to the assessor in alphabetical order. Each passage's informativeness will therefore be evaluated independently of the others, even within the same summary. Assessors will only have to provide a binary judgment on whether the passage is worth appearing in a summary on the topic. A naive illustration of this kind of n-gram overlap is sketched after this list.
- readability, assessed by evaluators and participants. Each participant will have to evaluate the readability of a pool of summaries through an online web interface. Each summary consists of a set of passages, and for each passage assessors will have to tick any of four kinds of check boxes:
- Syntax (S): tick the box if the passage contains a syntactic problem (bad segmentation for example),
- Anaphora (A): tick the box if the passage contains an unresolved anaphora,
- Redundancy (R): tick the box if the passage contains redundant information, i.e. information that has already been given in a previous passage,
- Trash (T): tick the box if the passage does not make any sense in its context (i.e. after reading the previous passages). Such passages must then be considered trashed, and the readability of the following passages must be assessed as if the trashed passages were not present.
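The official informativeness measure is computed by the organizers from the pooled judgments; the sketch below is only meant to illustrate the kind of vocabulary and bi-gram overlap mentioned above, and does not reproduce the track's actual metric:

import re

def ngrams(text, n):
    """Set of word n-grams of a text (lower-cased, punctuation stripped)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(summary, relevant_passages, n=1):
    """Share of the relevant pool's n-grams that the summary covers."""
    reference = set()
    for passage in relevant_passages:
        reference |= ngrams(passage, n)
    if not reference:
        return 0.0
    return len(ngrams(summary, n) & reference) / len(reference)

# vocabulary recall: overlap(summary, pool, 1)
# bi-gram recall:    overlap(summary, pool, 2)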
Result Submission
Participants can submit up to 3 automatic runs and up to 3 manual runs, i.e. no more than 6 runs in total. At least one run must be completely automatic: participants must use only the Wikipedia dump and possibly their own resources (even if the texts of tweets sometimes contain URLs, the Web must not be used as a resource). Manual runs are welcome as long as any human intervention is clearly documented.
A submitted summary will have the following format, one passage per line:
<tweet id> Q0 <file name> <rank> <score> <run tag> <passage raw text>
<tweet id> Q0 <file name> <rank> <score> <run tag> <passage raw text>
...
where:
- the first column is the tweet id (id field of the JSON format).
- the second column is currently unused and should always be Q0.
- the third column is the file name (without .xml) from which the result is retrieved, which is identical to the ID of the Wikipedia document.
- the fourth column is the rank at which the result is retrieved, and the fifth column shows the score (integer or floating point).
- the sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used.
- the seventh column is the raw text of the Wikipedia passage. The text is given without XML tags and without formatting characters (avoid "\n", "\r", "\l"). The resulting word sequence must appear in the file indicated in the third column. Here is an example of such output:
167999582578552 Q0 3005204 1 0.9999 I10UniXRun1 The Alfred Noble Prize is an award presented by the combined engineering societies of the United States, given each year to a person not over thirty-five for a paper published in one of the journals of the participating societies.
167999582578552 Q0 3005204 2 0.9998 I10UniXRun1 The prize was established in 1929 in honor of Alfred Noble, Past President of the American Society of Civil Engineers.
167999582578552 Q0 3005204 3 0.9997 I10UniXRun1 It has no connection to the Nobel Prize, although the two are often confused due to their similar spellings.
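A small helper along these lines can keep submissions well-formed; the whitespace collapsing enforces the "no formatting characters" rule, and the example values are taken from the output above:

import re

def run_line(tweet_id, file_name, rank, score, run_tag, passage_text):
    """Format one seven-column run line as specified above."""
    text = re.sub(r"\s+", " ", passage_text).strip()
    return f"{tweet_id} Q0 {file_name} {rank} {score} {run_tag} {text}"

print(run_line(167999582578552, 3005204, 1, 0.9999, "I10UniXRun1",
               "The Alfred Noble Prize is an award presented by ..."))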
Schedule
March, 1 2012: Online registration to the Labs opens.
March, 26 2012: Test set released here.
June, 1 to June, 15 2012: Run submissions here (use the QA track form).
June, 30 2012: Manual evaluation of readability by participants launched here (logins sent on July 3rd).
July, 15 2012: Results of informativeness evaluation by organizers released to participants here.
July, 25 2012: End of readability evaluation by participants.
July, 27 2012: Results of readability evaluation by participants released to participants here.
August, 17 2012: Submission of CLEF 2012 Working Notes papers.
August, 24 2012: Submission of CLEF 2012 Labs Overviews.
September, 17-20 2012: CLEF 2012 Conference.
Organizers
Patrice Bellot, LSIS - Aix-Marseille University
Josiane Mothe, IRIT, University of Toulouse
Véronique Moriceau, LIMSI-CNRS, University Paris-Sud
Eric SanJuan, LIA, University of Avignon
Xavier Tannier, LIMSI-CNRS, University Paris-Sud