Last news

2013/06/02: Readability evaluation results by organizers
2013/05/26: Informativity evaluation results by organizers
2013/05/01: Spanish subtrack : corpus and topics released.
2013/04/20: Run submission (English topics) deadline extension May 1st
2013/03/01: 2013 Topics available on the document repository
2013/02/18: Corpus released

Overview

About 340 M of tweets are written every day. However, 140 characters long messages are rarely self-content. The Tweet Contextualization aims at providing automatically information - a summary that explains the tweet. This requires combining multiple types of processing from information retrieval to multi-document summarization including entity linking.

Running since 2010, results show that best systems combine passage retrieval, sentence segmentation and scoring, named entity recognition, and text POS analysis. Anaphora detection and diversity content measure as well as sentence reordering can also help.

Evaluation considers both informativeness and readability. Informativeness is measured as a variant of absolute log-diff between term frequencies in the text reference and the proposed summary. Maximal informativeness scores obtained by participants from 19 different groups is between 10% and 14%. There is also room for improving readability.

In 2013, the goal of the task and the evaluation metrics remain unchanged but tweet diversity has been improved. More specially, a significant part of tweets with hashtags have been included in the tweet set. Hashtags are authors' annotation on key terms of their tweets. In the two past years, hashtags have been underused, unless they are core components of tweets.

The results of the evaluation campaign will be disseminated at the final workshop which will be organized in conjunction with the CLEF 2013 conference.

Use case

Like in 2012, the use case of this task is the following: given a new tweet, the system must provide some context about the subject of the tweet, in order to help the reader to understand it, i.e. answering questions of the form "what is this tweet about?" using a recent cleaned dump of the Wikipedia. The general process involves:

tweet analysis,
passage and/or XML elements retrieval,
construction of the answer.

This context should take the form of a readable summary, not exceeding 500 words, composed of passages from a provided Wikipedia corpus.

We regard as relevant summaries extracts from the wikipedia that both

contain relevant information but
contain as little non-relevant information as possible (the result is specific to the question).

Topics

598 tweets in English have been collected by the organizers from Twitter(R). They were selected among informative accounts (for example, @CNN, @TennisTweets, @PeopleMag, @science...), in order to avoid purely personal tweets that could not be contextualized. Information such as the user name, tags or URLs are provided in JSON format. These tweets are available in a single xml file with three fields: topic, title and txt. The topic field contains the tweet id as attribute, the title field shows the tweet text, for people not wanting to bother with JSON format and the txt field contains full JSON format with all tweet metadata:

"created_at":"Fri, 03 Feb 2012 09:10:20 +0000", "from_user":"XXX", "from_user_id":XXX, "from_user_id_str":"XXX", "from_user_name":"XXX", "geo":null, "id":XXX, "id_str":"XXX", "iso_language_code":"en", "metadata":{"result_type":"recent"}, "profile_image_url":"http://XXX", "profile_image_url_https":"https://XXX", "source":"<a href='http://XXX'>", "text":"blahblahblah", "to_user":null, "to_user_id":null, "to_user_id_str":null, "to_user_name":null

2013 topics are now available on the document repository (Login information available here).

Document collection

The document collection has been released at http://qa.termwatch.es/data. Password is available here for all INEX participants.

As for previous seditions, document collection has been rebuilt based on a recent dump of the English Wikipedia from November 2012. Since we target a plain XML corpus for an easy extraction of plain text answers, we removed all notes and bibliographic references that are difficult to handle and kept only non empty Wikipedia pages (pages having at least on section).

Resulting documents are made of a title (title), an abstract (a) and sections (s). Each section has a sub-title (h). Abstract end sections are made of paragraphs (p) and each paragraph can have entities (t) that refer to Wikipedia pages. Therefore the resulting corpus has this simple DTD:

<!ELEMENT xml (page)+> <!ELEMENT page (ID, title, a, s*)> <!ELEMENT ID (#PCDATA)> <!ELEMENT title (#PCDATA)> <!ELEMENT a (p+)> <!ELEMENT s (h, p+)> <!ATTLIST s o CDATA #REQUIRED> <!ELEMENT h (#PCDATA)> <!ELEMENT p (#PCDATA | t)*> <!ATTLIST p o CDATA #REQUIRED> <!ELEMENT t (#PCDATA)> <!ATTLIST t e CDATA #IMPLIED>

Tools

A baseline XML-element retrieval system powered by Indri is available for participants online with a standard CGI interface. The index covers all words (no stop list, no stemming) and all XML tags. Participants that do not wish to build their own index could use this one by downloading it or by using it online. A perl APIs is also available. More information here or contact eric.sanjuan@univ-avignon.fr.

Evaluation

The summaries will be evaluated according to:

informativeness: the way they overlap with relevant passages (number of them, vocabulary and bi-grams included or missing). For each tweet, all passages from all participants will be merged and displayed to the assessor in alphabetical order. Therefore, each passage informativeness will be evaluated independently from others, even in the same summary. Assessors will only have to provide a binary judgment on whether the passage is worth appearing in a summary on the topic, or not.
readability assessed by evaluators and participants. Each participant will have to evaluate readability for a pool of summaries on an online web interface. Each summary consists in a set of passages and for each passage, assessors will have to tick four kinds of check boxes:
- Syntax (S): tick the box if the passage contains a syntactic problem (bad segmentation for example),
- Anaphora (A): tick the box if the passage contains an unsolved anaphora,
- Redundancy (R): tick the box if the passage contains a redundant information, i.e. an information that has already been given in a previous passage,
- Trash (T): tick the box if the passage does not make any sense in its context (i.e. after reading the previous passages). These passages must then be considered as trashed, and readability of following passages must be assessed as if these passages were not present.

Result Submission

Participants can submit up to 3 runs. One run out of the 3 should be completely automatic: participants must use only the Wikipedia dump and possibly their own resources (even if the texts of tweets sometimes contain URLs, the Web must not be used as a resource). Manual runs are welcome whenever any human intervention is clearly documented.

A submitted summary must contain only passages from the document collection (November 2012 Wikipedia dump of articles in English) and will have the following format:

<tid> Q0 <file> <rank> <rsv> <run_id> <text of passage 1> <tid> Q0 <file> <rank> <rsv> <run_id> <text of passage 2> <tid> Q0 <file> <rank> <rsv> <run_id> <text of passage 3> ... where:

the first column is the tweet id (id field of the JSON format).
the second column currently unused and should always be Q0.
the third column is the file name (without .xml) from which a result is retrieved, which is identical to the of the Wikipedia document. Alternatively, the wikipedia page title can also be used.
the fourth column is the position of the passage in the summary, regardless its informativeness. This is important for readability evaluation where summary structure is considered.
the fifth column shows the score (integer or floating point) that should reflect the estimated informativeness of the passage. This score is used in the pooling process to build informativeness q-rels.
the sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used.
The seventh column is the raw text of the Wikipedia passage. Text is given without XML tags and without formatting characters (avoid "\n","\r","\l"). The resulting word sequence has to appear in the file indicated in the third field. Here is an example of such output:

167999582578552 Q0 3005204 1 0.9999 I10UniXRun1 The Alfred Noble Prize is an award presented by the combined engineering societies of the United States, given each year to a person not over thirty-five for a paper published in one of the journals of the participating societies. 167999582578552 Q0 3005204 3 0.9997 I10UniXRun1 It has no connection to the Nobel Prize, although the two are often confused due to their similar spellings. 167999582578552 Q0 3005204 2 0.9998 I10UniXRun1 The prize was established in 1929 in honor of Alfred Noble, Past President of the American Society of Civil Engineers.

Results

Informativity

Informativity has been evaluated based on three overlapping references:

prior set of relevant pages selected by organizers while building the 2013 topics (40 tweets, 380 passages, 11 523 tokens),
pool selection of most relevant passages from participant submissions for tweets selected by organizers (45 tweets, 1 760 passages, 58 035 tokens),
all relevant text merged together with an extra selection of relevant passages from a random pool of ten tweets (70 tweets, 2 378 passages, 77 043 tokens)

All references are available on task data repository with the evaluation toolkits. Runs are ranked by decreasing score of divergence with the final reference (All.skip).

Rank	Participant	Run	Manual	All.skip	All.bi	All.uni	Pool.skip	Pool.bi	Pool.uni	Prior.skip	Prior.bi	Prior.uni
1	199	256	y	0.8861	0.881	0.782	0.8752	0.87	0.7813	0.921	0.9134	0.7814
2	199	258	n	0.8943	0.8908	0.7939	0.8802	0.8766	0.7916	0.9288	0.9226	0.7985
3	182	275	n	0.8969	0.8924	0.8061	0.8789	0.8745	0.7941	0.9172	0.9106	0.7899
4	182	273	n	0.8973	0.8921	0.8004	0.8802	0.875	0.7923	0.9235	0.9155	0.7862
5	182	274	n	0.8974	0.8922	0.8009	0.8805	0.8751	0.7932	0.9234	0.9154	0.7872
6	199	257	y	0.8998	0.8969	0.7987	0.8916	0.8895	0.801	0.9341	0.928	0.7992
7	65	254	n	0.9242	0.9229	0.8331	0.9162	0.9159	0.8363	0.9473	0.943	0.8223
8	62	276	n	0.9301	0.927	0.8169	0.9333	0.9302	0.8285	0.9718	0.9678	0.8286
9	46	270	n	0.9397	0.9365	0.8481	0.9274	0.9246	0.8418	0.9686	0.9642	0.8529
10	46	267	n	0.9468	0.9444	0.8838	0.9389	0.9362	0.8802	0.9625	0.9596	0.883
11	46	271	n	0.95	0.9475	0.8569	0.9446	0.9421	0.8543	0.9793	0.9759	0.867
12	62	306	n	0.9575	0.954	0.8673	0.9545	0.9519	0.8739	0.9548	0.9486	0.8365
13	210	277	n	0.9662	0.9649	0.8995	0.9642	0.9626	0.9005	0.9792	0.9773	0.9102
14	129	261	n	0.967	0.9668	0.8639	0.9656	0.9659	0.8666	0.9888	0.9862	0.8687
15	129	259	n	0.9679	0.9673	0.8631	0.9668	0.9666	0.8658	0.989	0.987	0.8656
16	129	260	n	0.968	0.9677	0.8643	0.9686	0.9686	0.8679	0.9891	0.987	0.8672
17	128	262	n	0.9747	0.9734	0.8738	0.9736	0.9727	0.8775	0.9821	0.9788	0.8635
18	128	255	n	0.9783	0.9771	0.8817	0.9759	0.9748	0.8801	0.9938	0.9914	0.8941
19	138	265	n	0.9789	0.9781	0.8793	0.9751	0.9749	0.8821	0.9927	0.9904	0.8845
20	138	263	n	0.9793	0.9785	0.8796	0.9759	0.9754	0.8843	0.9925	0.9899	0.8856
21	138	264	n	0.9798	0.9791	0.879	0.9772	0.9769	0.8821	0.9926	0.9902	0.8827
22	275	266	n	0.9835	0.9824	0.9059	0.9865	0.9859	0.9132	0.9903	0.9877	0.8952
23	180	269	n	0.9999	0.9999	0.9965	0.9999	0.9999	0.9972	1	0.9999	0.9962
24	180	269	y	0.9999	0.9999	0.9981	0.9998	0.9998	0.9982	1	1	0.9981

Readability

Readability has been evaluated by organizers over the ten tweets having the largest text references (t-rels). For these tweets, summaries are expected to have almost 500 words since the reference is much larger. For each participant summary, we have then check the number of words over 500 in passages that are:

Relevant (T) i.e. clearly related to the tweet,
Sound (A) i.e. no issues about resolving references to earlier or later items in the discourse.
Non redundant (R) with previous passages.
Syntactically (S) correct.

Non relevant passages have also been considered non sound, redundant and syntactically incorrect.

Ranking: Runs are ranked according to mean average scores per summary over Soundness, Non redundancy and Syntactically correctness among Relevant passages

Rank	Mean Average	Relevancy (T)	Non redundancy (R)	Soundness (A)	Syntax (S)	Run
1	72.44%	76.64%	67.30%	74.52%	75.50%	275
2	72.13%	74.24%	71.98%	70.78%	73.62%	256
3	71.71%	74.66%	68.84%	71.78%	74.50%	274
4	71.35%	75.52%	67.88%	71.20%	74.96%	273
5	69.54%	72.18%	65.48%	70.96%	72.18%	257
6	67.46%	73.30%	61.52%	68.94%	71.92%	254
7	65.97%	68.36%	64.52%	66.04%	67.34%	258
8	49.72%	52.08%	45.84%	51.24%	52.08%	276
9	46.72%	50.54%	40.90%	49.56%	49.70%	267
10	44.17%	46.84%	41.20%	45.30%	46.00%	270
11	38.76%	41.16%	35.38%	39.74%	41.16%	271
12	38.56%	41.26%	33.16%	41.26%	41.26%	264
13	38.21%	38.64%	37.36%	38.64%	38.64%	260
14	37.92%	39.46%	36.46%	37.84%	39.46%	265
15	37.70%	38.78%	35.54%	38.78%	38.78%	259
16	36.59%	38.98%	31.82%	38.98%	38.98%	255
17	35.99%	36.42%	35.14%	36.42%	36.42%	261
18	32.75%	34.48%	31.86%	31.92%	34.48%	263
19	32.35%	33.34%	30.38%	33.34%	33.34%	262
20	25.64%	25.92%	25.08%	25.92%	25.92%	266
21	20.00%	20.00%	20.00%	20.00%	20.00%	277
22	00.04%	00.04%	00.04%	00.04%	00.04%	269

Pilot subtask in Spanish

A extra set of topics (only tweet texts) has been released in Spanish to try a different language and a slightly different task. Topics in Spanish are opinionated personal tweets about music bands, cars and politics. They were manually selected from CLEF RepLab 2013 test set among those without external url and with at least 15 words. Contextualization should help the reader to also understand the opinion polarity, allusions and humor.

Spanish corpus is available here and topics are here.

Special settings:

Runs in Spanish are due on may 31 at the very latest.
Each team can submit at most three extra runs in Spanish.
You can use your own copy of the spanish wikipedia instead of the official corpus.
Participants that do not work on Spanish language are also welcomed to participate, a separate ranking for systems that do not use Spanish language resources will be published.

Schedule

Corpus released: 18 February
Topic and tasks guidelines released: 1 March
Run submission deadline English topics: 1 May (Extended deadline)
Evaluation results sent out: 21 May
Run submission deadline Spanish topics: 31 May (firm)
Participant papers (CLEF proceedings) due: 15 June. Use easy chair and select "INEX - Focused retrieval".
Overview paper due: 30 June

Organizers (contact : contextweet@limsi.fr)

Patrice Bellot, LSIS - Aix-Marseille University
Josiane Mothe, IRIT, University of Toulouse
Véronique Moriceau, LIMSI-CNRS, University Paris-Sud
Eric SanJuan, LIA, University of Avignon
Xavier Tannier, LIMSI-CNRS, University Paris-Sud

Imprint | Data protection | Contact someone about INEX