The goal of the Linked Data track is to investigate retrieval techniques over a combination of textual and highly structured data, where RDF properties carry additional key information about semantic relations among data objects that cannot be captured by keywords alone. We intend to investigate if and how structural information can be exploited to improve ad-hoc retrieval performance, and how it can be used in combination with structured queries to help users navigate or explore large result sets via ad-hoc queries, or to address Jeopardy-style natural-language questions that are translated into a SPARQL-based query format.
The Linked Data track thus aims to close the gap between IR-style keyword search and Semantic-Web-style reasoning techniques. Our goal is to bring together different communities and to foster research at the intersection of Information Retrieval, Databases, and the Semantic Web.
The Linked Data track will use a subset of DBpedia and YAGO2s together with a recent dump of Wikipedia core articles. In addition to these reference collections, we will also provide two supplementary collections: (1) to lower the participation threshold for participants with IR engines, a fusion of XML-ified Wikipedia articles with RDF properties from both DBpedia and YAGO2s, and (2) to lower the participation threshold for participants with RDF engines, a dump of the textual content of Wikipedia articles in RDF. Participants are explicitly encouraged to make use of more RDF facts available from DBpedia and YAGO2s, in particular for processing the reasoning-related Jeopardy topics.
For INEX 2013, we will explore two retrieval tasks that continue from INEX 2012: the Ad-hoc task and the Jeopardy task.
The final set of 144 Ad-hoc task search topics for the INEX 2013 Linked Data track has been released and is now available for download.
You may want to consider the topics from the 2012 LOD track together with the qrels for training.
Run submissions for the Ad-hoc task must be in the familiar TREC format:
<qid> Q0 <file> <rank> <rsv> <run_id>
Here, <qid> is the ID of the topic, Q0 is a required constant, <file> is the Wikipedia page ID of the result, <rank> is the rank of the result, <rsv> is the retrieval score (relevance status value), and <run_id> is a unique identifier for the run.
An example submission is:
2013001 Q0 12 1 0.9999 2013UniXRun1
2013001 Q0 997 2 0.9998 2013UniXRun1
2013001 Q0 9989 3 0.9997 2013UniXRun1
It contains three results for topic 2013001. The first result is the Wikipedia page with ID "12", the second is the page with ID "997", and the third is the page with ID "9989". Mappings between DBpedia URIs and Wikipedia page IDs are available from the DBpedia page-IDs dataset, which is part of the reference collection. Please restrict your results to the DBpedia URIs provided in the list of valid DBpedia URIs.
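For illustration only, the following minimal Python sketch writes an Ad-hoc run file in the format described above. The per-topic ranking, scores, output file name, and run identifier are hypothetical placeholders taken from the example, not prescribed values.

# Minimal sketch: writing an Ad-hoc run file in the TREC format described above.
# The ranked results and the run identifier are illustrative placeholders.
results = {
    "2013001": [("12", 0.9999), ("997", 0.9998), ("9989", 0.9997)],
}
run_id = "2013UniXRun1"  # assumed run identifier

with open("adhoc_run.txt", "w") as out:
    for qid, ranked in results.items():
        for rank, (page_id, score) in enumerate(ranked, start=1):
            out.write(f"{qid} Q0 {page_id} {rank} {score} {run_id}\n")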
The Jeopardy task investigates retrieval techniques over a set of natural-language Jeopardy clues, which have been manually translated into SPARQL query patterns and enhanced with keyword-based filter conditions.
The final set of 105 Jeopardy task search topics for the INEX 2013 Linked Data track has been released and is now available for download.
You may want to consider the topics from 2012 together with the qrels for training.
We illustrate the topic format with the example of topic 2012374 from the set of 2012 topics. It is represented in XML format as follows:
<topic id="2012374" category="Politics">
<jeopardy_clue>Which German politician is a successor of another politician who stepped down before his or her actual term was over, and what is the name of their political ancestor?</jeopardy_clue>
<keyword_title>German politicians successor other stepped down before actual term name ancestor</keyword_title>
<sparql_ft>
SELECT ?s ?s1 WHERE {
?s rdf:type <http://dbpedia.org/class/yago/GermanPoliticians> .
?s1 <http://dbpedia.org/property/successor> ?s .
FILTER FTContains (?s, "stepped down early") .
}
</sparql_ft>
</topic>
The <jeopardy_clue> element contains the original Jeopardy clue as a natural-language sentence; the <keyword_title> element contains a set of keywords that has been manually extracted from this clue and is reused as part of the Ad-hoc task; and the <sparql_ft> element contains the result of a manual conversion of the natural-language sentence into a corresponding SPARQL query. The category attribute of the <topic> element may be used as an additional hint for disambiguating the query.
In the above query, ?s is a variable for an entity of type http://dbpedia.org/class/yago/GermanPoliticians (first triple pattern) that stands in a http://dbpedia.org/property/successor relationship with another entity ?s1 (second triple pattern). The FTContains filter condition restricts ?s to entities whose corresponding Wikipedia articles are associated with the keywords "stepped down early".
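For illustration only, the sketch below shows one possible (non-prescribed) way to evaluate such a topic: the structural triple patterns are run against a local RDF store (here rdflib over an assumed DBpedia subset file), and the FTContains condition is approximated by a plain keyword filter over the article text of candidate entities. The file name and the article_text() helper are hypothetical placeholders, not part of the track infrastructure.

# One possible (non-prescribed) way to process topic 2012374: evaluate the
# structural part with rdflib and approximate FTContains by a keyword filter.
from rdflib import Graph

g = Graph()
g.parse("dbpedia_subset.nt", format="nt")  # assumed local RDF dump

structural_query = """
SELECT ?s ?s1 WHERE {
  ?s a <http://dbpedia.org/class/yago/GermanPoliticians> .
  ?s1 <http://dbpedia.org/property/successor> ?s .
}
"""

def article_text(entity_uri):
    # Hypothetical helper: would look up the Wikipedia article text of the
    # entity in a local full-text index of the Wikipedia-LOD collection.
    return ""

keywords = ["stepped", "down", "early"]
candidates = []
for s, s1 in g.query(structural_query):
    text = article_text(str(s)).lower()
    # naive stand-in for FILTER FTContains(?s, "stepped down early")
    if all(k in text for k in keywords):
        candidates.append((str(s), str(s1)))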
Since this particular variant of SPARQL with full-text filter conditions cannot be run against a standard RDF collection (such as DBpedia or YAGO) alone, participants are encouraged to develop individual solutions that index both the RDF and textual contents of the Wikipedia-LOD collection in order to process these queries.
Each participating group may submit up to 3 runs. Each run can contain a maximum of 1000 results per topic, ordered by decreasing relevance (although we expect most topics to have just one or a few entities or sets of entities as targets). The results of one run must be contained in one submission file (i.e., up to 3 files can be submitted in total). For relevance assessment and evaluation, we require submission files to be in the familiar TREC format, where each query result is denoted by one row of target entities (identified by their Wikipedia page IDs, which are available in the reference collection through <http://dbpedia.org/ontology/wikiPageID> properties). Each row of target entities must reflect the order of the query variables as specified in the SELECT clause of the Jeopardy topic. If the SELECT clause contains more than one query variable, the row should consist of a comma- or semicolon-separated list of target entity IDs. The submission format is:
<qid> Q0 <file> <rank> <rsv> <run_id>
Here, <qid> is the ID of the topic, Q0 is a required constant, <file> is the comma- or semicolon-separated list of Wikipedia page IDs of the target entities (in the order of the SELECT variables), <rank> is the rank of the result, <rsv> is the retrieval score (relevance status value), and <run_id> is a unique identifier for the run.
An example submission is:
2012374 Q0 12;24 1 0.9999 2012UniXRun1
2012374 Q0 997;998 2 0.9998 2012UniXRun1
2012374 Q0 9989;12345 3 0.9997 2012UniXRun1
These are three results for topic 2012374; this topic requests two entities per result since it has two variables in the SELECT clause. The first result is the entity pair (denoted by their Wikipedia page IDs) with the IDs "12" and "24", the second is the pair with the IDs "997" and "998", and the third is the pair with the IDs "9989" and "12345". Mappings between DBpedia URIs and Wikipedia page IDs are available from the DBpedia-to-Wikipedia page-links dataset, which is part of the reference collection. Please restrict your results to the DBpedia URIs provided in the list of valid DBpedia URIs.
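For illustration only, a minimal Python sketch of how such multi-entity result rows could be written; the entity pairs, scores, output file name, and run identifier are hypothetical placeholders.

# Minimal sketch: writing a Jeopardy run with several target entities per row.
# Entity IDs are joined with ';' in the order of the SELECT variables.
results = {
    "2012374": [(("12", "24"), 0.9999), (("997", "998"), 0.9998), (("9989", "12345"), 0.9997)],
}
run_id = "2012UniXRun1"  # assumed run identifier

with open("jeopardy_run.txt", "w") as out:
    for qid, ranked in results.items():
        for rank, (entities, score) in enumerate(ranked, start=1):
            row = ";".join(entities)
            out.write(f"{qid} Q0 {row} {rank} {score} {run_id}\n")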
The effectiveness of the submitted retrieval results will be evaluated using classical IR metrics such as MAP, P@5, P@10, NDCG, and possibly others. All run submissions will be pooled, and we will set up a crowdsourcing interface on Amazon Mechanical Turk for the evaluation of these pools. Jeopardy runs will be evaluated in an entity-centric manner, i.e., only the one (or few) Wikipedia page IDs that point to the target entities demanded by a query will be considered relevant. Wikipedia pages that merely mention the actual target entities will not be considered relevant.
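For orientation only, a minimal Python sketch of how P@k and average precision (the per-topic component of MAP) can be computed from a ranked result list and a set of relevant IDs; this is not the official evaluation code of the track, and the example values are invented.

# Illustrative computation of P@k and average precision for a single topic;
# not the official evaluation code of the track.
def precision_at_k(ranked, relevant, k):
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

ranked = ["12", "997", "9989"]   # ranked page IDs from a run (example values)
relevant = {"12", "9989"}        # assessed relevant page IDs (example values)
print(precision_at_k(ranked, relevant, 5), average_precision(ranked, relevant))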
Date       | Milestone
January 17 | Reference collections available for download
February 1 | Supplementary collections available for download
March 15   | Topics for Ad-hoc & Jeopardy tasks distributed
May 15     | Run submission deadline
May 15-31  | Relevance assessments
June 8     | Release of assessments and results
June 15    | Submission of CLEF 2013 Working Notes papers
June 30    | Submission of CLEF 2013 Labs Overviews
Sep. 23-26 | CLEF 2013 Conference