INEX 2012 Linked Data Track

Home | About | 2013

Overview

The goal of the new Linked Data track is to investigate retrieval techniques over a combination of textual and highly structured data, where RDF properties carry additional key information about semantic relations among data objects that cannot be captured by keywords alone. We intend to investigate if and how structural information could be exploited to improve ad-hoc retrieval performance, and how it could be used in combination with structured queries to help users navigate or explore large sets of results (known from faceted search systems), or to address Jeopardy-style natural-language queries (known from question answering) which are translated into a semi-structured query format.

The new Linked Data track at INEX 2012 thus aims to close the gap between IR-style keyword search and Semantic-Web-style reasoning techniques. Our goal is to bring together different communities and to foster research at the intersection of Information Retrieval, Databases, and the Semantic Web.

As its core collection, the Linked Data track will use a fusion of XML-ified Wikipedia articles with RDF properties from both DBpedia and YAGO2, the latter of which contain the article entity as either their first or second argument. The core collection (Wikipedia-LOD v1.1, see below) is based on the popular MediaWiki format (see http://dumps.wikimedia.org/enwiki/20110722/), where all Wiki-markup has been replaced by proper XML tags and CDATA sections. In addition, all internal Wikipedia links (including the article entity itself) have been enriched with links to both their corresponding DBpedia and YAGO2 entities (as far as available). Participants are explicitly encouraged to make use of more RDF facts available from DBpedia and YAGO2, in particular for processing the reasoning-related Faceted Search and Jeopardy topics.

For INEX 2012, we will explore three different retrieval tasks:

The classic Ad-hoc Retrieval task investigates informational queries to be answered mainly by the textual contents of the Wikipedia articles.
The Faceted Search task employs a hand-crafted hierarchy of facets and facet-values obtained from DBpedia that aim to guide the searcher toward relevant information.
The brand-new Jeopardy task employs natural-language Jeopardy clues which are manually translated into a semi-structured query format based on SPARQL with keyword conditions.

Data Collection

The new Wikipedia-LOD collection is available from the following link:

Wikipedia-LOD v1.1 (updated on 2012-06-22): http://inex-lod.mpi-inf.mpg.de/2012/

The collection consists of 3 compressed tar.gz files and contains an overall amount of 3.2 Mio XML articles. The uncompressed size of the collection is 61 GB.

A DTD for the XML collection is available here: wikipedia-lod-xml.dtd

In addition to the new core collection, which is based on XML-ified Wikipedia articles, the Linked Data track explicitly encourages (but does not require) the use of current Linked Data dumps for DBpedia (v3.7) and YAGO2, which are available from the following URLs:

DBpedia v3.7 (created in July 2011): http://downloads.dbpedia.org/3.7/en/
YAGO2 core and full dumps (created on 2012-01-09): http://www.mpi-inf.mpg.de/yago-naga/yago/

Each Wikipedia-LOD article consists of a mixture of XML tags and CDATA sections, containing infobox attributes, free-text contents describing the entity or category that the article captures, and a section with both DBpedia and YAGO2 properties that are related to the article's entity. All sections contain links to other Wikipedia articles (including links to the corresponding DBpedia and YAGO2 resources), Wikipedia categories, and external Web pages.

DBpedia and YAGO2 are two comprehensive, common-sense knowledge bases providing structured information that has been semi-automatically extracted mostly using Wikipedia infoboxes and categories. Both knowledge bases focus on extracting attribute-value pairs from Wikipedia infoboxes and category lists, which serve as basis for applying various information extraction techniques. They also contain geo-coordinates, links between Wikipedia pages, redirection and disambiguation pages, external links, and so on. Each Wikipedia page corresponds to a resource in DBpedia and YAGO2. The connection between the data sets is given in the "wikipedia_links_en" file from DBpedia. See, for example:
<http://dbpedia.org/resource/AccessibleComputing> <http://xmlns.com/foaf/0.1/page> < http://en.wikipedia.org/wiki/AccessibleComputing>

The Linked Data track is intended as an open track and thus invites participants to include more Linked Data (see, for example, linkeddata.org) or other sources that go beyond "just" DBpedia and YAGO2. Any inclusion of further data sources is welcome, however, research papers and workshop submissions should explicitly mention these sources when describing their approaches.

1. Ad-Hoc Search Task

The task is to return a ranked list of results (Wikipedia pages or equivalently DBpedia resources) estimated relevant to the user’s information need.

Topics

Topics for the Ad-hoc Search task

Submission Format

Each participant may submit up to 3 runs. Each run can contain a maximum of 1000 results per topic, ordered by decreasing value of relevance. The results of one run must be contained in one submission file (i.e. up to 3 files can be submitted in total). For relevance assessment and evaluation of the results we require submission files to be in the familiar TREC format.

Here:

the first column is the topic number.
the second column is the query number within that topic. This is currently unused and should always be Q0.
the third column is the ID of the result Wikipedia page.
the fourth column is the rank of the result.
the fifth column shows the score (integer or floating point) that generated the ranking.
the sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used. Run tags must contain 12 or fewer letters and numbers, with NO punctuation, to facilitate labeling graphs with the tags.

An example submission is:

2012001 Q0 12 1 0.9999 2012UniXRun1
2012001 Q0 997 2 0.9998 2012UniXRun1
2012001 Q0 9989 3 0.9997 2012UniXRun1

Here are three results for topic “2012001”. The first result is the Wikipedia page with ID "12". The second result is the page with ID "997", and the third result is the page with ID "9989".

Relevance Assessments

Relevance assessment will be conducted by participating groups. All the submitted results for each query will be pooled and assessors are asked to identify all relevant results in the pool using the INEX assessment tool.

Evaluation

The effectiveness of the retrieval results submitted by the participants will be evaluated using the classical IR metrics, e.g. MAP, P@5, P@10, NDCG and so on.

2. Faceted Search Task

Given an exploratory or broad query, the search system may return a large number of results. Faceted search is a way to help users navigate through the large set of results to quickly identify the results of interest. It presents the user a list of facet-values to refine the query. After the user choosing from the suggested facet-values, the result list is narrowed down and then the system may present a new list of facet-values for the user to further refine the query. The interactive process continues until the user finds the items of interest. One of the key issues in faceted search systems is to recommend appropriate facet-values to help the user quickly identify what he/she really wants in the large set of results. The task aims to investigate different techniques of recommending facet-values.

Topics

Topics for the Faceted Search task

Submission Format

The result file, which contains a result list of maximum 2000 results for each general topic, will be provided by the organizers.

Reference result list for the faceted search task

Based on the reference result file, each participant may submit up to 3 runs. Each run is a XML file conforming to the following DTD, which contains a hierarchy of recommended facet-values for each topic, in which each node represents a facet-value and all of its children form the newly recommended facet-value list when the user selects this facet-value to refine the query. The maximum fanout of each node in the hierarchy is restricted to be 20.
<!ELEMENT run (topic+)>
<!ATTLIST run rid ID #REQUIRED>
<!ELEMENT topic (fv+)>
<!ATTLIST topic tid ID #REQUIRED>
<!ELEMENT fv (fv*)>
<!ATTLIST fv f CDATA #REQUIRED
v CDATA #REQUIRED>

Here:

the root element is <run>, which has an ID type attribute, rid, representing the unique identifier of the run.
the <run> contains one or more <topic>s. The ID type attribute, tid, in each <topic> gives the topic number.
each <topic> includes a list of <fv>s. Each <fv> shows a facet-value pair, with f attribute being the facet and v attribute being the value. All the possible facet-values are the predicate-object pairs associated with the DBpedia resources.
the <fv>s can be nested to form a hierarchy of facet-values.

An example submission is:

Here for the topic “2012001”, the search system first recommends the facet-value condition "dbpedia-owl:date=1955-11-01" among other facet-value conditions, which are its siblings. If the user selects this condition to refine the query, the system will recommend a new list of facet-value conditions, which are "dbpedia-owl:place=dbpedia:South_Vietnam" and "dbpedia-owl:place=dbpedia:North_Vietnam". If the user then selects "dbpedia-owl:place=dbpedia:North_Vietnam", the system will recommend the facet-value condition "rdbprob:capital=dbpedia:Ho_Chi_Minh_City". Note that no facet-value condition occurs twice on a path in the hierarchy.

Relevance Assessments

All the subtopics will be added to the set of topics for the ad-hoc search task. Thus they can be assessed in the same way as other ad-hoc search topics. We will take the relevance results of all or any of the subtopics for a particular general topic as the relevance result of this general topic.

Evaluation

We use again the NDCG [1] and Interaction Cost [2] metrics as that used in INEX 2011 to evaluate the runs. In addition, we might test some other possible metrics, e.g. α-nDCG proposed in [3].

[1] A. Schuth, M. Marx, Evaluation methods for rankings of facetvalues for faceted search, In Multilingual and Multimodal Information Access Evaluation 2011, Volume 6941, pages 131-136.
[2] Q. Wang, G.R. Camps, M. Marx, M. Theobald, J. Kamps, Overview of INEX 2011 Data-Centric Track, In Pre-proceedings of INEX 2011.
[3] C.L. Clarke, M. Koola, G.V. Cormack, et. al., Novelty and diversity in information retrieval evaluation, SIGIR 2008, pages 659-666.

3. Jeopardy Task

The new Jeopardy task investigates retrieval techniques over a set of natural-language Jeopardy clues, which have been manually translated into SPARQL query patterns and enhanced with keyword-based filter conditions. Specifically, we investigate a data model, where every entity (in DBpedia or YAGO) is associated with the Wikipedia article (contained in the Wikipedia-LOD v1.1 collection) that describes this entity.

For example, topic no. 2012301 from the current set of Jeopardy topics looks as follows:

<topic id="2012301" category="LAKES">
    <jeopardy_clue>Niagara Falls has its source of origin from this lake. </jeopardy_clue>
    <keyword_title>Niagara Falls source lake</keyword_title>
    <sparql_ft>
      Select ?q Where {
         <http://dbpedia.org/resource/Niagara_Falls> <http://dbpedia.org/property/watercourse> ?o .
         ?o <http://dbpedia.org/ontology/origin> ?q .
         Filter FTContains(?o, "river water course niagara") .
         Filter FTContains(?q, "lake origin of")}
    </sparql_ft>
</topic>

The <jeopardy_clue> element contains the original Jeopardy clue as a natural-language sentence; the <keyword_title> element contains a set of keywords that has been manually extracted from this title and will be reused as part of the ad-hoc task; and the <sparql_ft> element contains a formulation of the natural-language sentence into a correspondig SPARQL pattern. The <category> attribute of the <topic> element may be used as an additional hint for disambiguating the query.

In the above query, the DBpedia entity http://dbpedia.org/resource/Niagara_Falls has been marked as the subject of the first triplet pattern, while both the object of the first triplet pattern and the subject and object of the second triplet pattern are unknown. The two FTContains filter conditions however restrict both these subjects and objects to entities that should be associated with the keywords "river water course niagara" and "lake origin" via their corresponding Wikipedia articles, respectively.

Since this particular variant of SPARQL with full-text filter conditions cannot be run against a standard RDF collection (such as DBpedia or YAGO) alone, participants are encouraged to develop individual solutions to index both the RDF and textual contents of the Wikipedia-LOD collection in order to process these queries.

Topics

An XML file with 90 Jeopardy-style topics is available here:

Topics for the Jeopardy Search task

Submission Format

Again, each participanting group may submit up to 3 runs. Each run can contain a maximum of 1000 results per topic, ordered by decreasing value of relevance (although we expect most topics to have just one or a few entities or sets of entities as targets). The results of one run must be contained in one submission file (i.e. up to 3 files can be submitted in total). For relevance assessment and evaluation of the results we require submission files to be in the familiar TREC format, however containing one row of target entities (denoted by their Wikipedia page ID's) that denote each query result. Each row of target entities must reflect the order of query variables as specified by the Select clause of the Jeopardy topic. In case the Select clause contains more than one query variable, the row should consist of a comma- or semicolon-separated list of target entity ID's.

Here:

the first column is the topic number.
the second column is the query number within that topic. This is currently unused and should always be Q0.
the third column is a comma- or semicolon-separated list the ID’s of the resulting Wikipedia page(s).
the fourth column is the rank of the result.
the fifth column shows the score (integer or floating point) that generated the ranking.
the sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used. Run tags must contain 12 or fewer letters and numbers, with NO punctuation, to facilitate labeling graphs with the tags.

An example submission is:

2012301 Q0 12 1 0.9999 2012UniXRun1
2012301 Q0 997 2 0.9998 2012UniXRun1
2012301 Q0 9989 3 0.9997 2012UniXRun1

Here are three results for topic "2012301". The first result is the entity (i.e. Wikipedia page) with ID "12". The second result is the entity with ID "997", and the third result is the entity with ID "9989".

Relevance Assessments

Relevance assessments for the Jeopardy task will be conducted by the groups participating in this task. All the submitted results for each query will be pooled and assessors are asked to identify all relevant results in the pool using the INEX assessment tool.

In addition, all keyword titles from the Jeopardy topics will be added to the set of topics for the ad-hoc search task. Thus they can be assessed in the same way as the other ad-hoc search topics.

Evaluation

The effectiveness of the retrieval results submitted by the participants will be evaluated using the classical IR metrics, e.g. MAP, P@5, P@10, NDCG and so on. In addition, we will explore different measures known from entity-centric retrieval measures.

Evaluation Results

Schedule

April 14         Data collection available for download
May 1            Ad-hoc & Faceted Search topics distributed
June 6           Jeopardy topics distributed
July 14         Run submission deadline
July 15-31     Relevance assessments
Aug. 1           Release of assessments and results
Aug. 17         Submission of CLEF 2012 Working Notes papers
Aug. 24         Submission of CLEF 2012 Labs Overviews
Sep. 17-20    CLEF 2012 Conference

Organizers

Ad-hoc Search Task

Qiuyue Wang
Renmin University of China
qiuyuew@ruc.edu.cn

Jaap Kamps
University of Amsterdam
kamps@uva.nl

Georgina Ramírez Camps
Universitat Pompeu Fabra
georgina.ramirez@upf.edu

Faceted Search Task

Maarten Marx
University of Amsterdam
maartenmarx@uva.nl

Anne Schuth
University of Amsterdam
anne.schuth@uva.nl

Jeopardy Search Task

Martin Theobald
Max-Planck-Institut für Informatik
martin.theobald@mpi-inf.mpg.de

Sairam Gurajada
Max-Planck-Institut für Informatik
gurajada@mpi-inf.mpg.de

Arunav Mishra
Max-Planck-Institut für Informatik
amishra@mpi-inf.mpg.de

Imprint | Data protection | Contact someone about INEX