INEX 2011 Books and Social Search Track
Announcements
Download the Task Guidelines for the SB and PI tasks.
Results for the Social Search for Best Books task are available.
Overview
For centuries books were the dominant source of information, but how we acquire, share, and publish information is changing in fundamental ways due to the Web.
The goal of the Books and Social Search Track is to investigate techniques to support users in searching and navigating the full texts of digitized books and
complementary social media, and to provide a forum for the exchange of research ideas and contributions. Towards this goal, the track is building appropriate
evaluation benchmarks complete with test collections for focused, social and semantic search tasks.
The track touches on a range of fields, including information retrieval (IR), information science (IS), human-computer interaction (HCI), digital libraries (DL), and eBooks.
In 2011, the track will focus on two main search tasks and two additional related activities:
- Social Search for Best Books (SB): Building on a collection of 2.8 million records from Amazon Books and LibraryThing.com, this task will investigate the value of
user-generated metadata, such as reviews and tags, in addition to publisher-supplied and library catalogue metadata, to aid retrieval systems in finding the best,
most relevant books for a set of topics of interest. Systems will need to return a reading list comprising a ranking of recommended books for each topic.
Topics will vary in their format, from simple queries to more extensive descriptions of the information need, some of which also include example books and indications
of the user's level of knowledge on the topic.
- Prove It (PI): Building on a corpus of over 50,000 digitized books, this task will test the application of both focused and semantic retrieval approaches to
digitized books. Systems will need to return a ranked list of book pages containing information that can confirm or refute a factual statement, optionally
indicating for each result whether it provides positive or negative evidence regarding the factual claim.
- Active Reading (AR): Building on a selection of up to 50 books, relevant to selected user communities (e.g., children establishing their literacy skills,
historians, adults reading for pleasure, etc.) in specific scenarios (e.g., fact finding, learning, reading for pleasure, etc.), this task will involve
conducting a series of user studies into active reading, exploring how and why readers use eBooks, with a focus on eBook usability.
Participants will share a common study design and will be supported in preparing a suitable collection, setting up the studies and analysing
the resulting data. The main outcome of this task will be the comparison of results across collections and scenarios.
- Structure Extraction (SE): Run as part of the ICDAR 2011 competition,
this task builds on a set of 1,000 digitized books to evaluate automatic techniques for constructing
hyperlinked tables of contents from OCR text and layout information.
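As an illustration of the kind of output the Prove It task asks for (a ranked list of book pages, each optionally marked as positive or negative evidence), here is a minimal Python sketch. The class name, field names, and cut-off are assumptions made for this example only; the official run format is defined in the Task Guidelines, not here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageResult:
    """One retrieved book page for a factual statement (hypothetical structure)."""
    book_id: str   # illustrative identifier, not an official field name
    page: int
    score: float
    # "positive", "negative", or None when the system gives no indication
    evidence: Optional[str] = None


def rank_results(results: List[PageResult], limit: int = 1000) -> List[PageResult]:
    """Return results sorted by descending score, cut off at `limit` (assumed value)."""
    return sorted(results, key=lambda r: r.score, reverse=True)[:limit]


candidates = [
    PageResult("book-A", 12, 0.41),
    PageResult("book-B", 230, 0.87, evidence="positive"),
    PageResult("book-A", 57, 0.65, evidence="negative"),
]
ranked = rank_results(candidates)
for rank, r in enumerate(ranked, start=1):
    print(rank, r.book_id, r.page, r.score, r.evidence or "unspecified")
```

The same shape would serve for an SB reading list if pages were replaced by whole books; only the ranking step is shown, not how scores are produced.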
Participation
Participants will gain access to the 2011 data sets (see below) as well as the official INEX 2010 topics and relevance judgements, which may be used for training.
Important! All participants are expected to submit a minimum of one run (or set of study results) to at least one of the tasks listed above.
Only participants who contributed runs will gain access to relevance data collected in 2011.
Optionally, participants may also contribute to the relevance assessments (either directly or via crowdsourcing).
Data sets
- The book corpus consists of over 50,000 digitized, out-of-copyright books, provided by Microsoft Live Book Search and the Internet Archive (for non-commercial purposes only).
The full text of each book is in an XML format, referred to as BookML. Most books also have an associated metadata file (*.mrc), containing publication (author, title, etc.)
and classification information in MAchine-Readable Cataloging (MARC) record format.
- The social data collection contains metadata for 2.8 million books crawled from the online book store of Amazon and the social cataloging web site of LibraryThing
in February and March 2009 by the University of Duisburg-Essen. The data set is available as a MySQL database or as XML. For each book, the following metadata is
included:
  - From Amazon: ISBN, title, binding, label, list price, number of pages, publisher, dimensions, reading level, release date, publication date, edition,
  Dewey classification, title page images, creators, similar products, height, width, length, weight, reviews (rating, author id, total votes, helpful votes,
  date, summary, content), editorial reviews (source, content).
  - From LibraryThing: Tags (including occurrence frequency), blurbs, dedications, epigraphs, first words, last words, quotations, series, awards,
  browse nodes, characters, places, subjects.
- The bookshelves used for the AR task can be freely selected from the main corpus of 50,000 digitized books. Please contact the organizers with your request.
- A corpus of 1,000 books in DjVu XML format and PDF is available for the SE task.
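To give a feel for working with the XML version of the social data collection, the following Python sketch reads one book record and summarizes its reviews and tags. The element and attribute names in the sample record (`book`, `review`, `rating`, `tag`, `count`, and so on) are assumptions made for illustration and may not match the actual schema of the distributed data.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element and attribute names are illustrative only
# and are NOT taken from the actual collection schema.
SAMPLE = """
<book>
  <isbn>0000000000</isbn>
  <title>Example Title</title>
  <reviews>
    <review><rating>4</rating><helpfulvotes>3</helpfulvotes><totalvotes>5</totalvotes></review>
    <review><rating>2</rating><helpfulvotes>0</helpfulvotes><totalvotes>1</totalvotes></review>
  </reviews>
  <tags>
    <tag count="7">history</tag>
    <tag count="2">biography</tag>
  </tags>
</book>
"""


def summarize(xml_text):
    """Extract ISBN, title, mean review rating, and tag frequencies from one record."""
    root = ET.fromstring(xml_text.strip())
    ratings = [int(r.findtext("rating")) for r in root.iter("review")]
    tags = {t.text: int(t.get("count", "1")) for t in root.iter("tag")}
    return {
        "isbn": root.findtext("isbn"),
        "title": root.findtext("title"),
        "mean_rating": sum(ratings) / len(ratings) if ratings else None,
        "tags": tags,
    }


summary = summarize(SAMPLE)
print(summary)
```

Participants should check the real element names against the downloaded data before adapting a sketch like this.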
Resources
Participants will have access to a dedicated Book Track server that hosts all the data (see above)
as well as the Book Search system that allows participants to browse and search the books in the main corpus.
Schedule
Common deadlines for all tasks:
Nov TBC | Papers due for the INEX 2011 proceedings
Dec 12-14 | INEX Workshop
Search tasks (SB and PI):
May | Data sets available for download
June 15 | Task descriptions published
June 15 | Topic sets available
Sept 4 | Run submissions deadline
Sept 30 | Release of relevance assessments and results
Active Reading task:
July 15 | Submission of draft user study plan and request for up to 50 books
Sept 30 | Submission deadline for user study results
Oct 20 | Distribution of collected data
Structure Extraction Task:
March 1 | Data available for download
May 3 | Submission of generated ToCs
June 7 | Ground-truth annotations due
Sept 18 | Result announcement and distribution of ground truth
See the SE competition web site for further details.
Organizers
Search tasks
Gabriella Kazai
Microsoft Research Cambridge
gabkaz@microsoft.com
Marijn Koolen
University of Amsterdam
M.H.A.Koolen@uva.nl
Jaap Kamps
University of Amsterdam
kamps@uva.nl
Structure Extraction task
Antoine Doucet
University of Caen
doucet@info.unicaen.fr
Active Reading task
Monica Landoni
University of Lugano
monica.landoni@unisi.ch