Announcements

Download the Task Guidelines for the SB and PI tasks.

Results for the Social Search for Best Books task are available.

Overview

For centuries books were the dominant source of information, but how we acquire, share, and publish information is changing in fundamental ways due to the Web. The goal of the Books and Social Search Track is to investigate techniques to support users in searching and navigating the full texts of digitized books and complementary social media, and to provide a forum for the exchange of research ideas and contributions. Towards this goal the track is building appropriate evaluation benchmarks complete with test collections for focused, social and semantic search tasks. The track touches on a range of fields, including information retrieval (IR), information science (IS), human-computer interaction (HCI), digital libraries (DL), and eBooks.

In 2011, the track will focus on two main search tasks and two additional related activities:

Social Search for Best Books (SB): Building on a collection of 2.8 million records from Amazon Books and LibraryThing.com, this task will investigate the value of user-generated metadata, such as reviews and tags, in addition to publisher-supplied and library catalogue metadata, to aid retrieval systems in finding the best, most relevant books for a set of topics of interest. Systems will need to return a reading list comprising a ranking of recommended books for each topic. Topics will vary in their format, from simple queries to more extensive descriptions of the information need, some of which also include example books and indications of the user's level of knowledge on the topic.
Prove It (PI): Building on a corpus of over 50,000 digitized books, this task will test the application of both focused and semantic retrieval approaches to digitized books. Systems will need to return a ranked list of book pages containing information that can confirm or refute a factual statement, (optionally) indicating for each result whether it provides positive or negative evidence regarding the factual claim.
Active Reading (AR): Building on a selection of up to 50 books, relevant to selected user communities (e.g., children establishing their literacy skills, historians, adults reading for pleasure etc.) in specific scenarios (e.g., fact finding, learning, reading for pleasure, etc.), this task will involve conducting a series of user studies into active reading, exploring how and why readers use eBooks with a focus on eBook usability. Participants will share a common study design and will be supported in preparing a suitable collection, setting up the studies and analysing the resulting data. The main outcome of this task will be the comparison of results across collections and scenarios.
Structure Extraction (SE): Run as part of the ICDAR 2011 competition, this task builds on a set of 1,000 digitized books to evaluate automatic techniques for constructing hyperlinked tables of contents from OCR text and layout information.
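
To make the Prove It output described above more concrete, the following is a minimal sketch of how a system might hold and order its results for one factual statement: a ranked list of book pages, each optionally marked as positive or negative evidence. The class name, field names, identifiers, and scores are all hypothetical and chosen for illustration only; the official run submission format is defined in the Task Guidelines, not here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageResult:
    """One retrieved book page (all field names are hypothetical)."""
    book_id: str
    page: int
    score: float
    # "positive" = supports the claim, "negative" = refutes it,
    # None = the system did not indicate a polarity (it is optional).
    evidence: Optional[str] = None


def rank_pages(results: List[PageResult]) -> List[PageResult]:
    """Return results ordered by descending score; ties keep input order."""
    return sorted(results, key=lambda r: r.score, reverse=True)


# Made-up candidates for a single factual statement.
candidates = [
    PageResult("book-A", 12, 0.41),
    PageResult("book-B", 7, 0.93, evidence="positive"),
    PageResult("book-A", 88, 0.67, evidence="negative"),
]

ranked = rank_pages(candidates)
for rank, r in enumerate(ranked, start=1):
    print(rank, r.book_id, r.page, r.score, r.evidence or "-")
```

The same shape, minus the page number and evidence label, would also fit a Social Search for Best Books reading list, where each entry is a recommended book rather than a page.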

Participation

Participants will gain access to the 2011 data sets (see below) as well as the official INEX 2010 topics and relevance judgements, which may be used for training.

Important! All participants are expected to submit a minimum of one run (or set of study results) to at least one of the tasks listed above. Only participants who contribute runs will gain access to relevance data collected in 2011. Optionally, participants may also contribute to the relevance assessments (either directly or via crowdsourcing).

Data sets

The book corpus consists of over 50,000 digitized, out-of-copyright books, provided by Microsoft Live Book Search and the Internet Archive (for non-commercial purposes only). The full text of each book is in an XML format, referred to as BookML. Most books also have an associated metadata file (*.mrc), containing publication (author, title, etc.) and classification information in MAchine-Readable Cataloging (MARC) record format.
The social data collection contains metadata for 2.8 million books crawled from the online book store of Amazon and the social cataloging web site of LibraryThing in February and March 2009 by the University of Duisburg-Essen. The data set is available as a MySQL database or as XML. For each book, the following metadata is included:
1. From Amazon: ISBN, title, binding, label, list price, number of pages, publisher, dimensions, reading level, release date, publication date, edition, Dewey classification, title page images, creators, similar products, height, width, length, weight, reviews (rating, author id, total votes, helpful votes, date, summary, content), editorial reviews (source, content).
2. From LibraryThing: Tags (including occurrence frequency), blurbs, dedications, epigraphs, first words, last words, quotations, series, awards, browse nodes, characters, places, subjects.
The bookshelves used for the AR task can be freely selected from the main corpus of 50,000 digitized books. Please contact the organizers with your request.
A corpus of 1,000 books in DjVu XML format and PDF is available for the SE task.

Resources

Participants will have access to a dedicated Book Track server that hosts all the data (see above) as well as the Book Search system that allows participants to browse and search the books in the main corpus.

Schedule

Common deadlines for all tasks:

Nov TBC	Papers due for the INEX 2011 proceedings
Dec 12-14	INEX Workshop

Search tasks (SB and PI):

May	Data sets available for download
June 15	Task descriptions published
June 15	Topic sets available
Sept 4	Run submissions deadline
Sept 30	Release of relevance assessments and results

Active Reading task:

July 15	Submission of draft user study plan and request for up to 50 books
Sept 30	Submission deadline for user study results
Oct 20	Distribution of collected data

Structure Extraction Task:

March 1	Data available for download
May 3	Submissions of generated ToCs
June 7	Ground-truth annotations due
Sept 18	Result announcement and distribution of ground-truth

See the SE competition web site for further details.

Organizers

Search tasks

Gabriella Kazai
Microsoft Research Cambridge
gabkaz@microsoft.com

Marijn Koolen
University of Amsterdam
M.H.A.Koolen@uva.nl

Jaap Kamps
University of Amsterdam
kamps@uva.nl

Structure Extraction task

Antoine Doucet
University of Caen
doucet@info.unicaen.fr

Active Reading task

Monica Landoni
University of Lugano
monica.landoni@unisi.ch
