Overview

For centuries books were the dominant source of information, but the Web is fundamentally changing how we acquire, share, and publish information. The goal of the Social Book Search Track is to investigate techniques that support users in searching and navigating the full texts of digitized books and complementary social media, and to provide a forum for the exchange of research ideas and contributions. Towards this goal, the track is building evaluation benchmarks, complete with test collections, for focused, social and semantic search tasks. The track touches on a range of fields, including information retrieval (IR), information science (IS), human-computer interaction (HCI), digital libraries (DL), and eBooks.

The Social Book Search Track runs two search tasks:

1) Social Book Search: The Social Book Search (SBS) task is the main task and aims to investigate the relative value of traditional book metadata and user-generated content for book search. Due to social media, there is much more information about books online than what's in library catalogues. Users of Amazon and LibraryThing review, rate and tag books and talk about them in discussion forums. Social media have made book search more complex: users have information needs that go beyond the information captured by library catalogue data. The SBS task aims to evaluate book search in such scenarios. Using book requests from the LibraryThing discussion forums and a collection of 2.8 million book descriptions from Amazon and LibraryThing, the task is to return a ranked list of book suggestions to the user.
2) Prove It (PI): Building on a corpus of over 50,000 digitized books, this task will test the application of both focused and semantic retrieval approaches to digitized books. Systems will need to return a ranked list of book pages containing information that can confirm or refute a factual statement, (optionally) indicating for each result whether it provides positive or negative evidence regarding the factual claim.

In addition to these search tasks, there are two related projects.

The Structure Extraction (SE) task runs as part of the ICDAR 2013 competition. It builds on a set of 1,000 digitized books to evaluate automatic techniques for constructing hyperlinked tables of contents from OCR text and layout information. Please contact Antoine Doucet for more information.

The Active Reading Task (ART) investigates how hardware and software for eBooks can support readers engaged in a variety of reading-related activities, such as fact finding, memory tasks and learning. The goal of the investigation is to derive user requirements and, from them, design recommendations for more usable tools to support active reading practices for eBooks. For questions and more information, please contact Monica Landoni.

Participation

You can register here to participate in the INEX 2012 Social Book Search Track.

Participants will gain access to the 2012 data sets (see below) as well as the official INEX 2010-2011 topics and relevance judgements, which may be used for training.

Important! All participants are expected to submit a minimum of one run (or set of study results) to at least one of the tasks listed above. Only participants who contributed runs will gain access to relevance data collected in 2012. Optionally, participants may also contribute to the relevance assessments (either directly or via crowdsourcing).

Data sets

The social data collection contains metadata for 2.8 million books crawled from the online book store of Amazon and the social cataloging web site of LibraryThing in February and March 2009 by the University of Duisburg-Essen. The data set is in XML. Please fill in the license agreement to obtain access to the collection. For each book, the following metadata is included:
1. From Amazon: ISBN, title, binding, label, list price, number of pages, publisher, dimensions, reading level, release date, publication date, edition, Dewey classification, title page images, creators, similar products, height, width, length, weight, reviews (rating, author id, total votes, helpful votes, date, summary, content), editorial reviews (source, content).
2. From LibraryThing: Tags (including occurrence frequency), blurbs, dedications, epigraphs, first words, last words, quotations, series, awards, browse nodes, characters, places, subjects.
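Since the social data collection is distributed as XML, a record can be read with any standard XML parser. The following is only a minimal sketch in Python: the element and attribute names (book, isbn, title, review, rating, tag, count) and the sample values are hypothetical placeholders chosen for illustration and are not the collection's actual schema, which should be checked against the downloaded data.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the element names, attribute names, and values below
# are illustrative assumptions, NOT the actual schema of the collection.
SAMPLE = """
<book>
  <isbn>0000000000</isbn>
  <title>An Example Title</title>
  <reviews>
    <review><rating>4</rating><content>Enjoyable.</content></review>
    <review><rating>5</rating><content>Excellent.</content></review>
  </reviews>
  <tags>
    <tag count="12">fiction</tag>
    <tag count="3">classic</tag>
  </tags>
</book>
"""


def parse_book(xml_text):
    """Extract a few fields from one (hypothetical) book record."""
    root = ET.fromstring(xml_text.strip())
    # Amazon-style review ratings, as integers.
    ratings = [int(r.findtext("rating")) for r in root.iter("review")]
    # LibraryThing-style tags mapped to their occurrence frequency.
    tags = {t.text: int(t.get("count")) for t in root.iter("tag")}
    return {
        "isbn": root.findtext("isbn"),
        "title": root.findtext("title"),
        "ratings": ratings,
        "tags": tags,
    }


book = parse_book(SAMPLE)
```

Keeping professional metadata (ISBN, title) and user-generated content (ratings, tags) in separate fields makes it straightforward to index them separately and compare their relative value for retrieval, which is the question the SBS task poses.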
The book corpus consists of over 50,000 digitized, out-of-copyright books, provided by Microsoft Live Book Search and the Internet Archive (for non-commercial purposes only). The full text of each book is in an XML format referred to as BookML. Most books also have an associated metadata file (*.mrc), containing publication (author, title, etc.) and classification information in Machine-Readable Cataloging (MARC) record format. Details on the collection are available in the Download area of the Book Server. To access the Book Server, use your INEX login (you need to have logged in to the INEX site at least once beforehand). If your login doesn't work, please email Gabriella Kazai and provide your INEX username.
A corpus of 1,000 books in DjVu XML format and PDF is available for the SE task.

Schedule

7 May	SB+PI topics and task guidelines released
15 June	SB run submission
22 June	PI run submission
22 June	SB: LT forum evaluation results released
20 July	SB+PI: AMT evaluation results released
Mid August	Pre-proceedings working notes
17-20 September	CLEF conference and INEX@CLEF lab in Rome, Italy

Organizers

Search tasks

Gabriella Kazai
Microsoft Research Cambridge
gabkaz@microsoft.com

Marijn Koolen
University of Amsterdam
marijn.koolen@uva.nl

Jaap Kamps
University of Amsterdam
kamps@uva.nl

Michael Preminger
Oslo and Akershus University College of Applied Sciences
michaelp@hioa.no

Structure Extraction task

Antoine Doucet
University of Caen
doucet@info.unicaen.fr

Active Reading task

Monica Landoni
University of Lugano
monica.landoni@unisi.ch
