Towards Realistic Known-item Topics for the ClueWeb

Claudia Hauff, Matthias Hagen, Anna Beyer, B. Stein

Research output: Chapter in Book/Conference proceedings/Edited volume; Conference contribution; Scientific; peer-review

3 Citations (Scopus)


Known-item finding is the task of re-finding and re-accessing an item previously seen. Typical examples of known items include accessed Web sites, received emails, or documents on one's personal desktop. Current research on known-item finding heavily relies on corpora of known-item queries and the respective known items. However, many existing corpora are proprietary and not available to the public (in particular those derived from Web query logs), a fact which does not allow for repeatable research. The existing publicly available corpora either contain automatically generated queries or queries that were manually generated while seeing the known item itself. Hence, we consider these public corpora to be rather artificial in nature.

In this paper, we propose a methodology to create a known-item topic set that is much more realistic and that is built on top of a large-scale public test corpus. From known-item questions posted on the popular Yahoo! Answers platform, we extract queries for known items in a crowdsourcing setup. Since we ensure that all the known items correspond to Web pages in the publicly available ClueWeb09 corpus (a large static Web crawl), we provide an environment for repeatable, realistic, Web-scale known-item searches.
Original language: English
Title of host publication: IIIX'12 Proceedings of the 4th Information Interaction in Context Symposium
Place of Publication: New York
Publisher: Association for Computing Machinery (ACM)
Number of pages: 4
ISBN (Electronic): 978-1-4503-1282-0
Publication status: Published - 21 Aug 2012
Event: The 4th Information Interaction in Context Symposium: IIIX'12 - Nijmegen, Netherlands
Duration: 21 Aug 2012 to 24 Aug 2012


Conference: The 4th Information Interaction in Context Symposium


  • ClueWeb
  • known-item


