Bob, the ACL Anthology test collection

Published: 12-01-2017 | Version 1 | DOI: 10.17632/9rrvd2myjy.1
Anna Ritchie,
Bill Hollingsworth


This test collection, henceforth Bob, was created at the University of Cambridge and is intended for information retrieval experiments with scientific literature. Bob consists of:

* documents.xml - almost 10,000 research papers from the ACL Anthology (the freely available digital archive of computational linguistics publications), packaged as one large XML document with <DOC> tags to delimit individual papers. These documents were processed individually using PTX, part of the Skimcast (TM) Semantic System; please see README_PTX for details and a reference.
* queries - 82 research questions from authors of ACL Anthology papers, in three files: queries.txt (a plaintext file containing all 82 queries with their Anthology-based IDs and numeric IDs), queries.lemur (a Lemur-style query file) and queries.indri (an Indri-style queries file).
* relevance judgements - judgements by the query authors as to the relevance of other papers in the ACL Anthology with respect to their queries, packaged together in the TREC-style qrels.txt (0==irrelevant, !0==relevant).

CONDITIONS OF USE: Bob may be used solely for non-commercial purposes. When publishing work using Bob, please cite the PhD thesis of Anna Ritchie. Below are BibTeX entries for the thesis and further publications describing the creation of the test collection.

@phdthesis{anna_ritchie_thesis,
  author = {Anna Ritchie},
  title = {Citation Context Analysis for Information Retrieval},
  year = {2008},
  school = {University of Cambridge, UK},
}

See for more related publications.
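The relevance convention for qrels.txt (0==irrelevant, !0==relevant) can be applied with a short script. The following is a minimal sketch, not part of the distributed collection. It assumes qrels.txt uses the usual four-column TREC layout (query ID, iteration, document ID, judgement, separated by whitespace); check the file itself before relying on that. The function name `parse_qrels` and the IDs in the sample are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def parse_qrels(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each query ID to the set of document IDs judged relevant.

    Assumes the four-column TREC layout: query_id iteration doc_id judgement.
    Any non-zero judgement counts as relevant, as the description states.
    """
    relevant: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, doc_id, judgement = fields
        try:
            grade = int(judgement)
        except ValueError:
            continue  # skip lines whose judgement is not an integer
        if grade != 0:
            relevant[query_id].add(doc_id)
    return dict(relevant)


# Hypothetical sample lines; real IDs come from qrels.txt.
sample = ["1 0 DOC-A 2", "1 0 DOC-B 0", "2 0 DOC-C 1"]
print(parse_qrels(sample))
```

To run it on the collection, pass an open file handle instead of the sample list, for example `parse_qrels(open("qrels.txt"))`. Queries with no relevant documents do not appear in the result.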