Elsevier's data and code for the bioCADDIE 2016 Dataset Retrieval Challenge

Published: 05-06-2017 | Version 1 | DOI: 10.17632/zd9dxpyybg.1
Peter Cotroneo


The Elsevier DataSearch (https://datasearch.elsevier.com) team participated in the bioCADDIE 2016 Dataset Retrieval Challenge. The results of the Challenge, along with the example and test queries, can be found here: https://biocaddie.org/biocaddie-2016-dataset-retrieval-challenge

We have submitted a paper detailing our work in the Challenge to DATABASE: The Journal of Biological Databases and Curation (to be published in the latter half of 2017). The attached file, elsevier-submission.zip, contains elsevier[1-5].txt, which correspond to the five submitted runs described in the paper.

The following describes the code that we developed for the Challenge.

Aspire Content Processing by Search Technologies (https://www.searchtechnologies.com/en-gb/aspire):
Dictionary.xml - Loads dictionaries (MeSH, Genes, Solr fields) into Aspire so that they can be used to identify concepts in text (document or query).
QueryAnalyzer.xml - Receives a query, identifies concepts using the dictionaries and returns a response containing information about the concepts in the query.
ProcessJSON.xml - Processes the JSON documents: flattens the metadata, identifies MeSH and Gene concepts and embeds them in the text, and prepares the document to be indexed by Solr.
ProcessJSONSimple.xml - Prepares JSON documents that were previously created by ProcessJSON.xml and sends them to Solr without any further processing. This is much quicker than running ProcessJSON.xml again.
All other aspects of Aspire (the Aspire framework, the content source to process a folder of JSON files, submission to Solr) are standard Aspire features with no customisation.

Solr:
Biocaddie.qpl - QPL file for processing a search query by sending a request to QueryAnalyzer.xml in Aspire, parsing the response and constructing a Lucene query.
Elsevier-solr.zip - Java project for a custom Solr Token Filter to index concept IDs in the same position as the words to which they relate.
All other aspects of Solr are standard Solr or QPL.

Dictionary Creation:
MeSH.groovy - Groovy script to convert a MeSH dictionary in ASCII format into a dictionary which can be used in Aspire.
Genes.groovy - Groovy script to convert a Gene dictionary into a dictionary which can be used in Aspire.

The file biocaddie-infosys-master_files.zip contains the following:
SolrQueryGen - Generates Solr queries from text. It supports unigrams, gazetteer lookup, lemmatisation and word-embedding expansion.
JudgementUI - UI for bioCADDIE manual judgments.

Additional utilities:
NLP4J - Natural language parsing (tokenisation, lemmatisation, part-of-speech tagging, etc.).
PseudoRelevanceFeedback - An alternative approach that was not integrated.
BioCaddieSpark - Apache Spark jobs to load and process data and index it into Solr.
BioCaddieServices - Backend services for JudgementUI.

Any questions about the code should be directed to datasearch-support@elsevier.com.
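As an illustration of what indexing concept IDs "in the same position as the words to which they relate" might look like, the following is a minimal Python sketch. It is not the Java Token Filter contained in Elsevier-solr.zip. The dictionary contents, the placeholder concept IDs (G001, C001) and the function name are hypothetical, and only single-word concepts are handled.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: lowercase surface form -> placeholder concept ID.
EXAMPLE_CONCEPTS: Dict[str, str] = {
    "brca1": "G001",
    "cancer": "C001",
}


def tokens_with_concepts(
    text: str, concepts: Dict[str, str]
) -> List[Tuple[str, int]]:
    """Return (term, position) pairs for a whitespace-tokenised text.

    When a word matches a dictionary entry, its concept ID is emitted as an
    extra token at the SAME position as the word. In Lucene terms this is
    comparable to a token with a position increment of 0.
    """
    out: List[Tuple[str, int]] = []
    for position, word in enumerate(text.lower().split()):
        out.append((word, position))
        concept_id = concepts.get(word)
        if concept_id is not None:
            # The concept ID shares the word's position instead of advancing it.
            out.append((concept_id, position))
    return out


if __name__ == "__main__":
    for term, position in tokens_with_concepts(
        "BRCA1 breast cancer", EXAMPLE_CONCEPTS
    ):
        print(position, term)
```

Keeping the concept ID at the word's position means that phrase and proximity matching over the original words is unaffected, while a search on the concept ID can still match the same location in the text.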
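The query-side flow described for QueryAnalyzer.xml and Biocaddie.qpl (identify concepts in the query, then construct a Lucene query) could be sketched roughly as below. This is a simplified Python illustration under stated assumptions, not the QPL or Aspire code: the field name "text", the placeholder concept IDs, the choice to join clauses with AND, and the function name are all hypothetical.

```python
import re
from typing import Dict

# Hypothetical toy dictionary: lowercase surface form -> placeholder concept ID.
QUERY_CONCEPTS: Dict[str, str] = {
    "brca1": "G001",
    "cancer": "C001",
}


def build_lucene_query(
    query: str, concepts: Dict[str, str], field: str = "text"
) -> str:
    """Build a Lucene-syntax query string from a free-text query.

    Each word becomes one clause. A word that matches a dictionary entry is
    expanded to (word OR conceptID), so that documents indexed with the
    concept ID can match even if they use a different surface form.
    """
    clauses = []
    # \w+ keeps only word characters, so no Lucene special characters remain.
    for word in re.findall(r"\w+", query.lower()):
        concept_id = concepts.get(word)
        if concept_id is None:
            clauses.append(f"{field}:{word}")
        else:
            clauses.append(f"{field}:({word} OR {concept_id})")
    # Assumption: all clauses are required; the actual runs may combine them differently.
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(build_lucene_query("BRCA1 breast cancer", QUERY_CONCEPTS))
```

Whether clauses are required or optional, and how concept matches are weighted, are design choices that this sketch does not attempt to reproduce.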