ChEMU-Ref dataset for Modeling Anaphora Resolution in the Chemical Domain

Published: 21-01-2021 | Version 1 | DOI: 10.17632/r28xxr6p92.1
Biaoyan Fang,
Christian Druckenbrodt,
Colleen Yeow Hui Shiuan,
Sacha Novakovic,
Ralph Hössel,
Saber A. Akhondi,
Jiayuan He,
Meladel Mistica,
Timothy Baldwin,
Karin Verspoor


In biochemistry, chemical compounds play an important role in pharmaceutical research and can help to save many lives from severe diseases. The discovery of new compounds is usually first reported in chemical patents, which makes patent corpus analysis important for biochemical research. However, extracting actionable knowledge from corpus data has long been recognised as a bottleneck for drug discovery. To tackle this bottleneck, an information extraction system that automatically decomposes research results, specifically chemical patents, into structured data can facilitate finding, relating, and reasoning over information for drug discovery. One of the most critical tasks in building such a system is extracting reaction information, including chemical products, reaction conditions, and the interactions between different products. However, natural language text, including biochemical literature, contains various referring relationships among expressions, and resolving them is critical to understanding the text. For instance, linguistic “shortcuts” (pronouns, abbreviations, etc.) are used to avoid redundancy in repeating names or complex descriptions. This is one of the major obstacles limiting the performance of information extraction systems, since a system must determine which entity is referred to in a given context. In chemistry, resolving referring relations is more challenging because it requires not only common knowledge but also chemical knowledge. The ChEMU-Ref dataset was therefore created to tackle anaphora resolution in the chemical domain.