ChEMU dataset for information extraction from chemical patents
The discovery of new chemical compounds and their synthesis processes is of great importance to the chemical industry. Patent documents contain critical and timely information about newly discovered chemical compounds, providing a rich resource for chemical research in both academia and industry. Chemical patents are often the first venues in which a new chemical compound is disclosed. Only a small proportion of chemical compounds are ever published in journals, and these publications can lag the patent disclosure by up to 3 years. In addition, chemical patent documents usually contain unique information, such as reaction steps and experimental conditions for compound synthesis and mode of action. These details are crucial for understanding compound prior art, and they provide a means for novelty checking and validation. Because of the high volume of chemical patents, approaches that enable automatic information extraction from these patents are in demand. To support the development of natural language processing methods for large-scale mining of chemical information from patent texts, a corpus was created consisting of chemical patent snippets with annotated entities and reaction steps.