FC-CoMIcs (Fuel Cell Corpus for Materials Informatics)
Description
FC-CoMIcs is a corpus of model-generated annotations extracted from ~1000 scientific articles on Oxygen Reduction Reaction (ORR) catalysts for fuel cells. The annotations were produced automatically by our trained Named Entity Recognition (NER) and Relation Extraction (RE) model. The corpus is intended to support Materials Informatics (MI) studies, particularly in the field of polymer electrolyte fuel cells. The dataset includes a text file listing the DOIs of the annotated articles, along with the corresponding Brat annotation files (.ann), which contain the extracted entities and relationships. The annotations follow a standard annotation scheme, so they can be used for both Natural Language Processing (NLP) and fuel cell research.
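As an illustration of how the .ann files might be consumed, the following minimal sketch parses Brat standoff annotations into entities and relations. It assumes the usual Brat layout: tab-separated text-bound lines such as `T1<TAB>Type start end<TAB>text` and relation lines such as `R1<TAB>Type Arg1:T1 Arg2:T2`. The labels `Material`, `Property`, and `HasProperty` and the sample content are hypothetical and are not taken from the corpus.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    id: str
    type: str
    spans: list  # list of (start, end) character offsets
    text: str


@dataclass
class Relation:
    id: str
    type: str
    arg1: str  # ID of the first argument entity
    arg2: str  # ID of the second argument entity


def parse_ann(content):
    """Parse Brat standoff content into ({entity_id: Entity}, [Relation])."""
    entities, relations = {}, []
    for line in content.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # Text-bound annotation: "Type start end" plus the mention text.
            ent_type, offsets = fields[1].split(" ", 1)
            # Discontinuous mentions separate their fragments with ";".
            spans = [tuple(int(x) for x in frag.split())
                     for frag in offsets.split(";")]
            entities[ann_id] = Entity(ann_id, ent_type, spans, fields[2])
        elif ann_id.startswith("R"):
            # Binary relation: "Type Arg1:Tx Arg2:Ty".
            rel_type, arg1, arg2 = fields[1].split()
            relations.append(Relation(ann_id, rel_type,
                                      arg1.split(":", 1)[1],
                                      arg2.split(":", 1)[1]))
        # Other annotation kinds (events, attributes, notes) are skipped.
    return entities, relations


# Hypothetical sample content; the labels are illustrative only.
SAMPLE = (
    "T1\tMaterial 0 4\tPt/C\n"
    "T2\tProperty 13 32\thalf-wave potential\n"
    "R1\tHasProperty Arg1:T1 Arg2:T2\n"
)

entities, relations = parse_ann(SAMPLE)
for rel in relations:
    print(rel.type, entities[rel.arg1].text, "->", entities[rel.arg2].text)
```

To read a file from the dataset, the same function can be applied to the text of any .ann file, for example `parse_ann(open(path, encoding="utf-8").read())`.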
Files
Steps to reproduce
The steps to reproduce the dataset from raw scientific literature are as follows.
1. Data Collection: Collect the research papers related to ORR catalysts for fuel cells that are listed in the provided DOI list.
2. Data Preprocessing: Convert the PDF, XML, or HTML articles into plain text.
3. Data Modeling: Train an NER + RE model to recognize entity types and relationship types.
4. Data Extraction: Apply the trained model to the preprocessed text to automatically extract entity mentions and relationships.
5. Dataset Packaging: Save the extracted entities and relationships in Brat annotation format (.ann) alongside the corresponding text files (.txt).
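The packaging step (step 5) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: it assumes the model output is already available as character-offset entity spans and relations between them, and the names `to_brat` and `save_pair`, the labels, and the example sentence are all hypothetical.

```python
from pathlib import Path


def to_brat(text, entities, relations):
    """Serialize predictions to Brat standoff format.

    entities:  list of (type, start, end) character-offset spans in `text`
    relations: list of (type, head, tail), where head and tail are 1-based
               positions in `entities` (an assumption of this sketch)
    """
    lines = []
    for i, (ent_type, start, end) in enumerate(entities, start=1):
        # The mention text is sliced from the document so it matches the offsets.
        lines.append(f"T{i}\t{ent_type} {start} {end}\t{text[start:end]}")
    for j, (rel_type, head, tail) in enumerate(relations, start=1):
        lines.append(f"R{j}\t{rel_type} Arg1:T{head} Arg2:T{tail}")
    return "\n".join(lines) + "\n"


def save_pair(out_dir, stem, text, ann):
    """Write <stem>.txt and <stem>.ann side by side, as Brat expects."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{stem}.txt").write_text(text, encoding="utf-8")
    (out / f"{stem}.ann").write_text(ann, encoding="utf-8")


# Hypothetical example; the labels are illustrative only.
text = "Pt/C shows a half-wave potential of 0.9 V."
entities = [("Material", 0, 4), ("Property", 13, 32)]
relations = [("HasProperty", 1, 2)]
ann = to_brat(text, entities, relations)
print(ann, end="")
```

Calling `save_pair("corpus", "article_0001", text, ann)` would then write the paired `.txt` and `.ann` files; the directory and file stem here are placeholders.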