FC-CoMIcs (Fuel Cell Corpus for Materials Informatics)

Published: 11 April 2025 | Version 1 | DOI: 10.17632/myxyfnmxzw.1
Contributors:
Hein Htet

Description

This corpus provides model-generated annotations automatically extracted from approximately 1,000 scientific articles on Oxygen Reduction Reaction (ORR) catalysts for fuel cells. The annotations were generated with our trained Named Entity Recognition (NER) and Relation Extraction (RE) model. FC-CoMIcs is intended to support Materials Informatics (MI) studies, particularly on polymer electrolyte fuel cells. The dataset includes a text file listing the DOIs of the annotated articles, along with the corresponding Brat annotation files (.ann), which contain the extracted entities and relationships. The annotations follow a standard annotation scheme so that they can be used in both Natural Language Processing (NLP) and fuel cell research.
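For readers who want to load the .ann files, the following is a minimal Python sketch of a reader for Brat standoff annotations. It assumes the files use the usual Brat layout: tab-separated text-bound lines starting with "T" and binary relation lines starting with "R". The labels Material, Property and HasProperty and the sample sentence are hypothetical placeholders, not the corpus's actual annotation scheme.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    id: str
    type: str
    spans: list  # list of (start, end) character offsets
    text: str


@dataclass
class Relation:
    id: str
    type: str
    arg1: str  # id of the first argument entity
    arg2: str  # id of the second argument entity


def parse_ann(content):
    """Parse Brat standoff text into (entities by id, list of relations)."""
    entities, relations = {}, []
    for line in content.splitlines():
        if not line.strip():
            continue
        if line.startswith("T"):
            # Text-bound annotation: ID <TAB> TYPE START END <TAB> TEXT
            ann_id, header, text = line.split("\t", 2)
            etype, offsets = header.split(" ", 1)
            # Discontinuous mentions separate fragments with ';'
            spans = [tuple(int(x) for x in frag.split())
                     for frag in offsets.split(";")]
            entities[ann_id] = Entity(ann_id, etype, spans, text)
        elif line.startswith("R"):
            # Relation: ID <TAB> TYPE Arg1:Tx Arg2:Ty
            ann_id, body = line.split("\t", 1)
            rtype, a1, a2 = body.split()[:3]
            relations.append(Relation(ann_id, rtype,
                                      a1.split(":", 1)[1],
                                      a2.split(":", 1)[1]))
        # Other annotation kinds (events, attributes, notes) are skipped.
    return entities, relations


# Hypothetical sample, not taken from the corpus.
SAMPLE_TXT = "Pt/C shows a half-wave potential of 0.9 V"
SAMPLE_ANN = (
    "T1\tMaterial 0 4\tPt/C\n"
    "T2\tProperty 13 32\thalf-wave potential\n"
    "R1\tHasProperty Arg1:T1 Arg2:T2\n"
)

entities, relations = parse_ann(SAMPLE_ANN)
```

Each entity's offsets index into the matching .txt file, so `SAMPLE_TXT[13:32]` recovers the mention text of `T2`.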

Steps to reproduce

The dataset can be reproduced from raw scientific literature as follows:
1. Data Collection: Collect the research papers on ORR catalysts for fuel cells listed in the provided DOI list.
2. Data Preprocessing: Convert the PDF, XML or HTML articles into plain text.
3. Data Modeling: Train an NER + RE model to recognize entity types and relationship types.
4. Data Extraction: Apply the trained model to the preprocessed text to automatically extract entity mentions and relationships.
5. Dataset Packaging: Save the extracted entities and relationships in Brat annotation format (.ann) alongside the corresponding text files (.txt).
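The packaging step can be illustrated with a short Python sketch. It is not the authors' pipeline: it assumes, purely for illustration, that the model output is available as (type, start, end) tuples for entities and (type, head_index, tail_index) tuples for relations, and that mentions do not span line breaks. The function names, the document ID and the labels are hypothetical.

```python
from pathlib import Path


def to_brat(text, entities, relations):
    """Serialize extracted annotations as Brat standoff lines.

    entities:  list of (type, start, end) character-offset tuples
    relations: list of (type, head_index, tail_index), where the indices
               are 0-based positions in the entities list
    """
    lines = []
    for i, (etype, start, end) in enumerate(entities, start=1):
        mention = text[start:end]
        lines.append(f"T{i}\t{etype} {start} {end}\t{mention}")
    for j, (rtype, head, tail) in enumerate(relations, start=1):
        # Brat IDs are 1-based, so shift the 0-based entity indices.
        lines.append(f"R{j}\t{rtype} Arg1:T{head + 1} Arg2:T{tail + 1}")
    return "".join(line + "\n" for line in lines)


def save_document(out_dir, doc_id, text, entities, relations):
    """Write <doc_id>.txt and <doc_id>.ann side by side in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{doc_id}.txt").write_text(text, encoding="utf-8")
    (out / f"{doc_id}.ann").write_text(
        to_brat(text, entities, relations), encoding="utf-8"
    )
```

A call such as `save_document("corpus", "doc0001", text, entities, relations)` would then produce `corpus/doc0001.txt` and `corpus/doc0001.ann`, the pairing that Brat expects.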

Institutions

Nagoya Daigaku, Toyota Kogyo Daigaku

Categories

Fuel Cell, Information Extraction, Informatics, Materials Application, Oxygen Reduction Reaction

Licence