FC-CoMIcs (Fuel Cell-Corpus for Materials Informatics)
Description
We provide automated knowledge extractions from scientific literature to support data-driven materials discovery. This corpus, FC-CoMIcs, contains semantic annotations automatically extracted from approximately 1,000 research articles on Oxygen Reduction Reaction (ORR) catalysts for fuel cells. The data was generated with a DyGIE++ model fine-tuned on MatSciBERT and trained to recognize complex materials, properties, and relationships in the fuel cell domain. FC-CoMIcs contributes to the advancement of Materials Informatics (MI), particularly for polymer electrolyte fuel cells, by providing structured data ready for Knowledge Graph construction and statistical analysis.
The dataset consists of:
1. DOI Index: A mapping file linking our internal IDs to the original article Digital Object Identifiers (DOIs).
2. Structured JSON Extractions: Machine-generated files containing recognized entities (e.g., catalysts, precursors, operating conditions) and their semantic relationships.
3. Interactive Knowledge Maps: Visual representations of the extracted relations for each article, generated with the Pyvis library for rapid human exploration.
Note on Accessibility: To comply with publisher copyright and funding guidelines, the original full-text articles are not redistributed. Researchers can use the provided DOIs to link these high-fidelity semantic extractions back to the source literature.
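The structured JSON extractions above can be consumed directly for Knowledge Graph construction. The sketch below shows one way to read a single extraction into semantic triples; note that the field names ("entities", "relations", "head", "tail") and the example values are illustrative assumptions for this sketch, not the published schema, and should be adjusted to match the actual files.

```python
import json

# Hypothetical FC-CoMIcs extraction record; field names are assumed,
# not the official schema.
sample = json.loads("""
{
  "doi_id": "FC-0001",
  "entities": [
    {"text": "Pt/C", "type": "Catalyst"},
    {"text": "0.9 V", "type": "Value"}
  ],
  "relations": [
    {"head": "Pt/C", "type": "has_property", "tail": "0.9 V"}
  ]
}
""")

# Collect (head, relation, tail) triples for downstream graph building
# or statistical analysis.
triples = [(r["head"], r["type"], r["tail"]) for r in sample["relations"]]
print(triples)  # [('Pt/C', 'has_property', '0.9 V')]
```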
Files
Steps to reproduce
1. Literature Selection & Corpus Construction: A set of approximately 1,000 scientific articles related to Oxygen Reduction Reaction (ORR) catalysts and polymer electrolyte fuel cells was identified. The articles were indexed using their Digital Object Identifiers (DOIs) to create a persistent reference list (FC-CoMIcs-Article1000_DOI-list.txt).
2. Model Selection & Architecture: We employed the DyGIE++ framework, a multi-task learning model designed for entity recognition and relation extraction. The model utilized MatSciBERT as the underlying transformer encoder, which was pre-trained on a large-scale corpus of materials science literature to ensure domain-specific semantic understanding.
3. Model Training: The model was fine-tuned using a high-quality, expert-annotated corpus developed within our laboratory (funded by NEDO). This training set included complex hierarchical labels for catalysts, precursors, electrolytes, and their associated electrochemical properties.
4. Automated Extraction: The best-performing model iteration was applied to the 1,000 unlabelled articles. The model performed joint Named Entity Recognition (NER) and Relation Extraction (RE), identifying materials, conditions, and numerical values, and establishing the semantic links between them.
5. Data Formatting & Knowledge Mapping:
   - JSON Generation: The raw model outputs were converted into a structured JSON format. To ensure data privacy and copyright compliance, text indices and raw snippets were excluded, focusing purely on the extracted semantic triples.
   - Visualization: For each JSON file, an interactive knowledge graph was generated using the Pyvis library. Nodes were color-coded by entity type (e.g., Catalyst, Property, Value) to facilitate rapid visual exploration of the material-property relationships.
6. Validation: The resulting corpus was checked for structural integrity and alignment with the DOI index to ensure each JSON extraction accurately corresponds to the intended publication.
Institutions
- Nagoya University, Nagoya, Aichi
- Toyota Technological Institute, Nagoya, Aichi
Categories
Funders
- New Energy and Industrial Technology Development Organization, Kawasaki (Grant IDs: JPNP20003 and JPNP25002)