Large-Scale Dataset of Porous Carbon Materials for Supercapacitors Extracted via Large Language Models
Description
This dataset is a large-scale collection of experimental data on porous carbon supercapacitors, extracted from approximately 1,000 high-content research articles using an automated Large Language Model (LLM) mining framework. The database covers over 5,000 distinct carbon samples and includes more than 10,000 specific capacitance data points measured under varying testing conditions. It integrates synthesis parameters (preparation processes), microstructural characteristics (pore size distribution, surface area), and surface elemental composition, together with key electrochemical performance metrics: specific capacitance, energy density, power density, and equivalent series resistance (ESR). The dataset is intended as a resource for data-driven materials science, supporting quantitative analysis of structure-performance correlations and the inverse design of high-performance energy storage materials.
Note: For further modeling use, detailed data cleaning may be required for specific domains.
The ml_use_data folder contains two datasets: number_metric_dataset_cleaned.csv (9,962 entries), which serves as the cleaned feature pool, and train_test_dataset.csv (284 entries), a subset used for model training and testing that was obtained by keeping only the entries whose features are all non-null.
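The relationship between the cleaned feature pool and the smaller training subset can be illustrated with a minimal sketch of complete-case filtering. This is not the authors' code: the column names and the inline sample rows below are hypothetical stand-ins, and the real CSV headers and selection rules may differ.

```python
import csv
import io

# Hypothetical feature columns; the actual headers in
# number_metric_dataset_cleaned.csv may be named differently.
FEATURES = ["ssa_m2_g", "pore_volume_cm3_g", "capacitance_F_g"]

# Tiny made-up sample standing in for the feature pool.
SAMPLE_CSV = """ssa_m2_g,pore_volume_cm3_g,capacitance_F_g
1500,0.85,210
2100,,305
,1.10,180
980,0.52,145
"""


def complete_cases(rows, features):
    """Keep only rows in which every listed feature has a non-empty value."""
    return [
        row for row in rows
        if all((row.get(name) or "").strip() for name in features)
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
subset = complete_cases(rows, FEATURES)
print(len(rows), "->", len(subset))
```

To apply the same idea to the real file, the `io.StringIO(SAMPLE_CSV)` argument would be replaced by an opened file handle and `FEATURES` by the columns actually used for modeling.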
Steps to reproduce
Literature Collection: Research articles on porous carbon materials for supercapacitors were retrieved from major scientific databases (e.g., ScienceDirect) and filtered for experimental studies.
Automated Extraction via LLM: A custom-built multi-agent workflow based on Large Language Models (LLMs) was deployed. The workflow used a "Check-Fix-Recheck" self-correction mechanism to extract structured data on synthesis parameters, microstructural properties (specific surface area (SSA), pore volume), and electrochemical performance from unstructured text.
Data Cleaning & Validation: The raw extracted data were cleaned using logical consistency checks (e.g., verifying physical constraints on pore volume) and outlier detection with the Interquartile Range (IQR) method to remove erroneous values.
Standardization: All units were normalized to standard metric units (e.g., specific capacitance to F g⁻¹, pore diameter to nm) to ensure consistency across the dataset. Finally, entries with critical missing features were filtered out to maintain data quality for machine learning tasks.
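The cleaning and standardization steps can be sketched as follows. This is an illustrative sketch only, not the authors' pipeline: the 1.5 × IQR fence is the conventional default and is assumed here, the rule that micropore volume cannot exceed total pore volume is one assumed example of a physical constraint, and all function names and sample values are hypothetical.

```python
def quartile(sorted_vals, q):
    """Linearly interpolated quantile (q in [0, 1]) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences; k=1.5 is an assumed conventional choice."""
    s = sorted(values)
    q1, q3 = quartile(s, 0.25), quartile(s, 0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def pore_volume_consistent(total_cm3_g, micro_cm3_g):
    """Assumed constraint: micropore volume lies between 0 and total volume."""
    return 0.0 <= micro_cm3_g <= total_cm3_g


def to_nm(value, unit):
    """Convert a pore diameter to nanometres from a few common units."""
    factors = {"nm": 1.0, "angstrom": 0.1, "um": 1000.0}
    return value * factors[unit]


# Made-up specific capacitance values (F/g) with one implausible entry.
capacitance = [120.0, 150.0, 180.0, 210.0, 240.0, 5000.0]
low, high = iqr_bounds(capacitance)
kept = [c for c in capacitance if low <= c <= high]
print(f"fences: [{low}, {high}], kept {len(kept)} of {len(capacitance)}")
```

In this toy example the fences come out at 45 and 345 F g⁻¹, so the 5000 F g⁻¹ entry is dropped while the other five values are retained.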
Institutions
- China University of Mining and Technology - Beijing Campus, Beijing
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, Haidian District