Dataset Automatic scoring on Mole calculation

Published: 13 March 2024 | Version 1 | DOI: 10.17632/z2nsknmksd.1


This training dataset targets one class of chemical calculation: finding the mass of a compound from its number of moles and its molar mass (mass = number of moles × molar mass). It is intended for training deep learning NLP models to interpret and carry out such calculations when they are posed as text.

The dataset is structured to support transformer-based models such as BERT and GPT. It comprises calculation questions, the numerical values for moles and molar mass, and the expected answers in units of mass. Each entry is annotated with the calculation category, the formula used, and a step-by-step explanation. The data are provided in JSON Lines (jsonl) format, one entry per line, which supports both batch processing and analysis of individual items.

Beyond the calculation itself, the dataset can be used for natural language understanding in a chemistry context: extracting numerical and contextual information from a question and generating textual answers that a human reader can follow.

After training, models are evaluated against a separate test dataset to check that they comprehend the questions, extract the relevant data accurately, and produce precise numerical answers. Performance is measured by accuracy, precision, and recall for question understanding, together with the numerical accuracy of the answers. The dataset is intended to support research and development of NLP models that apply chemical knowledge to quantitative problems, with applications in chemical education and research.
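The calculation and the numerical-accuracy check described above can be illustrated with a minimal Python sketch. The field names (`question`, `moles`, `molar_mass`, `answer`), the two sample records, and the relative tolerance are assumptions made for illustration; the dataset's actual schema and the authors' scoring procedure are not specified here.

```python
import json
import math

# Hypothetical JSON Lines sample: field names and values are illustrative
# assumptions, not the dataset's documented schema.
SAMPLE_JSONL = """\
{"question": "What is the mass of 2.0 mol of water (molar mass 18.015 g/mol)?", "moles": 2.0, "molar_mass": 18.015, "answer": 36.03}
{"question": "What is the mass of 0.5 mol of NaCl (molar mass 58.44 g/mol)?", "moles": 0.5, "molar_mass": 58.44, "answer": 29.22}
"""


def mass_in_grams(moles, molar_mass):
    """Mass (g) = number of moles (mol) x molar mass (g/mol)."""
    return moles * molar_mass


def load_records(text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def numerical_accuracy(predictions, records, rel_tol=1e-3):
    """Fraction of predictions within rel_tol of the expected answer.

    The tolerance is an assumed choice, not one taken from the dataset.
    """
    if not records:
        raise ValueError("no records to score")
    correct = sum(
        math.isclose(pred, rec["answer"], rel_tol=rel_tol)
        for pred, rec in zip(predictions, records)
    )
    return correct / len(records)


records = load_records(SAMPLE_JSONL)
# Stand-in for model output: apply the formula directly to the given values.
predictions = [mass_in_grams(r["moles"], r["molar_mass"]) for r in records]
print(numerical_accuracy(predictions, records))
```

A tolerance-based comparison is used rather than exact equality because floating-point products and rounded reference answers rarely match digit for digit.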


Steps to reproduce

We developed the chemical calculation dataset for NLP model training with two aims: comprehensive coverage and high accuracy. The project began with a definition of objectives, which were to cover a broad spectrum of chemical calculations, from basic stoichiometry to more complex reaction yield problems. Data were drawn from academic textbooks, specialized chemical education websites, and peer-reviewed scientific journals to represent a wide range of calculation types and complexities.

A standardized data collection protocol specified how chemical problems, their numerical data, and their solutions were to be extracted, so that entries are uniform across the dataset. Text extraction software was used to retrieve data from digital and scanned sources, and online chemical equation balancing tools were used to check the correctness of the reactions involved. All collected data were structured in JSON Lines format for processing and integration into machine learning workflows, and JSON editors were used to keep the dataset organized and readable.

A team of chemistry experts then manually reviewed each problem and solution for correctness. The data were also annotated with additional metadata, such as the difficulty level and the specific chemical concepts addressed, to support targeted NLP applications.

In the final phase, validation and testing, preliminary NLP models were trained on subsets of the dataset to assess how well it supports model training. Feedback from these tests informed further refinements to the dataset.
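As a complement to the manual expert review described above, an automated consistency pass over the JSON Lines file could flag entries for closer inspection. The sketch below is an assumption, not the authors' procedure: the required field names, the tolerance, and the three sample lines are hypothetical.

```python
import json
import math

# Hypothetical field names; the dataset's real schema may differ.
REQUIRED_FIELDS = ("question", "moles", "molar_mass", "answer")


def check_line(line, rel_tol=1e-3):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(record, dict):
        return ["not a JSON object"]
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        return ["missing fields: " + ", ".join(missing)]
    values = [record[f] for f in ("moles", "molar_mass", "answer")]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return ["non-numeric value"]
    moles, molar_mass, answer = values
    problems = []
    if moles <= 0 or molar_mass <= 0:
        problems.append("moles and molar mass must be positive")
    if not math.isclose(moles * molar_mass, answer, rel_tol=rel_tol):
        problems.append("answer does not equal moles * molar_mass")
    return problems


def review_queue(lines):
    """Map 1-based line numbers to their problems, skipping clean lines."""
    queue = {}
    for number, line in enumerate(lines, start=1):
        problems = check_line(line)
        if problems:
            queue[number] = problems
    return queue


# Illustrative input: one consistent entry, one wrong answer, one truncated line.
sample_lines = [
    '{"question": "q1", "moles": 2.0, "molar_mass": 18.015, "answer": 36.03}',
    '{"question": "q2", "moles": 0.5, "molar_mass": 58.44, "answer": 30.0}',
    '{"question": "q3", "moles": 1.0',
]
for number, problems in review_queue(sample_lines).items():
    print(number, problems)
```

A check of this kind only catches arithmetic and formatting inconsistencies; whether a question is chemically sound still requires expert review.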


Universitas Negeri Semarang, Universitas Sebelas Maret


Chemistry, Molecular Mechanics with Molecular Dynamics, Natural Language Processing, Machine Learning