AIMER: A reproducible AI-assisted protocol for converting sustainability reports into ESG evidence and decision models

Published: 26 May 2026 | Version 1 | DOI: 10.17632/vmyfg2zd2y.1
Contributor:
Nophea Sasaki

Description

AIMER (AI-assisted Method for ESG Report Interpretation and Reproducible analysis) is a reproducible research protocol for converting publicly accessible sustainability reports and corporate ESG disclosures into structured ESG evidence, coded datasets, and decision-oriented analytical outputs. The dataset supports the accompanying MethodsX manuscript by documenting the methodological workflow used for document acquisition, preprocessing, AI-assisted extraction, coding, validation, and evidence synthesis across corporate sustainability disclosures. The study uses publicly available sustainability reports, ESG reports, annual reports, and related corporate disclosures published online by reporting organizations. The dataset does not redistribute the original corporate reports because copyright remains with the respective publishers. Instead, the repository focuses on reproducible research outputs, including coded variables, extraction protocols, methodological documentation, source inventories, metadata fields, and validation structures necessary to support transparency and replication. Where available, the dataset records company names, report years, report titles, public URLs, and access dates to facilitate verification and reproducibility. AI-assisted tools, Python-based processing workflows, and human-reviewed coding procedures were used to support document interpretation and structured evidence extraction. Final analytical decisions, validation, and manuscript preparation were conducted under human supervision. This repository is intended for academic research, methodological transparency, ESG evidence synthesis, and reproducible sustainability analytics.
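The source-inventory fields named above (company name, report year, report title, public URL, access date) could be represented roughly as follows. This is a minimal sketch, not the deposited schema: the class name `SourceRecord`, the field names, and the `problems` checks are all hypothetical, and the README and codebook in the package remain the authority on the actual column names.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class SourceRecord:
    """One source-inventory entry. Field names are illustrative assumptions."""

    company: str
    report_year: int
    report_title: str
    public_url: Optional[str] = None    # recorded "where available"
    access_date: Optional[date] = None  # date the URL was last checked

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means none were found."""
        issues = []
        if not self.company.strip():
            issues.append("company is empty")
        if self.public_url is not None:
            parsed = urlparse(self.public_url)
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                issues.append("public_url is not an http(s) URL")
            if self.access_date is None:
                issues.append("access_date is missing for a record with a URL")
        return issues


# Hypothetical example entry (not taken from the dataset).
record = SourceRecord(
    company="Example Corp",
    report_year=2025,
    report_title="Sustainability Report 2025",
    public_url="https://example.com/report-2025.pdf",
    access_date=date(2026, 1, 15),
)
print(record.problems())
```

Keeping the URL and access date optional mirrors the "where available" wording, while still flagging a URL that was recorded without the date it was accessed.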

Steps to reproduce

1. Download the full Mendeley Data package and open README.md first. The README explains how the deposited files relate to the MethodsX manuscript and identifies the evidence matrix, extraction-quality log, source inventory, codebook, correction report, and workflow scripts.

2. Install Python 3.8 or later. From the folder containing the downloaded package, install the required Python libraries:

    pip install -r workflow/requirements.txt

3. Validate the deposited corrected dataset by running:

    python workflow/validation_checks.py

   The expected validation results are: 15,236 evidence-matrix rows, 100 extraction-quality records, 100 successful report records, 0 OCR-needed/problem records, and 358 BCP 2025 evidence rows. These checks verify file structure and expected corrected counts; they do not replace manual claim-by-claim verification against original report pages.

4. To inspect or regenerate summary tables from the deposited evidence matrix, run:

    python workflow/evidence_mapping.py

   This produces construct totals, construct-by-year distributions, and company-year claim summaries from the corrected evidence matrix.

5. To create cleaned copies of the deposited CSV files with normalized whitespace and basic numeric typing, run:

    python workflow/data_cleaning.py --write

6. To re-run the extraction workflow from source PDFs, obtain lawful access to the original public corporate sustainability reports, ESG reports, annual reports, or disclosures. The original company reports are not redistributed in this dataset because copyright remains with the original publishers. Place the 2024 and 2025 PDFs in local folders, then run:

    python workflow/extraction_pipeline.py --input-dir-2024 "PATH_TO_2024_REPORT_PDFS" --input-dir-2025 "PATH_TO_2025_REPORT_PDFS" --output-dir derived_extraction

   This step is optional for reviewers who only wish to validate the deposited corrected dataset. It is required only for researchers who want to reproduce the extraction stage from original source documents.

7. Use AIMER_BCP_2025_SOURCE_CORRECTION_REPORT_v1.md to interpret the BCP 2025 correction. The final deposited dataset reflects the readable BCP 2025 source correction and should be treated as the corrected dataset supporting the manuscript.
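The count checks described in step 3 can be illustrated with a short sketch. It is not the contents of workflow/validation_checks.py. The expected values come from step 3, but the column names (`status`, `company`, `year`), the status value `"success"`, and the rule that any non-success status counts as an OCR-needed/problem record are assumptions made for illustration.

```python
import csv
from typing import Callable, Dict, List

Row = Dict[str, str]

# Expected values as stated in step 3 of the reproduction instructions.
EXPECTED = {
    "evidence_rows": 15236,
    "quality_records": 100,
    "successful_reports": 100,
    "problem_reports": 0,
    "bcp_2025_rows": 358,
}


def read_rows(path: str) -> List[Row]:
    """Load a CSV file into a list of dicts keyed by header name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def count_where(rows: List[Row], predicate: Callable[[Row], bool]) -> int:
    """Count the rows for which the predicate holds."""
    return sum(1 for row in rows if predicate(row))


def observed_counts(evidence: List[Row], quality: List[Row]) -> Dict[str, int]:
    """Compute the five counts; column names here are hypothetical."""
    return {
        "evidence_rows": len(evidence),
        "quality_records": len(quality),
        "successful_reports": count_where(
            quality, lambda r: r["status"] == "success"),
        # Assumption: every non-success status is an OCR-needed/problem record.
        "problem_reports": count_where(
            quality, lambda r: r["status"] != "success"),
        "bcp_2025_rows": count_where(
            evidence, lambda r: r["company"] == "BCP" and r["year"] == "2025"),
    }


def compare(observed: Dict[str, int], expected: Dict[str, int]) -> List[str]:
    """Return one message per mismatching count; an empty list means a pass."""
    return [
        f"{key}: expected {expected[key]}, got {observed.get(key)}"
        for key in expected
        if observed.get(key) != expected[key]
    ]
```

A caller would load the two deposited CSV files with `read_rows`, pass them to `observed_counts`, and report whatever `compare(observed, EXPECTED)` returns. As step 3 notes, matching counts confirm file structure only and do not verify individual claims against the source reports.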
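The cleaning described in step 5 (normalized whitespace and basic numeric typing) might look like the sketch below. It is an illustration under stated assumptions rather than the logic of workflow/data_cleaning.py: the function names are hypothetical, and the typing rule used here (plain integers and plain decimals only, with leading-zero strings left as text) is one conservative choice among several.

```python
import csv
import io
import re
from typing import Dict, List, Optional, Union

Value = Union[str, int, float]

_WHITESPACE = re.compile(r"\s+")
_INTEGER = re.compile(r"-?(0|[1-9]\d*)")  # rejects "007" so codes stay text
_DECIMAL = re.compile(r"-?\d+\.\d+")


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace to one space and trim both ends."""
    return _WHITESPACE.sub(" ", text).strip()


def coerce_number(text: str) -> Value:
    """Convert plain integers and decimals; leave everything else as text."""
    if _INTEGER.fullmatch(text):
        return int(text)
    if _DECIMAL.fullmatch(text):
        return float(text)
    return text


def clean_row(row: Dict[Optional[str], Optional[str]]) -> Dict[str, Value]:
    """Clean one DictReader row."""
    cleaned: Dict[str, Value] = {}
    for key, value in row.items():
        if key is None:
            # Extra unnamed cells from a malformed row are skipped here.
            continue
        cleaned[normalize_whitespace(key)] = coerce_number(
            normalize_whitespace(value or ""))
    return cleaned


def clean_rows(csv_text: str) -> List[Dict[str, Value]]:
    """Parse CSV text and return cleaned rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [clean_row(row) for row in reader]


# Hypothetical sample, not taken from the deposited files.
sample = "company , year,score,code\n  Example   Corp ,2025, 3.50 ,007\n"
print(clean_rows(sample))
```

Using strict patterns instead of calling `int()` or `float()` directly avoids silently converting strings such as "nan", "inf", or "1_000", which Python's built-in parsers accept.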

Categories

Sustainability, Environment Variable

Licence