FORM-TRACE: a reproducible workflow for converting forest-management documents into auditable transition indicators
Description
This dataset provides the reproducibility package for FORM-TRACE, a workflow developed to convert heterogeneous forest-management documents into auditable transition indicators. It supports the MethodsX method article “FORM-TRACE: a reproducible workflow for converting forest-management documents into auditable transition indicators.” The package includes the transition definition, corpus manifest, extraction log, keyword-domain matrix, document-level domain scores, year-level and period-level score tables, run report, QA checklist, generated figures, and reproducibility documentation for the Harvard Forest / New England worked example covering 1908–2026. The workflow inspected 257 PDF files, identified 215 analytical forest-management documents, successfully extracted and scored 201 documents, flagged 14 failed or OCR-needed documents, and separated 42 methods-only references. FORM-TRACE uses deterministic keyword-domain scoring to generate normalized document and period-level signals for production and silviculture, ecological structure, carbon and ecosystem function, disturbance and climate risk, and governance implementation. AI-assisted tools were used only to support organization, methods-reference review, and manuscript preparation; final scoring was generated by deterministic scripts. Source PDFs and copyrighted source documents are not redistributed in this dataset. The manifest, metadata, logs, scoring outputs, and documentation are provided to support reproducibility with legally accessible source documents.
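The corpus counts reported above can be cross-checked with simple arithmetic. The sketch below assumes that the 257 inspected PDFs split into analytical documents plus methods-only references, and that the analytical documents split into scored plus failed or OCR-needed files; the description implies this partition but does not state it outright. The variable names and the `reconcile` helper are illustrative, not part of the package.

```python
# Counts as stated in the dataset description.
pdfs_inspected = 257
analytical_documents = 215
scored_documents = 201
failed_or_ocr_needed = 14
methods_only_references = 42


def reconcile(total, parts):
    """Return True when the parts sum exactly to the stated total."""
    return sum(parts) == total


# Assumed partition 1: inspected PDFs = analytical + methods-only.
corpus_ok = reconcile(pdfs_inspected, [analytical_documents, methods_only_references])

# Assumed partition 2: analytical = scored + failed/OCR-needed.
analytical_ok = reconcile(analytical_documents, [scored_documents, failed_or_ocr_needed])

print(corpus_ok, analytical_ok)  # True True
```

The same check can be repeated against the row counts in `CORPUS_MANIFEST.csv` and `EXTRACTION_LOG.csv` once the package is unzipped.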
Steps to reproduce
1. Download the FORM-TRACE dataset package and unzip the files into a local working folder.
2. Review `TRANSITION_DEFINITION.md` to understand the study scope, transition logic, historical period, geographic focus, and the five analytical domains: production and silviculture, ecological structure, carbon and ecosystem function, disturbance and climate risk, and governance implementation.
3. Open `CORPUS_MANIFEST.csv` to inspect the document inventory. This file records the analytical corpus, methods-only references, excluded/OCR-needed files, inferred metadata, document groups, and file-level notes.
4. Open `EXTRACTION_LOG.csv` to inspect text-extraction quality. This file records which documents were successfully extracted, which files failed, which required OCR, and the word/character counts used for scoring.
5. Open `KEYWORD_MATRIX.csv` to inspect the deterministic keyword-domain matrix. This file defines the terms and domain assignments used to identify documented signals for each FORM-TRACE domain.
6. Reproduce document-level scoring by applying the formula `S(d,k) = 1000 × N(d,k) / W(d)`, where `N(d,k)` is the number of keyword matches for domain `k` in document `d`, and `W(d)` is the document word count. The resulting scores are provided in `DOCUMENT_DOMAIN_SCORES.csv`.
7. Reproduce year-level and period-level aggregation using `DOMAIN_SCORES_BY_YEAR.csv` and `DOMAIN_SCORES_BY_PERIOD.csv`. Period-level values are arithmetic means of document-level normalized scores within each historical period. Periods with `n = 0` indicate no scored documents and should not be interpreted as zero domain signal.
8. Review `RUN_REPORT.md` to confirm the documented workflow results, including the number of PDFs inspected, analytical documents identified, documents successfully extracted and scored, failed/OCR-needed files, and methods-only references separated from scoring.
9. Review `QA_CHECKLIST.md` to confirm that no source PDFs are included, all core reproducibility files are present, deterministic scripts generated the scores, and AI was not used for final scoring.
10. Inspect the generated figures in the `figures/` folder. These figures were generated from the saved scoring and aggregation outputs and can be compared against the score tables.
11. To reproduce the workflow with the original document corpus, obtain the source documents legally from the original publishers, Harvard Forest, institutional repositories, or public URLs listed in the manifest where available. Source PDFs are not redistributed in this dataset because of copyright and licensing restrictions.
12. Re-run the workflow by using the same corpus manifest, keyword-domain matrix, scoring formula, and aggregation periods. Any new or substituted documents should be added to the manifest and extraction log before scoring so the audit trail remains complete.
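The scoring formula in step 6 and the period aggregation in step 7 can be sketched in a few lines of Python. This is a minimal illustration, not the package's own script. It assumes case-insensitive whole-word matching and a whitespace-based word count, whereas the package takes `W(d)` from `EXTRACTION_LOG.csv` and its matching rules from `KEYWORD_MATRIX.csv`. The function names, sample text, keywords, and period labels are all hypothetical.

```python
import re
from collections import defaultdict
from statistics import mean


def count_matches(text, keywords):
    """N(d,k): total case-insensitive whole-word matches of a domain's keywords."""
    total = 0
    for kw in keywords:
        pattern = r"\b" + re.escape(kw) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


def domain_score(text, keywords):
    """S(d,k) = 1000 * N(d,k) / W(d); returns None when the document has no words."""
    word_count = len(text.split())  # assumed tokenization: split on whitespace
    if word_count == 0:
        return None
    return 1000 * count_matches(text, keywords) / word_count


def period_means(doc_scores, periods):
    """Map each period to (mean score, n); periods with n = 0 get None, not 0."""
    grouped = defaultdict(list)
    for period, score in doc_scores:
        grouped[period].append(score)
    return {
        p: (mean(grouped[p]) if grouped[p] else None, len(grouped[p]))
        for p in periods
    }


# Hypothetical 10-word document with two keyword matches.
sample = "Selective harvest was planned before the second harvest in 1925."
print(domain_score(sample, ["harvest", "thinning"]))  # 200.0

# Hypothetical period labels; "B" has no scored documents.
print(period_means([("A", 100.0), ("A", 300.0)], ["A", "B"]))
# {'A': (200.0, 2), 'B': (None, 0)}
```

Returning `None` rather than `0` for an empty period keeps the distinction drawn in step 7 between "no scored documents" and "zero domain signal".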
Institutions
- Chulalongkorn University, Bangkok