Minimal reproducible data pack for PCOS environmental health inequality (1990–2023)
Description
This pack contains the minimal reproducible dataset and rebuild instructions supporting a study of environmental determinants and PCOS health inequality across 204 countries and territories (1990–2023; GBD outcomes through 2023). It includes filtered tables for representative countries (e.g., China, United States, India, Brazil, Nigeria), schema definitions, table metadata, and a manifest with checksums. A script is provided to rebuild a compact SQLite database from the CSV files. The original source data (e.g., GBD, WHO, World Bank, ERA5) are not redistributed and must be obtained from the providers under their respective terms; acquisition and preprocessing steps are documented in the repository. Main archive: MIN_PACK_20251029.zip (SHA256: c7bffe6b0c089e636f4e424e7267b40106a1545a15526b2ea2f41944a09fc11f; size: 540,683,840 bytes). See README_MIN_PACK.txt and MANIFEST.json inside the archive.
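A minimal sketch of how the downloaded archive could be checked against the SHA256 digest and byte size quoted above. The function names `sha256_of` and `verify_archive` are illustrative and are not part of the pack; the archive is assumed to sit in the current working directory.

```python
import hashlib
from pathlib import Path

# Values quoted in the description above.
EXPECTED_SHA256 = "c7bffe6b0c089e636f4e424e7267b40106a1545a15526b2ea2f41944a09fc11f"
EXPECTED_SIZE = 540_683_840  # bytes


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so the whole archive is never held in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_sha256=EXPECTED_SHA256, expected_size=EXPECTED_SIZE):
    """Return True only if both the byte size and the SHA256 digest match."""
    path = Path(path)
    # The size check is cheap, so do it before hashing.
    if path.stat().st_size != expected_size:
        return False
    return sha256_of(path) == expected_sha256.lower()


if __name__ == "__main__":
    archive = Path("MIN_PACK_20251029.zip")  # assumed location: working directory
    if archive.exists():
        print("archive OK" if verify_archive(archive) else "archive MISMATCH")
    else:
        print(f"{archive} not found")
```

Checking the size first avoids hashing a truncated download; a mismatch in either value suggests the file should be fetched again.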
Steps to reproduce
Overview
Minimally reproducible data pack for analyses of environmental determinants and PCOS health inequality across 204 countries/territories. Temporal coverage: 1990–2023 (GBD outcomes through 2023). Pack includes filtered CSV.gz tables (representative countries), schema.sql, tables_list.json, MANIFEST.json (checksums), README_MIN_PACK.txt, and a rebuild script.

Data sources (obtain from providers under their terms)
- GBD outcomes (IHME)
- WHO Global Health Observatory
- World Bank WDI (incl. PPP/GDP)
- ERA5 climate (ECMWF)
- Additional environmental indicators (e.g., PM2.5, CO₂ per‑capita)
Only derived/filtered tables are redistributed here; raw sources are not included.

Environment
- Python 3.10+
- Dependencies: PyYAML (≥6.0); Pandas optional
- OS agnostic (tested on Windows 10/11)

Workflow to produce this pack
- Acquire and clean original datasets from providers; standardize country/year columns, units, and formats (per repository docs).
- Integrate into a single SQLite DB (local, not redistributed due to size).
- Enforce per‑source coverage via config/data_sources_coverage.yaml (analysis window 1990–2023; GBD through 2023).
- Extract minimal pack using tools/extract_min_reproducible_pack.py with config/min_pack_config.yaml:
  - Countries: CHN, USA, IND, BRA, NGA (representative set).
  - Include tables: pcos_%, gbd_%, who_env, world_bank, world_bank_%, era5_climate, pm25_pollution, co2_emissions, ppp_gdp_data, sev_data%.
  - Exclude: sqlite_%, %backup%, %staging%.
  - Special handling:
    • era5_climate uses country (English name) + year.
    • ppp_gdp_data uses country_name (English) + year (country_code is not ISO3).
- Package and validate:
  - Archive: MIN_PACK_20251029.zip (size: 540,683,840 bytes).
  - SHA256(zip): c7bffe6b0c089e636f4e424e7267b40106a1545a15526b2ea2f41944a09fc11f.
  - Per‑file checksums in MANIFEST.json and MIN_PACK_20251029_CHECKSUMS.txt.

How to rebuild from this pack
- Unzip MIN_PACK_20251029.zip.
- Run rebuild_db_from_min_pack.py to create reconstructed_min.db from CSV.gz using schema.sql.
- Verify integrity using provided SHA256 checksums.

Notes
- Deterministic export; if a large table lacks filterable country/year columns, a fixed head sample is used (documented in MANIFEST.json).
- era5_climate and ppp_gdp_data filter by English country names + year.
- License: Data CC BY 4.0; Code MIT (see repository).
- Please cite the dataset DOI, the software repository, and original providers per their terms.
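The include and exclude lists in the workflow above look like SQL LIKE patterns, so the following sketch illustrates how such patterns might select tables. It is not the logic of tools/extract_min_reproducible_pack.py, which is not reproduced here. It assumes that `%` matches any run of characters, that `_` is treated literally (strict SQL LIKE would treat it as a single-character wildcard), and that matching is case-insensitive as in SQLite's default LIKE. The table names in the usage example are hypothetical.

```python
import re

# Patterns copied from the workflow description.
INCLUDE = [
    "pcos_%", "gbd_%", "who_env", "world_bank", "world_bank_%",
    "era5_climate", "pm25_pollution", "co2_emissions", "ppp_gdp_data", "sev_data%",
]
EXCLUDE = ["sqlite_%", "%backup%", "%staging%"]


def like_to_regex(pattern):
    """Translate a LIKE-style pattern to a regex.

    Assumption: only '%' is a wildcard; '_' is matched literally.
    """
    literal_parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(literal_parts), re.IGNORECASE)


def select_tables(names, include=INCLUDE, exclude=EXCLUDE):
    """Keep names matching any include pattern and no exclude pattern."""
    include_res = [like_to_regex(p) for p in include]
    exclude_res = [like_to_regex(p) for p in exclude]
    return [
        name for name in names
        if any(r.fullmatch(name) for r in include_res)
        and not any(r.fullmatch(name) for r in exclude_res)
    ]


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    candidates = [
        "pcos_burden", "gbd_pcos_backup", "world_bank", "world_bank_wdi",
        "sqlite_sequence", "era5_climate", "who_env_staging", "sev_data2021",
    ]
    print(select_tables(candidates))
```

Under these assumptions, exclusion takes precedence: a table such as the hypothetical `gbd_pcos_backup` matches `gbd_%` but is dropped by `%backup%`.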
Institutions
- Guangdong Pharmaceutical University