Minimal reproducible data pack for PCOS environmental health inequality (1990–2023)

Published: 30 October 2025 | Version 1 | DOI: 10.17632/z4cn493yjb.1
Contributor: Zhongfeng Shi

Description

This pack contains the minimal reproducible dataset and rebuild instructions supporting a study of environmental determinants and PCOS health inequality across 204 countries and territories (1990–2023; GBD outcomes through 2023). It includes filtered tables for representative countries (e.g., China, United States, India, Brazil, Nigeria), schema definitions, table metadata, and a manifest with checksums. A script is provided to rebuild a compact SQLite database from the CSV files. Original source data (e.g., GBD, WHO, World Bank, ERA5) are not redistributed and must be obtained from the providers under their respective terms; acquisition and preprocessing steps are documented in the repository. Main archive: MIN_PACK_20251029.zip (SHA256: c7bffe6b0c089e636f4e424e7267b40106a1545a15526b2ea2f41944a09fc11f; size: 540,683,840 bytes). See README_MIN_PACK.txt and MANIFEST.json inside the archive.
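The archive checksum and size quoted above can be checked before unzipping. The following is a minimal sketch using only the Python standard library; the helper names sha256_of_file and archive_matches are illustrative and are not part of the pack, and the archive is assumed to sit in the current working directory.

```python
import hashlib
from pathlib import Path

# Values quoted in the dataset description.
EXPECTED_SHA256 = "c7bffe6b0c089e636f4e424e7267b40106a1545a15526b2ea2f41944a09fc11f"
EXPECTED_SIZE = 540_683_840  # bytes


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_matches(path, expected_sha256=EXPECTED_SHA256, expected_size=EXPECTED_SIZE):
    """Check the cheap size test first, then the full checksum."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return sha256_of_file(path) == expected_sha256.lower()


if __name__ == "__main__":
    archive = Path("MIN_PACK_20251029.zip")  # assumed download location
    if archive.exists():
        print("archive OK" if archive_matches(archive) else "archive MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

Reading in chunks keeps memory use flat for an archive of this size (about 540 MB).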

Steps to reproduce

Overview

Minimally reproducible data pack for analyses of environmental determinants and PCOS health inequality across 204 countries/territories. Temporal coverage: 1990–2023 (GBD outcomes through 2023). The pack includes filtered CSV.gz tables (representative countries), schema.sql, tables_list.json, MANIFEST.json (checksums), README_MIN_PACK.txt, and a rebuild script.

Data sources (obtain from providers under their terms)

  • GBD outcomes (IHME)
  • WHO Global Health Observatory
  • World Bank WDI (incl. PPP/GDP)
  • ERA5 climate (ECMWF)
  • Additional environmental indicators (e.g., PM2.5, CO₂ per capita)

Only derived/filtered tables are redistributed here; raw sources are not included.

Environment

Python 3.10+. Dependencies: PyYAML (≥6.0); pandas is optional. OS-agnostic (tested on Windows 10/11).

Workflow to produce this pack

1. Acquire and clean the original datasets from the providers; standardize country/year columns, units, and formats (per the repository docs).
2. Integrate everything into a single SQLite DB (local; not redistributed due to size). Enforce per-source coverage via config/data_sources_coverage.yaml (analysis window 1990–2023; GBD through 2023).
3. Extract the minimal pack using tools/extract_min_reproducible_pack.py with config/min_pack_config.yaml:
  • Countries: CHN, USA, IND, BRA, NGA (representative set).
  • Include tables: pcos_%, gbd_%, who_env, world_bank, world_bank_%, era5_climate, pm25_pollution, co2_emissions, ppp_gdp_data, sev_data%.
  • Exclude: sqlite_%, %backup%, %staging%.
  • Special handling: era5_climate uses country (English name) + year; ppp_gdp_data uses country_name (English) + year (its country_code is not ISO3).
4. Package and validate:
  • Archive: MIN_PACK_20251029.zip (size: 540,683,840 bytes).
  • SHA256(zip): c7bffe6b0c089e636f4e424e7267b40106a1545a15526b2ea2f41944a09fc11f.
  • Per-file checksums are in MANIFEST.json and MIN_PACK_20251029_CHECKSUMS.txt.

How to rebuild from this pack

1. Unzip MIN_PACK_20251029.zip.
2. Run rebuild_db_from_min_pack.py to create reconstructed_min.db from the CSV.gz files using schema.sql.
3. Verify integrity using the provided SHA256 checksums.

Notes

  • Export is deterministic; if a large table lacks filterable country/year columns, a fixed head sample is used (documented in MANIFEST.json).
  • era5_climate and ppp_gdp_data are filtered by English country name + year.
  • License: data CC BY 4.0; code MIT (see repository).
  • Please cite the dataset DOI, the software repository, and the original providers per their terms.
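For readers who want to see what the rebuild step amounts to, the sketch below loads gzip-compressed CSV files into a SQLite database created from schema.sql. It is not the pack's rebuild_db_from_min_pack.py; it rests on assumptions not stated in this record: that each table is stored as <table>.csv.gz next to schema.sql, that each file starts with a header row, and that the header names match the columns declared in schema.sql. The function names rebuild and quote_ident are hypothetical.

```python
import csv
import gzip
import sqlite3
from pathlib import Path


def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def rebuild(pack_dir, db_path):
    """Create db_path from schema.sql and load every *.csv.gz in pack_dir.

    Assumed layout (hypothetical): <table>.csv.gz with a header row whose
    names match the table's columns in schema.sql.
    Returns a dict mapping table name to the number of rows inserted.
    """
    pack_dir = Path(pack_dir)
    conn = sqlite3.connect(db_path)
    try:
        # Create the empty tables first.
        conn.executescript((pack_dir / "schema.sql").read_text(encoding="utf-8"))
        counts = {}
        for csv_path in sorted(pack_dir.glob("*.csv.gz")):
            table = csv_path.name[: -len(".csv.gz")]
            with gzip.open(csv_path, "rt", encoding="utf-8", newline="") as handle:
                reader = csv.reader(handle)
                header = next(reader, None)
                if header is None:
                    continue  # empty file, nothing to load
                columns = ", ".join(quote_ident(col) for col in header)
                marks = ", ".join("?" for _ in header)
                sql = f"INSERT INTO {quote_ident(table)} ({columns}) VALUES ({marks})"
                # Values arrive as text; SQLite column affinity converts
                # numeric-looking text for INTEGER/REAL columns. Empty cells
                # are stored as empty strings, not NULL, in this sketch.
                cursor = conn.executemany(sql, reader)
                counts[table] = cursor.rowcount
        conn.commit()
        return counts
    finally:
        conn.close()
```

A call such as rebuild("MIN_PACK_20251029", "reconstructed_min.db") would then produce the database, provided the unzipped directory matches the assumed layout; the README_MIN_PACK.txt inside the archive remains the authoritative description of the actual layout.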

Institutions

  • Guangdong Pharmaceutical University

Categories

Environment and Health, Global Health

Licence