Home Health Agency Cost Report (U.S., FY2022) — Cleaned/Validated Data
Description
Cleaned and validated U.S. CMS Home Health Agency (HHA) Cost Report for FY2022. The package includes the cleaned CSV (201 columns; 10,564 rows), a tidy provider-year table of total operating expenses (USD), validation metrics with a 47-row issues sample, and the exact Python scripts and YAML config needed to reproduce them.

Contents:
- CostReporthha_Final_22_clean.csv
- labels.csv
- DATA_DICTIONARY.csv
- targets_long.csv
- VALIDATION/hha22_validation.json and VALIDATION/hha22_issues.csv
- PROCESSING/ (scripts + config)
- README.md
- LICENSE.txt
- optional figure

Quality summary: CCN validity 100%; state codes ~99.6%; date start/end parse 100%/100%; cost numeric coverage ~69.2%; rows 10,564; columns 201.

Provenance & ethics: The source is the CMS Provider Data Catalog, “Home Health Agency Cost Report” (file CostReporthha_Final_22.csv). No PHI/PII; provider IDs are institutional CCNs.

License: CC BY 4.0.

Reuse: Suitable for provider-year health-economics analyses, benchmarking cleaning/validation pipelines, and teaching reproducible curation.

Reproducibility: See README. Either (A) regenerate artifacts from the included cleaned CSV, or (B) optionally place the CMS raw CSV in this folder and run the clean → validate → artifacts scripts.
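The percentages in the quality summary (CCN validity, state codes, date parsing, cost numeric coverage) read as shares of rows that pass a given check. The sketch below illustrates that idea using only the Python standard library; it is not the included validation script. The column names (ccn, state, fy_start, fy_end, total_cost), the six-digit CCN rule, the MM/DD/YYYY date format, the tiny state list, and the sample rows are all assumptions made for illustration.

```python
import math
import re
from datetime import datetime

# Assumption: a CCN counts as valid when it is exactly six ASCII digits.
CCN_PATTERN = re.compile(r"[0-9]{6}")
# Assumption: a deliberately tiny state list, for illustration only.
VALID_STATES = {"CA", "NY", "TX"}


def share(values, predicate):
    """Fraction of values for which predicate is true (0.0 for empty input)."""
    values = list(values)
    if not values:
        return 0.0
    return sum(1 for v in values if predicate(v)) / len(values)


def is_date(text, fmt="%m/%d/%Y"):
    """True if text parses with the assumed date format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except (TypeError, ValueError):
        return False


def is_number(text):
    """True if text converts to a finite float."""
    try:
        return math.isfinite(float(text))
    except (TypeError, ValueError):
        return False


def quality_summary(rows):
    """Compute per-check pass rates over a list of row dicts."""
    return {
        "ccn_valid": share((r["ccn"] for r in rows),
                           lambda v: bool(CCN_PATTERN.fullmatch(v))),
        "state_valid": share((r["state"] for r in rows),
                             lambda v: v in VALID_STATES),
        "start_parsed": share((r["fy_start"] for r in rows), is_date),
        "end_parsed": share((r["fy_end"] for r in rows), is_date),
        "cost_numeric": share((r["total_cost"] for r in rows), is_number),
    }


# Made-up rows; none of these values come from the CMS file.
rows = [
    {"ccn": "000001", "state": "CA", "fy_start": "01/01/2022",
     "fy_end": "12/31/2022", "total_cost": "1250000.50"},
    {"ccn": "000002", "state": "NY", "fy_start": "07/01/2021",
     "fy_end": "06/30/2022", "total_cost": ""},
    {"ccn": "00003X", "state": "TX", "fy_start": "10/01/2021",
     "fy_end": "09/30/2022", "total_cost": "n/a"},
    {"ccn": "000004", "state": "ZZ", "fy_start": "01/01/2022",
     "fy_end": "2022-13-01", "total_cost": "98000"},
]

summary = quality_summary(rows)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

Under this reading, a cost numeric coverage of about 69.2% would mean that roughly 31% of the checked cost values are blank or non-numeric, so analyses of total operating expenses should expect missing values rather than a complete panel.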
Files
Steps to reproduce
Environment
- Python 3.9+
- Install once: pip install pandas numpy pyyaml matplotlib
- Run all commands from this dataset folder (the one with README.md).

Option A: Recreate artifacts from the CLEANED CSV included here (no raw file needed)
1) Validate:
   python PROCESSING/validate_cost_report.py --csv CostReporthha_Final_22_clean.csv --out VALIDATION/hha22_validation.json --issues VALIDATION/hha22_issues.csv --config PROCESSING/config/dq_healthcare_cost.yml
2) Generate tidy artifacts:
   python PROCESSING/make_dib_artifacts.py
3) (Optional) Figure:
   python PROCESSING/plot_histogram.py

Option B: Full rebuild starting from the RAW CMS file (optional)
1) Download CMS “CostReporthha_Final_22.csv” and save it in THIS folder (path: ./CostReporthha_Final_22.csv).
2) Clean → Validate → Artifacts:
   python PROCESSING/clean_cost_report.py --csv_in CostReporthha_Final_22.csv --csv_out CostReporthha_Final_22_clean.csv --config PROCESSING/config/dq_healthcare_cost.yml
   python PROCESSING/validate_cost_report.py --csv CostReporthha_Final_22_clean.csv --out VALIDATION/hha22_validation.json --issues VALIDATION/hha22_issues.csv --config PROCESSING/config/dq_healthcare_cost.yml
   python PROCESSING/make_dib_artifacts.py
3) (Optional) Figure:
   python PROCESSING/plot_histogram.py

Outputs
- VALIDATION/hha22_validation.json, VALIDATION/hha22_issues.csv
- labels.csv, targets_long.csv, DATA_DICTIONARY.csv
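For anyone who prefers to drive the steps above from one place, the following is a minimal sketch of a wrapper that assembles the same commands for Option A or Option B. The script paths and flags are copied from the commands listed above; the helper names build_commands and run_pipeline are hypothetical and are not part of the PROCESSING/ folder. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess
import sys

CLEAN_CSV = "CostReporthha_Final_22_clean.csv"
RAW_CSV = "CostReporthha_Final_22.csv"
CONFIG = "PROCESSING/config/dq_healthcare_cost.yml"


def build_commands(from_raw=False, with_figure=False):
    """Return the pipeline steps as argv lists, in execution order.

    from_raw=False mirrors Option A; from_raw=True mirrors Option B.
    """
    py = sys.executable or "python"  # use the current interpreter
    steps = []
    if from_raw:
        steps.append([py, "PROCESSING/clean_cost_report.py",
                      "--csv_in", RAW_CSV,
                      "--csv_out", CLEAN_CSV,
                      "--config", CONFIG])
    steps.append([py, "PROCESSING/validate_cost_report.py",
                  "--csv", CLEAN_CSV,
                  "--out", "VALIDATION/hha22_validation.json",
                  "--issues", "VALIDATION/hha22_issues.csv",
                  "--config", CONFIG])
    steps.append([py, "PROCESSING/make_dib_artifacts.py"])
    if with_figure:
        steps.append([py, "PROCESSING/plot_histogram.py"])
    return steps


def run_pipeline(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in steps:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)


# Dry run of Option A: prints the commands without executing them.
run_pipeline(build_commands(from_raw=False))
```

To actually execute the steps, call run_pipeline(build_commands(from_raw=True), dry_run=False) from the dataset folder after placing the raw CMS file there; stopping on the first failure avoids generating artifacts from a CSV that did not validate.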