Greenwashing Premium
Description
This dataset is the replication package for "The Greenwashing Premium: Satellite Evidence from US Heavy Emitters". It contains the programs, analysis-ready data, and instructions needed to reproduce every table and figure in the paper.
The paper introduces a Claim-Reality Divergence (CRD) index for 233 US publicly listed heavy-emitting firms, 2018–2023. The index triangulates three measures: (i) ESG language intensity from SEC 10-K filings and earnings call transcripts, (ii) facility-level methane emissions reported to the EPA Greenhouse Gas Reporting Program (GHGRP), and (iii) atmospheric methane column densities observed by the Sentinel-5P TROPOMI satellite via Google Earth Engine. The package documents two findings: a one-standard-deviation increase in ESG text intensity is associated with a 0.64–0.69 standard deviation increase in CRD, and the divergence is concentrated in the voluntary investor-communication layer rather than in regulatory filings.
The package contains: (1) two analysis-ready datasets in CSV format covering 1,937 firm-year observations, (2) a one-command reproduction script that regenerates all tables and figures in under 60 seconds, (3) the full data-construction pipeline from raw sources, (4) a README with replication instructions, and (5) a codebook documenting every variable.
Two underlying raw data sources are proprietary (Compustat North America via WRDS; earnings call transcripts) and cannot be redistributed, but the analysis-ready datasets allow full reproduction of the paper's results without proprietary access. All other data sources (SEC EDGAR, EPA GHGRP, EPA ECHO, ESA Sentinel-5P TROPOMI) are public.
Steps to reproduce
This package supports two reproduction tiers. Tier 1 reproduces every table and figure in the paper from the analysis-ready datasets included in the package. Tier 2 rebuilds those datasets from raw sources for replicators with proprietary data access.
Tier 1 (recommended, runs in approximately 60 seconds): After unzipping the package, set up a Python 3.11 environment and install dependencies with pip install -r requirements.txt. Then run python code/reproduce_paper.py. This regenerates Tables 1, 2, A1, and A2 (in regression_results.txt), Figure 1 (in figure1_crd_quintiles.png), and a machine-readable Table 1 (in table1_summary_stats.csv). All outputs are written to the results/ folder. No proprietary data access is required.
Tier 2 (optional, full pipeline, approximately 6 to 10 hours): Replicators with a WRDS Compustat subscription, an earnings-call transcripts corpus, a Google Earth Engine account, and EPA bulk downloads can rebuild the analysis-ready datasets from scratch. Set the required paths via environment variables (TRANSCRIPTS_DIR, FILINGS_DIR, GEE_PROJECT, WRDS_USER, and SEC_USER_AGENT), then run python run_all.py. The pipeline is resumable and skips steps with existing outputs.
Complete documentation, pinned dependency versions, the mapping from each output file to the corresponding table or figure in the paper, and a variable-by-variable codebook are included in README.md and docs/codebook.md inside the package.
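Because a Tier 2 run takes several hours, it may be worth confirming that the five environment variables are set before launching run_all.py. The following is a minimal sketch of such a preflight check and is not part of the package: the function name check_tier2_env is hypothetical, and the assumption that TRANSCRIPTS_DIR and FILINGS_DIR point at existing local directories is inferred from their names rather than confirmed by the package documentation.

```python
import os

# Variable names as listed in the Tier 2 instructions.
REQUIRED_VARS = (
    "TRANSCRIPTS_DIR",
    "FILINGS_DIR",
    "GEE_PROJECT",
    "WRDS_USER",
    "SEC_USER_AGENT",
)
# Assumption: the two *_DIR variables hold paths to existing local directories.
DIR_VARS = ("TRANSCRIPTS_DIR", "FILINGS_DIR")


def check_tier2_env(env=None):
    """Return a list of problems found; an empty list means nothing was flagged."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name, "").strip()
        if not value:
            problems.append(f"{name} is not set")
        elif name in DIR_VARS and not os.path.isdir(value):
            problems.append(f"{name} is not an existing directory: {value}")
    return problems


if __name__ == "__main__":
    issues = check_tier2_env()
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Tier 2 variables look set; next step: python run_all.py")
```

Saved under any file name and run from the package root, the script prints one warning per missing or suspicious variable. It only inspects the environment and does not contact WRDS, Google Earth Engine, or SEC EDGAR.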
Institutions
- University of North Florida, Jacksonville, Florida