Greenwashing Premium

Name: Greenwashing Premium
Creator: Pieter de Jong
Published: 2026-05-18T14:55:58.461Z
Keywords: Energy Economics

doi:10.17632/kgdj6gx38t.2

Published: 18 May 2026 | Version 2 | DOI: 10.17632/kgdj6gx38t.2

Contributor:

Pieter de Jong

Description

This dataset is the replication package for "Satellite-Verified Greenwashing: ESG Claim Credibility Among US Heavy Emitters". It contains the programs, analysis-ready data, and instructions needed to reproduce every table and figure in the paper.

The paper introduces a Claim-Reality Divergence (CRD) index for 233 US publicly listed heavy-emitting firms, 2018–2023. The index triangulates three measures: (i) ESG language intensity from SEC 10-K filings and earnings call transcripts, (ii) facility-level methane emissions reported to the EPA Greenhouse Gas Reporting Program (GHGRP), and (iii) atmospheric methane column densities observed by the Sentinel-5P TROPOMI satellite, accessed via Google Earth Engine. The package documents two findings: a one-standard-deviation increase in ESG text intensity is associated with a 0.64–0.69 standard deviation increase in claim-reality gaps, and the divergence operates through selective investor-facing disclosure rather than falsification of regulatory data.

The package contains: (1) two analysis-ready datasets in CSV format covering 1,937 firm-year observations, (2) a one-command reproduction script that regenerates all tables and figures in under 60 seconds, (3) the full data-construction pipeline from raw sources, (4) a README with replication instructions, and (5) a codebook documenting every variable.

Two underlying raw data sources are proprietary (Compustat North America via WRDS; earnings call transcripts) and cannot be redistributed, but the analysis-ready datasets allow full reproduction of the paper's results without proprietary access. All other data sources (SEC EDGAR, EPA GHGRP, EPA ECHO, ESA Sentinel-5P TROPOMI) are public.
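To make the "one standard deviation" phrasing concrete, the sketch below shows how a standardized slope relates to a raw regression slope. It runs on synthetic numbers generated in the script, not on the package's data, and the variable names (esg_text, gap) are hypothetical. It is a bivariate illustration only and is not the paper's specification, which may include controls or fixed effects.

```python
import random
from statistics import mean, pstdev


def zscore(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def ols_slope(x, y):
    """Slope of a one-regressor least-squares fit of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Synthetic stand-ins (assumption): NOT values from the replication package.
rng = random.Random(0)
esg_text = [rng.gauss(10.0, 3.0) for _ in range(500)]
gap = [0.5 * t + rng.gauss(0.0, 2.0) for t in esg_text]

raw_slope = ols_slope(esg_text, gap)                  # gap units per text unit
std_slope = ols_slope(zscore(esg_text), zscore(gap))  # SDs of gap per SD of text

print(f"raw slope: {raw_slope:.3f}")
print(f"standardized slope: {std_slope:.3f}")
```

The standardized slope equals the raw slope multiplied by the ratio of the regressor's standard deviation to the outcome's, which is why it can be read directly as "standard deviations of the outcome per standard deviation of the regressor".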

Steps to reproduce

This package supports two reproduction tiers. Tier 1 reproduces every table and figure in the paper from the analysis-ready datasets included in the package. Tier 2 rebuilds those datasets from raw sources for replicators with proprietary data access.

Tier 1 (recommended, runs in approximately 60 seconds): After unzipping the package, set up a Python 3.11 environment and install dependencies with pip install -r requirements.txt. Then run python code/reproduce_paper.py. This regenerates Tables 1, 2, A1, and A2 (in regression_results.txt), Figure 1 (in figure1_crd_quintiles.png), and a machine-readable Table 1 (in table1_summary_stats.csv). All outputs are written to the results/ folder. No proprietary data access is required.

Tier 2 (optional, full pipeline, approximately 6 to 10 hours): Replicators with a WRDS Compustat subscription, an earnings-call transcripts corpus, a Google Earth Engine account, and EPA bulk downloads can rebuild the analysis-ready datasets from scratch. Set the required paths and credentials via environment variables (TRANSCRIPTS_DIR, FILINGS_DIR, GEE_PROJECT, WRDS_USER, and SEC_USER_AGENT), then run python run_all.py. The pipeline is resumable and skips steps whose outputs already exist.

Complete documentation, pinned dependency versions, the mapping from each output file to the corresponding table or figure in the paper, and a variable-by-variable codebook are included in README.md and docs/codebook.md inside the package.
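Before starting the long Tier 2 run, or after finishing Tier 1, a replicator may want a quick sanity check. The following is a minimal sketch, not part of the package: the helper names missing_env_vars and missing_outputs are hypothetical, and the only facts taken from the steps above are the environment variable names, the three output file names, and the results/ folder.

```python
import os
from pathlib import Path

# Environment variables the steps above list for the Tier 2 pipeline.
TIER2_ENV_VARS = (
    "TRANSCRIPTS_DIR",
    "FILINGS_DIR",
    "GEE_PROJECT",
    "WRDS_USER",
    "SEC_USER_AGENT",
)

# Output files the steps above list for Tier 1, all written to results/.
TIER1_OUTPUTS = (
    "regression_results.txt",
    "figure1_crd_quintiles.png",
    "table1_summary_stats.csv",
)


def missing_env_vars(environ=None):
    """Return the Tier 2 variables that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in TIER2_ENV_VARS if not env.get(name)]


def missing_outputs(results_dir="results"):
    """Return the Tier 1 output files not found in results_dir (hypothetical helper)."""
    root = Path(results_dir)
    return [name for name in TIER1_OUTPUTS if not (root / name).is_file()]


if __name__ == "__main__":
    print("Unset Tier 2 variables:", missing_env_vars() or "none")
    print("Missing Tier 1 outputs:", missing_outputs() or "none")
```

Run from the unzipped package root, this reports which variables still need to be set before python run_all.py and whether python code/reproduce_paper.py produced the three listed files. It does not validate file contents or that the paths and credentials are usable.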

Institutions

University of North Florida
Jacksonville, Florida
