STOXX 3000 Sustainability Reporting Text Measures
Description
This dataset provides firm-level, text-derived indicators based on sustainability reporting by STOXX 3000 companies. The current version contains a binary variable identifying whether a firm-year sustainability report includes references to the Sustainable Development Goals (SDGs). The SDG indicator is constructed through automated text processing of publicly available sustainability reports. No copyrighted raw text is included in this dataset.

Each observation is uniquely identified by ISIN and year and is accompanied by core firm metadata:
- Reporting year
- Country of incorporation
- Industry sector (STOXX classification)
- Index component classification (large, mid, or small cap)

These variables allow users to merge the dataset easily with data from external financial or sustainability databases.

The repository also includes the replication code (R) used to generate the SDG dummy variable and produce the empirical results for the associated publication. The code illustrates the text-processing workflow while ensuring that underlying copyrighted documents are not redistributed.

Future versions of this dataset will extend the available text-derived measures. Planned additions include indicators constructed using alternative dictionaries, custom lexicons, or thematic classifications applied to sustainability disclosures. Only derived variables will be released; raw corporate text will not be shared.

Intended use: Researchers can use the dataset for replication, robustness checks, comparative textual analysis, or as a foundation for expanded sustainability research on STOXX 3000 firms.
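As an illustration of merging on the ISIN-year key described above, the following base R sketch joins a small SDG indicator table to a user's own firm-level table. The column names (isin, year, sdg_dummy, total_assets), the placeholder ISINs, and all values are assumptions made for this example; the column names in the released files may differ.

```r
# Hypothetical SDG indicator table; names and values are illustrative only.
sdg <- data.frame(
  isin = c("XX0000000001", "XX0000000001", "XX0000000002"),
  year = c(2019, 2020, 2020),
  sdg_dummy = c(1L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical user-supplied financial data keyed on the same identifiers.
own_fin <- data.frame(
  isin = c("XX0000000001", "XX0000000001"),
  year = c(2019, 2020),
  total_assets = c(120.5, 131.0),
  stringsAsFactors = FALSE
)

# Each observation should be unique on ISIN and year before merging.
stopifnot(anyDuplicated(sdg[, c("isin", "year")]) == 0)

# Left join: keep every SDG firm-year; unmatched financials become NA.
merged <- merge(sdg, own_fin, by = c("isin", "year"), all.x = TRUE)
print(merged)
```

Using all.x = TRUE keeps firm-years that have no match in the external table, which makes coverage gaps visible instead of silently dropping rows.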
Files
Steps to reproduce
The data underlying this project were created through a combination of text collection, text processing, and the generation of statistically matched synthetic financial variables.

Sustainability reports were collected from publicly available online sources using the Google Search API and were checked to ensure correctness and relevance. The reports were downloaded as PDFs and converted to text files for processing.

Text analysis relied on the quanteda workflow: documents were loaded into a corpus, tokenized, and transformed into document-feature matrices. Using pattern-based filtering, SDG-related sentences were extracted, and these were used to construct yearly discursive SDG involvement indicators for each ISIN. These indicators originate entirely from publicly accessible sustainability report texts and are independent of any proprietary financial databases.

Because financial data derived from commercial sources cannot be redistributed, synthetic financial variables were created instead. These were generated to reproduce the empirical distributions, means, and variances found in the original firm-level data, while containing no actual proprietary values. The resulting dataset data_synt.Rdata retains the real SDG dummy variables on ISIN-years but uses synthetic financials; users may replace the synthetic variables with their own firm-level data to reproduce or extend the regression analyses.

This workflow provides a reproducible structure for text-based SDG variable construction and a safe environment for running the econometric models without sharing restricted data.
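The steps described above might look roughly like the following quanteda sketch. It is not the replication code itself. The toy report texts, the placeholder ISINs, the two glob patterns, the column names, and the target moments are all assumptions made for illustration. The synthetic variable at the end matches only a mean and standard deviation through rescaled normal draws, which is a simplification of whatever procedure was actually used.

```r
library(quanteda)

# Hypothetical input: one row per firm-year report, text already extracted from PDF.
reports <- data.frame(
  isin = c("XX0000000001", "XX0000000001", "XX0000000002"),
  year = c(2019, 2020, 2020),
  text = c(
    "We reduced emissions. Our strategy supports SDG 13 on climate action.",
    "We report on water use. Governance was strengthened.",
    "Our targets align with the Sustainable Development Goals. We cut waste."
  ),
  stringsAsFactors = FALSE
)

# Load the reports into a corpus and split it into sentences;
# the isin and year document variables are carried over to each sentence.
corp <- corpus(reports, text_field = "text")
sentences <- corpus_reshape(corp, to = "sentences")

# Illustrative SDG patterns (assumed; the actual patterns may differ).
sdg_dict <- dictionary(list(sdg = c("sdg*", "sustainable development goal*")))

# Tokenize and keep only tokens that match the SDG patterns.
toks <- tokens(sentences, remove_punct = TRUE)
sdg_toks <- tokens_lookup(toks, sdg_dict, valuetype = "glob",
                          case_insensitive = TRUE)

# Count SDG matches per sentence, then aggregate to the ISIN-year level.
sent <- docvars(sentences)[, c("isin", "year")]
sent$sdg_hits <- unname(ntoken(sdg_toks))
firm_year <- aggregate(sdg_hits ~ isin + year, data = sent, FUN = sum)
firm_year$sdg_dummy <- as.integer(firm_year$sdg_hits > 0)

# Synthetic financial variable: standardized normal draws rescaled to
# hypothetical target moments (no proprietary values are involved).
set.seed(1)
target_mean <- 5.2
target_sd <- 1.1
z <- rnorm(nrow(firm_year))
z <- (z - mean(z)) / sd(z)
firm_year$fin_synt <- target_mean + target_sd * z

print(firm_year)
```

To rerun the regressions on real data, a user would replace the fin_synt column with their own firm-level variable while leaving the text-derived sdg_dummy untouched.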
Institutions
- Open Universiteit Faculteit Management science en technologie