SDG mentions in corporate sustainability reports, 2016 and 2020
This dataset contains the frequencies of SDG mentions in the corporate sustainability reports of 300 large enterprises taken from the Stoxx Global 3000, covering two years (2016 and 2020). The sample has three equal groups of companies from the USA, Europe and East Asia (Japan, Korea, Taiwan, or "JKT"). The sustainability reports of these 300 companies were collected from a database (corporateregister.com). All texts were analysed for the presence of SDG-related terms using a dictionary (SDG-dictionary.txt) that the author compiled from characteristic SDG words in the SDG foundational documents (the text of the UN resolution). The dataset can be used to explore the sustainability reporting practices of large stock-listed companies in connection with financial and organizational variables. Because the original document-feature matrix (dfm) is also included, the data can additionally be used to explore other features of sustainability reporting.
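The dictionary-based counting described above can be illustrated with a minimal Python sketch. It is not the author's code (the original processing was done in R with quanteda), and it rests on assumptions: the two dictionary entries are made-up placeholders rather than entries from SDG-dictionary.txt, every dictionary term is treated as a single word, and the relative count is taken over the tokens that remain after stopword removal.

```python
import re
from collections import Counter

# Hypothetical toy dictionary for illustration only;
# the real terms are listed in SDG-dictionary.txt.
TOY_DICTIONARY = {
    "sdg01": {"poverty"},
    "sdg13": {"climate", "emissions"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def sdg_counts(text, dictionary, stopwords=frozenset()):
    """Return (absolute, relative) dictionary counts for one document."""
    tokens = [t for t in tokenize(text) if t not in stopwords]
    features = Counter(tokens)      # one row of a document-feature matrix
    total = sum(features.values())  # total word count of the document
    absolute = {
        key: sum(features[word] for word in words)
        for key, words in dictionary.items()
    }
    relative = {
        key: (count / total if total else 0.0)
        for key, count in absolute.items()
    }
    return absolute, relative


absolute, relative = sdg_counts(
    "Climate action reduces emissions and poverty.",
    TOY_DICTIONARY,
    stopwords={"and"},
)
```

In this toy example five tokens remain after removing "and", so the two matches for sdg13 give a relative count of 2/5.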
Steps to reproduce
The dataset contains the ISIN code, the year and the SDG word frequencies. The document ID (doc_id) contains a code for the type of report:
SR = stand-alone sustainability report
IR = integrated report
GC = Global Compact Communication of Progress Report
ER = environmental report
HR = human resources report
CR = climate-related financial report
GR = GRI content index (separately published)
For each company, all reports were retrieved as they appeared in the corporateregister.com database. Some companies have more than one report per year. If desired, the scores per SDG can be merged to obtain one score per company, or only the document type of interest can be selected. The data was used to estimate the weight of the different SDG topics in the reports. The frequencies are available as absolute and relative counts (weighted by the number of words in the document).
1. Get the Stoxx Global 3000 list.
2. Select 100 large companies from each country group using propensity matching on company size (log assets). These are in the file "company_list.csv".
3. Collect the sustainability reports in PDF form.
4. Convert the PDFs to text.
5. Make a corpus and tokenize it, removing stopwords and company names from the text.
6. Convert the tokenized text to a document-feature matrix (dfm).
7. Create the SDG dictionary. This is the file "sdg-dictionary.txt", included here just for reference.
8. Map the SDG dictionary onto the dfm, absolute or weighted.
9. Export the output to data files. These are the files "SDG_frequencies_absolute.txt" and "SDG_frequencies-weighted.txt".
10. Merge the company list with the SDG frequencies files.
The frequency files contain 545 documents from 250 unique companies. The missing 50 companies did not publish a sustainability report in the years 2016 or 2020. Comparing the ISINs from the company list with the SDG_frequencies files will show which companies did not publish a report.
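Merging the scores per company and finding the companies without a report can be sketched as follows. This is a hedged Python illustration, not the author's procedure: the field names "isin" and "year", the row layout and the placeholder ISINs are all assumptions made for the example, and the real column names in "company_list.csv" and the SDG_frequencies files may differ.

```python
from collections import defaultdict


def merge_reports(rows, score_columns):
    """Sum absolute SDG counts over all reports of one company in one year."""
    totals = defaultdict(lambda: dict.fromkeys(score_columns, 0))
    for row in rows:
        key = (row["isin"], row["year"])  # assumed field names
        for column in score_columns:
            totals[key][column] += row[column]
    return dict(totals)


def companies_without_report(company_isins, rows):
    """Return the ISINs on the company list that have no report in rows."""
    reported = {row["isin"] for row in rows}
    return sorted(set(company_isins) - reported)


# Made-up placeholder rows, not taken from the dataset.
rows = [
    {"isin": "AA0000000001", "year": 2016, "sdg01": 2, "sdg13": 5},
    {"isin": "AA0000000001", "year": 2016, "sdg01": 1, "sdg13": 0},
    {"isin": "BB0000000002", "year": 2020, "sdg01": 0, "sdg13": 4},
]
merged = merge_reports(rows, ["sdg01", "sdg13"])
missing = companies_without_report(
    ["AA0000000001", "BB0000000002", "CC0000000003"], rows
)
```

Summing is only meaningful for the absolute counts. Relative counts of several reports cannot simply be added; one option is to merge the absolute counts first and then divide by the combined word count.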
Data structure:
doc_id: file name of the corporate report, containing the ISIN, the year, the type of report and a serial number if there is more than one report (e.g. a report plus a separate attachment, such as a data report).
sdg01-17 and sdg: SDG word counts, absolute or relative (the relative count is the absolute count divided by the total word count of the report in the dfm).
gc/gri/int: word counts related to the Global Compact, the Global Reporting Initiative and the IIRC/<IR> integrated reporting standard. A score higher than 0 is indicative of the company being a GC member or following the GRI or IIRC reporting standard.
country, year and company ISIN are extracted from the doc_id.
All data processing was performed with the R package "quanteda": Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., & Matsuo, A. (2018). quanteda: An R package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30), 774. https://doi.org/10.21105/joss.00774
For the quanteda tutorial, see: https://tutorials.quanteda.io/
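Extracting country, year, ISIN and report type from a doc_id could look like the Python sketch below. The exact layout of the doc_id is not specified here, so the pattern "ISIN_year_type_serial" and the example IDs are hypothetical and should be adapted to the actual file names. Taking the country from the first two letters of the ISIN is likewise an assumption; it relies on the standard ISIN layout of a two-letter country prefix, nine alphanumeric characters and a check digit.

```python
import re

# Report type codes as listed in the dataset description.
REPORT_TYPES = {
    "SR": "stand-alone sustainability report",
    "IR": "integrated report",
    "GC": "Global Compact Communication of Progress Report",
    "ER": "environmental report",
    "HR": "human resources report",
    "CR": "climate-related financial report",
    "GR": "GRI content index (separately published)",
}

# Hypothetical doc_id layout: <ISIN>_<year>_<type>[_<serial>]
DOC_ID_PATTERN = re.compile(
    r"^(?P<isin>[A-Z]{2}[A-Z0-9]{9}[0-9])"
    r"_(?P<year>\d{4})"
    r"_(?P<type>[A-Z]{2})"
    r"(?:_(?P<serial>\d+))?"
)


def parse_doc_id(doc_id):
    """Split a doc_id into ISIN, country, year, report type and serial."""
    match = DOC_ID_PATTERN.match(doc_id)
    if match is None:
        raise ValueError(f"unrecognised doc_id: {doc_id!r}")
    serial = match.group("serial")
    return {
        "isin": match.group("isin"),
        "country": match.group("isin")[:2],  # assumed: ISIN country prefix
        "year": int(match.group("year")),
        "type": match.group("type"),
        "type_label": REPORT_TYPES.get(match.group("type"), "unknown"),
        "serial": int(serial) if serial is not None else None,
    }


# Made-up example IDs, not taken from the dataset.
first = parse_doc_id("US0000000001_2020_SR_2.txt")
second = parse_doc_id("JP0000000002_2016_IR")
```

The pattern is anchored only at the start, so a trailing file extension such as ".txt" is ignored.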