A dataset to measure China biodiversity risk
Description
Extinctions of biological populations are becoming more frequent and have important implications for related sectors. As a result, the risks associated with biodiversity have received increasing attention and are now regarded as a new class of risk factor. To understand the drivers of biodiversity risk, it is crucial to measure it at multiple levels, especially in developing countries. We use machine learning and text mining to measure the biodiversity risk of the Chinese market from 2000 to 2023 at three levels: macro (government), meso (industry), and micro (firm). The inputs are official media news texts, holdings data for related funds, and the annual report texts of listed companies, and the dataset provides one biodiversity risk measure for each of the three levels. Unlike previous biodiversity risk measurements, our data therefore reflect China's biodiversity risk from several perspectives at once. The data can also be grouped by categorical dimensions such as time, city, and industry, so they can be matched with the data used in most related studies.
The macro-level data come from news published by Chinese mainstream media between 2013 and 2023; we apply a machine learning approach to text mining to obtain biodiversity risk for 5,394 trading days. The meso-level data come from more than 40 funds tied to conceptual themes such as ‘bioprotection’ that were listed between 2015 and 2023. The micro-level indicators are extracted from the annual reports of 5,606 firms listed on the Shanghai Stock Exchange, Shenzhen Stock Exchange, and Beijing Stock Exchange from 2000 to 2023.
Files
Steps to reproduce
Step 1: We retrieve a total of 38,260 articles from the China National Knowledge Infrastructure (CNKI) using biodiversity as the search keyword, and take the keywords of these articles as candidate biodiversity keywords. After manual screening, 451 biodiversity-related words are retained as the keywords for identifying biodiversity statements.
Step 2: We segment the text with regular expressions, using punctuation as the sentence boundary, to obtain the body content that meets the requirements.
Step 3: We select the sentences that contain biodiversity keywords and discard irrelevant sentences. We then sample an appropriate number of sentences from the annual report text set and from the news text set, respectively. Positive sentences are labelled ‘1’, neutral sentences ‘0’, and negative sentences ‘-1’. Nine industry experts are invited to annotate the selected sentences manually. To limit subjective influence as far as possible, we adopt a cross-labelling approach: every sentence and sentiment label that enters the training and validation sets must be labelled by three experts working independently of one another.
Step 4: We employ a pre-trained BERT model (hfl/chinese-roberta-wwm-ext), which was developed by HFL based on the RoBERTa architecture (Cui et al., 2021). We first use the model to generate a validation report. After obtaining excellent validation results, we use it to evaluate the sentiment of the annual report sentences and the news sentences separately, and finally assign a sentiment label to every sentence.
Step 5: We combine the basic information on listed firms obtained from the CSMAR and WIND databases with the sentence ratings already obtained from the annual reports.
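The segmentation, keyword filtering, and cross-labelling in Steps 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' code. The punctuation set, the function names (split_sentences, filter_by_keywords, majority_label), the two sample keywords, and the sample text are all hypothetical. In particular, the description does not state how the three experts' labels are reconciled, so the strict majority vote used here is only an assumption.

```python
import re
from collections import Counter

# Assumed set of sentence-ending punctuation (Chinese and Western) used as boundaries.
SENTENCE_BOUNDARY = re.compile(r"[。！？；!?;\n]+")


def split_sentences(text):
    """Split raw text into sentences at punctuation boundaries (Step 2)."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]


def filter_by_keywords(sentences, keywords):
    """Keep only sentences containing at least one biodiversity keyword (Step 3)."""
    return [s for s in sentences if any(k in s for k in keywords)]


def majority_label(labels):
    """Return the label given by a strict majority of annotators, else None.

    Assumption: the source only says each sentence is labelled by three
    independent experts; majority voting is one plausible way to combine them.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


# Hypothetical inputs: two stand-in keywords instead of the 451 screened words.
keywords = ["生物多样性", "湿地"]
text = "公司重视生物多样性保护。本年度营业收入增长。湿地生态系统受到破坏！"

sentences = split_sentences(text)
relevant = filter_by_keywords(sentences, keywords)

# Labels follow the coding in Step 3: 1 positive, 0 neutral, -1 negative.
agreed = majority_label([1, 1, 0])
disputed = majority_label([1, 0, -1])
```

In this sketch, a sentence on which the three annotators all disagree returns None and would be left out of the training and validation sets; whether the original workflow drops or re-annotates such sentences is not stated.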
Institutions
Categories
Funding
National Natural Science Foundation of China
72201003
Young and Middle-Aged Teacher Training Action Programme in Anhui Province Universities
YQYB2024002
Philosophy and Social Science Planning Project of Anhui Province, China
AHSKQ2022D027
Research Project on Innovation and Development of Social Sciences in Anhui Province, China
2022CX031