The Weibo blogs dataset related to the keyword "land expropriation (征地)" (in China, from April 2011 to December 2021).

Published: 17 June 2024 | Version 1 | DOI: 10.17632/m5fmbd3v5d.1
Shiyun Tang


1. Weibo Blogs and Time Information. We used a web crawler to extract 408,199 Weibo texts containing "land expropriation" from August 4, 2011, to December 31, 2021. The Weibo texts were then filtered on whether they contain the keyword "land expropriation" (mainly to exclude irrelevant posts from the crawled data), leaving a total of 364,071 relevant entries.

2. Weibo Geographic Information. The chinese_province_city_area_mapper tool extracted geographic information from the Weibo texts, identifying provincial-level data in 249,023 texts, city-level data in 204,227 texts, and district-level data in 105,813 texts. We observed that local media and opinion leaders often include their geographic location in their user nicknames. This made it possible to extract additional geographic information from user names for Weibo texts that lack geographic details. After this step, the dataset contains 264,111 provincial-level, 223,722 city-level, and 106,361 district-level geographic entries. Using the Baidu Maps API, we added latitude and longitude based on the extracted location. The "similarity" field is the matching degree of the geographic location description returned by the Baidu Maps API.

3. To protect privacy, the original text fields (including the blog text and nick_name) have been omitted.

Description of the data columns:
- 'info_key': the index of this blog in the original text dataset.
- 'zhengdi': equals 1 for every filtered blog, since each contains the keyword "land expropriation (征地)".
- 'Province', 'City', 'adcode': the geographic information extracted from the blog text and nick_name.
- 'date', 'hour', 'min': the timestamp of the Weibo blog.
- 'location', 'Lon', 'Lat': the geographic information returned by the Baidu Maps API.
- 'similarity': the matching degree of the geographic location description returned by the Baidu Maps API.
- 'Province_zh', 'City_zh', 'District_zh': the simplified Chinese version of the geographic information.
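
The keyword filter and the nickname fallback described above can be illustrated with a minimal Python sketch. It is not the authors' code: the three-entry PROVINCES gazetteer and the function names are hypothetical stand-ins for the chinese_province_city_area_mapper tool, and the sample posts in the usage note are invented for illustration.

```python
# Hypothetical mini-gazetteer (assumption); the dataset itself relied on the
# chinese_province_city_area_mapper tool, which is not reproduced here.
PROVINCES = {"河南": "Henan", "广东": "Guangdong", "四川": "Sichuan"}
KEYWORD = "征地"  # "land expropriation"


def find_province(text):
    """Return the province whose name appears earliest in text, else None."""
    hits = [(text.find(zh), en) for zh, en in PROVINCES.items() if zh in text]
    return min(hits)[1] if hits else None


def locate(text, nick_name):
    """Prefer a location found in the post text; fall back to the nickname."""
    return find_province(text) or find_province(nick_name)


def process(posts):
    """Keep posts containing the keyword and attach a province when found."""
    rows = []
    for text, nick_name in posts:
        if KEYWORD not in text:
            continue  # drop irrelevant posts picked up by the crawler
        rows.append({"zhengdi": 1, "Province": locate(text, nick_name)})
    return rows
```

For example, calling process on the pairs ("征地问题引发关注", "广东日报") and ("今天天气很好", "四川新闻") keeps only the first post and takes its province from the nickname, because the post text itself names no province.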


Steps to reproduce

We used a web crawler to extract 408,199 Weibo texts containing "land expropriation" from August 4, 2011, to December 31, 2021. The Weibo texts were then filtered on whether they contain the keyword "land expropriation" (mainly to exclude irrelevant posts from the crawled data), leaving a total of 364,071 relevant entries.

Workflow:
(1) Use browser-emulating crawler software (such as Playwright) to open the search page of Weibo's official website.
(2) Simulate logging in to a Weibo account.
(3) Set the keyword "land expropriation (征地)" and set the date. Because Weibo's official website displays only a limited number of search results per query, we searched day by day. Repeat this process until the data for all days have been obtained.
(4) Clean the dataset and extract the required information.
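
The day-by-day search loop in step (3) can be sketched as follows. This is only an outline under stated assumptions: fetch_day is a hypothetical caller-supplied function (for instance, one wrapping a Playwright session) that returns the posts found for a keyword on a single day, and no actual Weibo request is made here.

```python
from datetime import date, timedelta


def daily_dates(start, end):
    """Yield every calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def crawl(start, end, fetch_day, keyword="征地"):
    """Run one search per day and collect the results.

    fetch_day(keyword, day) is a hypothetical callable supplied by the
    caller; it should return a list of posts for that keyword and day.
    """
    posts = []
    for day in daily_dates(start, end):
        posts.extend(fetch_day(keyword, day))
    return posts
```

Querying one day at a time keeps each result set small enough to stay under the display limit mentioned above, at the cost of one search per calendar day in the study period.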


Renmin University of China, Department of Economics


