COUNTY ECONOMY DATA
Description
This dataset contains panel data on 2,036 county-level administrative divisions in China (counties, county-level cities, and municipal districts) from 2013 to 2019, totaling 14,252 observations (2,036 units over 7 years). It was constructed for research on transport economics, regional development, and the policy impacts of infrastructure development. Its primary research application is analyzing the direct economic impacts, industrial structure upgrading, and spatial spillover effects of High-Speed Railway (HSR) station openings using Spatial Difference-in-Differences (SDID) models.

Key Dataset Characteristics
- Temporal Coverage: 2013–2019 (annual frequency).
- Geographical Coverage: County-level divisions across Mainland China, categorized into 7 macro-economic regions.
- Methodological Applicability: Suited to multi-period policy evaluation, classic Difference-in-Differences (DID), and spatial econometric modeling (SAR, SEM, SDM, SAC).

Variable Descriptions
The variables in the dataset are categorized as follows:

1. Basic Identifiers & Spatial Categories
- region: Macro-region code (1 to 7).
- province_num: Name of the parent province.
- county_num: Name of the county-level unit.
- year: Observation year (2013–2019).
- d1 to d7: Regional dummy variables representing the Southwest, South China, Central China, North China, East China, Northwest, and Northeast regions.

2. Treatment / Policy Variable
- station: HSR station opening dummy variable ($0$ = no HSR station open in the county in that year; $1$ = an HSR station opened in or before that year).

3. Economic Output Indicators (Dependent Variables)
- gdp / ln_gdp: Gross Domestic Product of the county (in ten thousand RMB) and its natural logarithm.
- GDP_primary: Value added of the primary (agricultural) industry.
- GDP_secondary: Value added of the secondary (industrial/construction) industry.
- GDP_tertiary / ln_GDP_tertiary: Value added of the tertiary (service) industry and its natural logarithm.

4. Control & Spatial Density Variables
- secondary_employee / tertiary_employee / ln_tertiary_employee: Employment in the industrial and service sectors, and the natural logarithm of service-sector employment.
- Residential savings...: Year-end balance of savings deposits of rural and urban residents.
- loan balance of...: Year-end loan balance of local financial institutions.
- area_county: Administrative land area of the county (in $\mathrm{km}^2$).
- economic_density / job_density: Economic output and employment per unit of land area.
- P_j_i / entropy_gdp: Industrial structure and optimization ratios, and entropy-based economic diversification indexes.

Potential Research Value
This dataset supports research on how mega-infrastructure projects affect micro-level economies, in particular the trade-off between local economic growth and the "siphon" (agglomeration) effects on neighboring peripheral areas.
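To make the variable definitions above concrete, the following is a minimal sketch in Python of how `station`, `ln_gdp`, and `economic_density` could be derived for a county-year panel. The `open_year` column, the function name, and the toy values are hypothetical and not part of the dataset. The density formula (gdp divided by area_county) is an assumption read off the variable description, not a confirmed definition.

```python
import numpy as np
import pandas as pd


def add_derived_columns(panel: pd.DataFrame) -> pd.DataFrame:
    """Derive station, ln_gdp and economic_density for a county-year panel.

    Expects columns county_num, year, gdp, area_county, plus a hypothetical
    open_year column (NaN when the county never receives an HSR station).
    """
    out = panel.copy()
    # station = 1 if an HSR station opened in or before the observation year.
    # Comparisons against NaN evaluate to False, so never-treated counties get 0.
    out["station"] = (out["open_year"] <= out["year"]).astype(int)
    # Natural logarithm of county GDP.
    out["ln_gdp"] = np.log(out["gdp"])
    # Assumed definition: economic output per unit of land area.
    out["economic_density"] = out["gdp"] / out["area_county"]
    return out


# Toy panel: county A gets a station in 2014, county B never does.
toy = pd.DataFrame({
    "county_num": ["A"] * 3 + ["B"] * 3,
    "year": [2013, 2014, 2015] * 2,
    "gdp": [100.0, 110.0, 121.0, 200.0, 210.0, 220.0],
    "area_county": [50.0] * 3 + [100.0] * 3,
    "open_year": [2014.0] * 3 + [np.nan] * 3,
})
result = add_derived_columns(toy)
print(result[["county_num", "year", "station", "ln_gdp", "economic_density"]])
```

Because `station` stays at 1 in every year after the opening, counties can switch into treatment in different years, which is the structure a multi-period DID specification relies on.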
Steps to reproduce
Step 1: Data Extraction and Digital Conversion
- Socioeconomic Data Extraction: PDF versions of the China County Statistical Yearbook (2014–2020) were parsed programmatically with the Python library pdfplumber to extract the raw data tables into CSV files.
- Web Scraping HSR Data: Information on HSR network operations was scraped from official railway schedules and announcements using Python's BeautifulSoup library and Scrapy framework.
- Geospatial Shapefiles: National GIS shapefiles of Chinese county boundaries (scale 1:1,000,000) were obtained to compute land area statistics and geographical coordinates.

Step 2: Spatial Geoprocessing and Centroid Computations
To construct the spatial weight matrix ($W_{ij}$), the centroid of each county-level division was calculated in R (specifically with the sf package).
- Projection Coordinate System: Geospatial boundary vector data were projected onto the Krasovsky 1940 Albers coordinate system to ensure accurate metric distance computations across China's vast landmass.
- Great-Circle Distance Computations: The centroid coordinates (latitude and longitude) of each county were extracted. The distances between all pairs of counties ($i$ and $j$) were calculated using the Haversine formula:
$$d_{ij} = 2 R \arcsin \left( \sqrt{\sin^2\left(\frac{\Delta \phi}{2}\right) + \cos(\phi_i) \cos(\phi_j) \sin^2\left(\frac{\Delta \lambda}{2}\right)} \right)$$
where $R$ is the Earth's radius, $\phi_i$ and $\phi_j$ are the centroid latitudes in radians, and $\Delta \phi$ and $\Delta \lambda$ are the differences in latitude and longitude between the two centroids.

Step 3: Data Cleaning, Matching, and Harmonization Protocols
Because of administrative mergers, splits, and renamings in China between 2013 and 2019, a strict administrative harmonization protocol was enforced:
- Division Code Matching: All records were mapped using the official GB/T 2260 Administrative Division Codes. Counties that underwent boundary changes or municipal conversions (e.g., a county converted into a city district) were harmonized to their 2019 administrative boundaries to preserve panel integrity.
- Missing Value Treatment: For administrative units with missing data in local yearbooks (representing $<1.5\%$ of the total dataset), linear interpolation was applied using adjacent temporal observations, provided that no more than two consecutive years were missing:
$$x_{it} = x_{i,t-1} + \frac{x_{i,t+k} - x_{i,t-1}}{k+1}$$
Here $x_{i,t-1}$ and $x_{i,t+k}$ are the nearest observed values on either side of a gap of $k$ consecutive missing years that starts at $t$. Units with chronic data gaps exceeding three consecutive years were excluded entirely, resulting in the final selection of 2,036 counties.
- Outlier Handling (Winsorization): Extreme outliers caused by localized accounting revisions in statistical yearbooks were winsorized at the 1st and 99th percentiles of their annual distribution, for variables such as gdp and the industrial output values.

Step 4: Analytical Calculations and Quality Assurance
Derived indicators (such as economic_density, job_density, ins, and entropy_gdp) were generated programmatically in Python (pandas and NumPy) and R using the mathematical definitions detailed in Section 7.
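The distance, interpolation, and winsorization rules in Steps 2 and 3 can be sketched in Python roughly as follows. This is an illustrative sketch, not the original processing code: the function names are hypothetical, the Earth radius of 6371 km is an assumed value (the value of $R$ actually used is not stated), and pandas' built-in linear interpolation stands in for the interpolation formula, to which it is equivalent for equally spaced annual observations.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; not stated in the source


def haversine_matrix(lat_deg, lon_deg, radius=EARTH_RADIUS_KM):
    """Pairwise great-circle distances between centroids given in degrees."""
    phi = np.radians(np.asarray(lat_deg, dtype=float))
    lam = np.radians(np.asarray(lon_deg, dtype=float))
    dphi = phi[:, None] - phi[None, :]
    dlam = lam[:, None] - lam[None, :]
    a = (np.sin(dphi / 2) ** 2
         + np.cos(phi)[:, None] * np.cos(phi)[None, :] * np.sin(dlam / 2) ** 2)
    # Clip guards against tiny floating-point excursions outside [0, 1].
    return 2 * radius * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def interpolate_short_gaps(series: pd.Series, max_gap: int = 2) -> pd.Series:
    """Linearly fill interior gaps of at most max_gap consecutive missing years.

    Assumes the series is sorted by year with one entry per year.
    """
    is_na = series.isna()
    # Label runs of consecutive identical flags, then get each run's NaN count.
    run_id = (is_na != is_na.shift()).cumsum()
    run_len = is_na.groupby(run_id).transform("sum")
    filled = series.interpolate(method="linear", limit_area="inside")
    # Keep interpolated values only where the gap is short enough.
    return filled.where(~is_na | (run_len <= max_gap), other=np.nan)


def winsorize_by_year(df: pd.DataFrame, column: str,
                      lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip a column to its within-year [lower, upper] quantiles."""
    grouped = df.groupby("year")[column]
    lo = grouped.transform(lambda s: s.quantile(lower))
    hi = grouped.transform(lambda s: s.quantile(upper))
    clipped = np.clip(df[column].to_numpy(), lo.to_numpy(), hi.to_numpy())
    return pd.Series(clipped, index=df.index, name=column)


# Toy usage: two centroids, one county series with a two-year gap,
# and one year of values to winsorize.
distances = haversine_matrix([39.90, 31.23], [116.41, 121.47])
gdp_series = pd.Series([10.0, np.nan, np.nan, 40.0], index=[2013, 2014, 2015, 2016])
gdp_filled = interpolate_short_gaps(gdp_series)
toy_year = pd.DataFrame({"year": [2013] * 101, "gdp": np.arange(101.0)})
gdp_winsorized = winsorize_by_year(toy_year, "gdp")
print(distances.round(1))
print(gdp_filled)
print(gdp_winsorized.min(), gdp_winsorized.max())
```

In a full panel, `interpolate_short_gaps` would be applied per county (for example through a group-by on the county identifier) after sorting by year, and the resulting distance matrix would still need to be converted into weights $W_{ij}$ by whichever scheme the analysis adopts.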
Institutions
- Sichuan University of Science and Engineering, Zigong, Sichuan