DeepSeek AI–Fintech Regulatory Shock Panel Dataset: Comparator-Adjusted Firm-Quarter Data, 2019–2024

Published: 13 April 2026 | Version 1 | DOI: 10.17632/bhs8m6bnzc.1
Contributor:
marco BONELLI

Description

This dataset supports the study “Regulatory Shocks and Cost Efficiency in China’s AI–Fintech Sector: Quasi-Experimental Evidence from DeepSeek and Comparator Firms.” It provides a comparator-adjusted firm-quarter panel for 2019–2024 constructed to examine how China’s 2021 data-governance shock, centered on the Data Security Law (DSL) and Personal Information Protection Law (PIPL), affected compliance/legal cost share, infrastructure cost share, and modeled gross-margin resilience in an AI–fintech setting.

The focal firm is DeepSeek, treated as a high-exposure private AI firm operating in a finance-related, data-sensitive domain. Public comparator firms include SenseTime, Baidu AI, and iFlyTek, with segment extraction and scope-adjustment procedures used to improve comparability across firms that differ in scale, diversification, and business-model breadth. The dataset is designed for firm-quarter analysis and for replication of the event-study, difference-in-differences, and robustness procedures reported in the associated manuscript.

The package includes: (1) the main firm-quarter panel; (2) a variable dictionary/codebook; (3) source-mapping and provenance materials; (4) comparator-scaling and haircut assumptions; (5) validation sheets; and (6) simulation-related inputs used for the paper’s Monte Carlo risk-translation extension.

Key variables include compliance/legal cost share (ℓ), infrastructure cost share (cinfra), modeled gross-margin proxy (GM*), pre-shock exposure intensity, software-process intensity (SPI), customer churn proxies, competitor pricing-pressure proxies, and related benchmarking fields.

All data are derived from public or public-facing materials, including company filings, policy documents, industry and analyst sources, public operating signals, and documented transformations. No proprietary internal company records are included. DeepSeek-related values are inferred through transparent public-data proxy construction rather than direct access to internal accounts. This deposit is intended to improve transparency and reproducibility in a context where private AI firms disclose limited internal financial information. It should be interpreted as a structured research dataset for empirical replication, robustness analysis, and comparative benchmarking rather than as a source of official company accounts.

Steps to reproduce

1. Open the master workbook and read the README and variable dictionary sheets first. These explain file structure, variable names, units, sheet purposes, and the distinction between raw public-source proxies, transformed comparator-adjusted values, and analysis-ready fields.
2. Use the main firm-quarter panel sheet as the core analytical dataset. The unit of observation is the firm-quarter, and the time coverage is 2019Q1–2024Q4. Confirm firm identifiers, quarter identifiers, and outcome variables before estimation.
3. Reconstruct the main dependent variables from the documented fields where needed. The principal outcomes are legal/compliance cost share (ℓ), infrastructure cost share (cinfra), and the modeled gross-margin proxy (GM*). GM* is defined as 1 − (cinfra + sR&D + ℓ + CACshare), with CAC converted into a revenue-normalized share before entering the margin expression.
4. Use the pre-shock fields to reproduce the treatment-intensity logic. The baseline design treats the 2021 DSL/PIPL implementation window as the national regulatory shock and uses pre-period exposure intensity as the heterogeneity dimension. The post-shock indicator begins in 2021Q4. Exposure and software-process intensity (SPI) are built from pre-treatment information only.
5. Reproduce the event-study models by interacting event-time indicators with pre-shock exposure intensity, using firm fixed effects and quarter fixed effects. Use the omitted pre-treatment quarter specified in the manuscript and cluster standard errors at the firm level.
6. Reproduce the baseline DID models by estimating post-shock exposure effects on ℓ, cinfra, and GM*, again using firm fixed effects and quarter fixed effects. Then estimate the moderation models by adding the interaction between post-shock exposure and SPI.
7. Reproduce the robustness checks using the dedicated sheets and documented assumptions: lead/lag inspection, placebo timing, alternative event timing, haircut sensitivity, peer exclusions, and proxy-validation exercises.
8. For the Monte Carlo extension, use the simulation input sheets and documented priors, transformations, and dependence assumptions. Draw from the reported marginal distributions, apply the calibrated correlation structure, compute horizon-specific GM* outcomes, and recover the probability that GM* remains above 40% at the 1-, 3-, and 5-year horizons.
9. Cross-check reproduced coefficients, descriptive values, and simulation summaries against the manuscript tables, appendix tables, and figure notes. Small differences may arise from software defaults, rounding, or alternative handling of transformed proxy fields, so use the workbook documentation as the source of truth for variable construction.
10. Cite the associated manuscript and this dataset package when using the data in subsequent work.
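The GM* construction in step 3 and the post-shock indicator in step 4 can be sketched in a few lines of Python. This is an illustrative sketch only: the field names (c_infra, s_rd, legal, cac, revenue) are hypothetical and do not come from the workbook, reading sR&D as an R&D share of revenue is an assumption, and the revenue-normalized CAC share is assumed here to be CAC divided by revenue in the same quarter. The variable dictionary remains the authority on the actual construction.

```python
from dataclasses import dataclass


@dataclass
class FirmQuarter:
    """One firm-quarter row. All field names are hypothetical placeholders."""
    firm: str
    year: int
    quarter: int
    c_infra: float   # infrastructure cost share (cinfra)
    s_rd: float      # assumed reading of sR&D: R&D share of revenue
    legal: float     # compliance/legal cost share (the ℓ term)
    cac: float       # customer-acquisition cost in currency units (assumed)
    revenue: float   # revenue in the same currency units (assumed)


def cac_share(cac: float, revenue: float) -> float:
    """Revenue-normalized CAC share, assumed here to be CAC / revenue."""
    if revenue <= 0:
        raise ValueError("revenue must be positive to normalize CAC")
    return cac / revenue


def gm_star(row: FirmQuarter) -> float:
    """GM* = 1 - (cinfra + sR&D + ℓ + CACshare), as stated in step 3."""
    return 1.0 - (
        row.c_infra + row.s_rd + row.legal + cac_share(row.cac, row.revenue)
    )


def post_shock(year: int, quarter: int) -> int:
    """1 from 2021Q4 onward, 0 before (step 4)."""
    return int((year, quarter) >= (2021, 4))


# Made-up numbers for illustration only; not values from the dataset.
example = FirmQuarter("FIRM_A", 2022, 1, 0.25, 0.125, 0.0625, 5.0, 80.0)
example_gm = gm_star(example)
example_post = post_shock(example.year, example.quarter)
```

Comparing (year, quarter) tuples keeps the 2021Q4 cutoff explicit and avoids parsing quarter strings, whose format in the workbook is not assumed here.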
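For the baseline DID in step 6, one possible way to obtain the point estimate is ordinary least squares with firm and quarter dummies, where the regressor of interest is the post-shock indicator multiplied by pre-shock exposure. The sketch below uses NumPy on a synthetic panel with hypothetical firms A to D and made-up exposure values, so it checks only the mechanics. It does not compute the firm-clustered standard errors required in step 5, and the function name twfe_did is illustrative rather than taken from the manuscript.

```python
import numpy as np


def twfe_did(y, firm, quarter, post, exposure):
    """Coefficient on post x exposure from OLS with firm and quarter dummies.

    Point estimate only; clustered standard errors are not computed here.
    """
    y = np.asarray(y, dtype=float)
    treat = np.asarray(post, dtype=float) * np.asarray(exposure, dtype=float)

    # Map firm and quarter labels to integer codes 0..k-1.
    _, f_idx = np.unique(np.asarray(firm), return_inverse=True)
    _, q_idx = np.unique(np.asarray(quarter), return_inverse=True)

    # Full set of firm dummies; drop one quarter dummy to avoid collinearity.
    firm_dummies = np.eye(f_idx.max() + 1)[f_idx]
    quarter_dummies = np.eye(q_idx.max() + 1)[q_idx][:, 1:]

    X = np.column_stack([treat, firm_dummies, quarter_dummies])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(coef[0])


# Synthetic, noise-free panel: 4 hypothetical firms x 24 quarters (2019Q1-2024Q4).
# Quarter index 11 corresponds to 2021Q4, the first post-shock quarter.
exposure_by_firm = {"A": 0.9, "B": 0.3, "C": 0.5, "D": 0.2}   # made up
firm_effect = {"A": 0.10, "B": 0.08, "C": 0.12, "D": 0.07}    # made up
true_beta = 0.05                                               # made up

rows = []
for f, expo in exposure_by_firm.items():
    for t in range(24):
        post = 1 if t >= 11 else 0
        outcome = firm_effect[f] + 0.001 * t + true_beta * post * expo
        rows.append((outcome, f, t, post, expo))

y, firm, quarter, post, exposure = zip(*rows)
beta_hat = twfe_did(y, firm, quarter, post, exposure)
```

Because the synthetic outcome contains no noise, beta_hat should match true_beta up to floating-point error. The moderation model in step 6 would add a second regressor (post x exposure x SPI) to the same design matrix.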
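The Monte Carlo logic in step 8 can be sketched as follows. This is a simplified stand-in, not the paper's procedure: it assumes the four cost shares are jointly normal, whereas the documented marginals and dependence structure in the simulation input sheets may differ, and every mean, standard deviation, and correlation below is a placeholder rather than a calibrated value.

```python
import numpy as np


def prob_gm_above(mean, sd, corr, threshold=0.40, n_draws=100_000, seed=0):
    """Share of simulated GM* draws above the threshold.

    Assumes jointly normal cost shares (cinfra, sR&D, ℓ, CACshare),
    which is a simplification made for this sketch.
    """
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float)
    sd = np.asarray(sd, dtype=float)
    # Covariance from standard deviations and the correlation matrix.
    cov = np.outer(sd, sd) * np.asarray(corr, dtype=float)
    draws = rng.multivariate_normal(mean, cov, size=n_draws)
    gm = 1.0 - draws.sum(axis=1)          # GM* = 1 - sum of cost shares
    return float((gm > threshold).mean())


# Placeholder equicorrelation matrix (0.3 off the diagonal); not calibrated.
corr = np.full((4, 4), 0.3)
np.fill_diagonal(corr, 1.0)

# Placeholder horizon inputs: (means, standard deviations) per horizon in years.
horizons = {
    1: ([0.20, 0.15, 0.05, 0.10], [0.02, 0.02, 0.01, 0.02]),
    3: ([0.20, 0.15, 0.05, 0.10], [0.04, 0.03, 0.02, 0.03]),
    5: ([0.20, 0.15, 0.05, 0.10], [0.06, 0.04, 0.03, 0.04]),
}

probs = {h: prob_gm_above(m, s, corr) for h, (m, s) in horizons.items()}
```

With non-normal marginals, the same structure would apply after replacing the multivariate normal draw with a copula-based sampler that maps correlated uniforms through each marginal's inverse CDF.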

Categories

Finance, Fintech
