AI Disclosure and Corporate Misconduct Panel (U.S., 2020–2024)
Description
This dataset contains a firm-year panel of large U.S. publicly listed companies observed over the period 2020–2024. The panel includes approximately 50–60 firms (balanced where data are available), yielding roughly 250–300 firm-year observations. The sample focuses on large, technology-intensive, non-financial corporations for which artificial intelligence (AI) disclosure became strategically salient during this period. The primary purpose of the dataset is to examine the relationship between AI disclosure intensity and corporate misconduct, as well as the moderating role of executive power and governance oversight.
AI disclosure intensity is measured using a deterministic dictionary-based count of AI-related terms extracted from annual Form 10-K filings. The counting procedure applies exact, case-insensitive string matching to a pre-specified list of AI-related stems (e.g., “artificial intelligen,” “machine learn,” “deep learn,” “neural network,” “algorith,” “automat,” “predict,” “analytics,” “natural language,” “computer vision,” “autonomous”). The procedure does not involve semantic inference, classification, or machine learning. It produces an annual firm-level count of AI-related mentions, which serves as a proxy for AI salience in corporate disclosure.
Corporate misconduct is measured as the annual count of regulatory enforcement actions associated with each firm, aggregated at the firm-year level and transformed as ln(1 + count) in empirical analyses. Governance oversight is proxied by a count of governance-risk disclosure phrases in 10-K filings (e.g., “material weakness,” “restatement,” “SEC investigation,” “internal control deficiency,” “compliance failure”). Executive power is measured using CEO duality (an indicator equal to 1 if the CEO also serves as board chair).
The dataset also includes financial control variables such as total assets, profitability (e.g., net income or ROA), leverage, and revenue, as well as sector classifications.
Firm and year identifiers are included to facilitate panel estimation with fixed effects. All text-based variables are generated using standardized extraction prompts applied uniformly across firms and years, which supports transparency and replicability. The dataset supports replication of analyses examining nonlinear (quadratic) relationships between AI disclosure and misconduct, as well as moderated quadratic models incorporating executive power and governance oversight.
Files
Steps to reproduce
Sample Construction
- Identify the set of large U.S. publicly listed firms included in the dataset, using the firm identifiers (ticker and CIK) provided in the file.
- Construct a firm-year panel covering fiscal years 2020–2024.
- Retain firms with available 10-K filings and matching enforcement data.
- Apply consistent identifiers across years to ensure panel integrity.

Download 10-K Filings
- Retrieve annual Form 10-K filings for each firm-year from the SEC EDGAR database.
- Convert filings to machine-readable text (HTML or TXT format).
- Remove exhibits and non-core attachments if necessary, retaining the main filing text (Business, Risk Factors, MD&A, and Financial Statements sections).

AI Disclosure Extraction
- Apply deterministic, case-insensitive exact string matching using the predefined AI dictionary stems: “artificial intelligen”, “machine learn”, “AI” (standalone), “deep learn”, “neural network”, “algorith”, “automat”, “predict”, “forecast”, “optim”, “recommend”, “classif”, “detect”, “analytics”, “data-driven”, “natural language”, “computer vision”, “autonomous”.
- Count exact matches only (including plural and hyphenated forms). Do not apply semantic expansion or inference.
- Aggregate counts at the firm-year level to generate TOTAL_AI_MENTIONS.

Governance-Risk Disclosure Extraction
- Using the same deterministic procedure, count case-insensitive exact occurrences of: “material weakness”, “internal control deficiency”, “internal control weakness”, “restatement”, “non-reliance”, “SEC investigation”, “regulatory investigation”, “government investigation”, “compliance failure”.
- Aggregate counts at the firm-year level to generate TOTAL_GOVERNANCE_RISK_MENTIONS.

Misconduct Data Construction
- Obtain regulatory enforcement data (e.g., SEC enforcement actions).
- Match events to firms using name and CIK identifiers.
- Aggregate enforcement events by firm and fiscal year.
- Construct the dependent variable as ln(1 + enforcement count).
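
The dictionary count in the AI Disclosure Extraction step can be sketched roughly as follows. This is a minimal illustration, not the authors' extraction code: the names AI_STEMS, compile_stem, count_mentions, and total_mentions are hypothetical, and two matching rules are assumptions that the description does not spell out, namely that a stem must begin at a word boundary and that a space or hyphen inside a stem matches either whitespace or a hyphen.

```python
import re

# Stems copied from the AI Disclosure Extraction step.
# The standalone term "AI" is handled separately below.
AI_STEMS = [
    "artificial intelligen", "machine learn", "deep learn", "neural network",
    "algorith", "automat", "predict", "forecast", "optim", "recommend",
    "classif", "detect", "analytics", "data-driven", "natural language",
    "computer vision", "autonomous",
]


def compile_stem(stem):
    """Compile one stem into a case-insensitive regular expression.

    Assumptions (not stated in the source): the stem must start at a word
    boundary, and a space or hyphen inside the stem matches either
    whitespace or a hyphen, so "machine learn" also counts "machine-learning".
    """
    parts = re.split(r"[ -]", stem)
    body = r"[\s-]+".join(re.escape(part) for part in parts)
    return re.compile(r"\b" + body, re.IGNORECASE)


AI_PATTERNS = {stem: compile_stem(stem) for stem in AI_STEMS}
# Standalone "AI": bounded on both sides so that words such as "aid"
# are not counted.
AI_PATTERNS["AI"] = re.compile(r"\bAI\b", re.IGNORECASE)


def count_mentions(text, patterns):
    """Return the number of non-overlapping matches per dictionary entry."""
    return {name: len(p.findall(text)) for name, p in patterns.items()}


def total_mentions(text, patterns=AI_PATTERNS):
    """Sum all per-entry counts for one filing (one firm-year)."""
    return sum(count_mentions(text, patterns).values())
```

Passing a second pattern dictionary built from the governance-risk phrases to the same functions would give a count analogous to TOTAL_GOVERNANCE_RISK_MENTIONS. Note that different stems can match within the same phrase (for example "predictive analytics" matches both "predict" and "analytics"); whether the original counts treat such cases the same way is not stated.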

Financial and Governance Controls
- Extract financial statement data (total assets, total debt, revenue, net income) from 10-K filings or Compustat.
- Construct:
  - Size = ln(total assets)
  - Leverage = total debt / total assets
  - Profitability (ROA or net income / assets)
  - Sales growth (year-over-year revenue change)
- Code CEO duality as 1 if the CEO also serves as board chair.

Variable Construction
- Mean-center AI disclosure intensity before generating the quadratic term (AI²).
- Create moderation variables (e.g., OversightStrength as inverse of governance-risk mentions if used).

Estimation
- Estimate fixed-effects panel regressions with firm and year effects.
- Cluster standard errors at the firm level.
- Test for U-shaped effects (β1 < 0, β2 > 0), calculate turning points, and compute marginal effects at low and high AI salience.

These steps reproduce all variables and empirical models reported in the study.
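
The variable construction and the turning-point arithmetic described above can be illustrated with a short sketch. It assumes a quadratic specification in which β1 multiplies mean-centered AI disclosure and β2 multiplies its square, so the turning point in centered units is −β1 / (2β2). The function names are hypothetical, and any coefficient values passed in would come from the estimated model, not from this sketch.

```python
import math
from statistics import mean


def log_misconduct(enforcement_count):
    """Dependent variable: ln(1 + enforcement count)."""
    return math.log1p(enforcement_count)


def center_and_square(ai_counts):
    """Mean-center AI mention counts, then square the centered values.

    Returns the sample mean, the centered series, and the squared series.
    """
    m = mean(ai_counts)
    centered = [value - m for value in ai_counts]
    squared = [c * c for c in centered]
    return m, centered, squared


def turning_point(beta1, beta2, ai_mean=0.0):
    """Turning point of beta1 * x + beta2 * x**2, where x is centered AI.

    With ai_mean left at 0 the result is in centered units; passing the
    sample mean converts it back to the raw mention-count scale.
    """
    if beta2 == 0:
        raise ValueError("beta2 must be non-zero for a quadratic turning point")
    return ai_mean - beta1 / (2.0 * beta2)


def marginal_effect(beta1, beta2, ai_centered):
    """Slope of the fitted curve at a given centered AI value."""
    return beta1 + 2.0 * beta2 * ai_centered
```

Evaluating marginal_effect at, for example, one standard deviation below and above the mean of centered AI disclosure is one common way to operationalize "low" and "high" AI salience; the source does not state which cut-offs were used.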
Institutions
- Ca' Foscari University of Venice, Veneto, Venice