De-identified U.S. Truck Crash Events with Standardized Operational and Environmental Classifications
Description
This dataset contains de-identified U.S. truck crash events derived from the Federal Motor Carrier Safety Administration (FMCSA) Motor Carrier Management Information System (MCMIS). The source data have been structurally filtered, subsampled, and transformed. Filtering and transformation are deterministic and rule-based; the subsampling step is the only one that uses randomized selection. The transformation standardizes environmental, roadway, vehicle, and operational characteristics into controlled classification codes while preserving original identifiers and outcome variables. The dataset is designed for reproducible analysis, distributional comparison, and structured incident classification. No probabilistic modeling, inference, or causal interpretation is introduced. All outputs are traceable to source data or explicit transformation rules.
Steps to reproduce
(1) Acquire source data from the Federal Motor Carrier Safety Administration (FMCSA) Motor Carrier Management Information System (MCMIS) via the U.S. Department of Transportation data portal as a bulk CSV export.
(2) Apply structural filtering by excluding records from U.S. territories, excluding records prior to 1990, and removing records with missing or invalid required fields.
(3) Retain only the defined column set consisting of CRASH_ID, REPORT_NUMBER, REPORT_DATE, REPORT_STATE, TRAFFICWAY_ID, ACCESS_CONTROL_ID, ROAD_SURFACE_CONDITION_ID, CARGO_BODY_TYPE_ID, GVW_RATING_ID, WEATHER_CONDITION_ID, VEHICLE_CONFIGURATION_ID, HAZMAT_RELEASED, FATALITIES, INJURIES, TOW_AWAY, and CRASH_CARRIER_INTERSTATE.
(4) Perform controlled sampling by traversing records sequentially, selecting approximately one record per ten, randomizing selection within a forward window of 5–15 records, and continuing until the dataset is fully traversed.
(5) Apply deterministic transformations by reformatting dates (YYYYMMDD to YYYY-MM-DD), normalizing binary fields (Y/N to YES/NO/UNKN), and mapping categorical variables into standardized code sets (e.g., ENV_, RDS_, TYP_, STE_, ACC_, TRA_, CBT_, GVW_) using fixed lookup tables.
(6) Output each record as an independent JSON object in JSONL format while preserving identifiers and mapped fields.
(7) Validate the resulting dataset by comparing distributions (state, vehicle configuration, road condition, and weather) against the original dataset to confirm proportional consistency.
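The sampling, transformation, output, and validation steps (4 to 7) could be sketched in Python roughly as follows. This is an illustrative sketch under stated assumptions, not the code used to build the dataset. The function names, the `WEATHER_CODES` table and its `ENV_` values, the choice of which columns count as binary, and the use of UNKN for any value other than Y or N are all hypothetical. The sampler reads "a forward window of 5–15 records" as advancing by a uniformly random step of 5 to 15 records, which averages one selection per ten.

```python
import json
import random
from collections import Counter
from datetime import datetime

# Hypothetical lookup table: the actual code sets are not listed in the description.
WEATHER_CODES = {"1": "ENV_CLEAR", "2": "ENV_RAIN"}
# Assumed Y/N columns; the description does not say which fields are binary.
BINARY_FIELDS = ("HAZMAT_RELEASED", "TOW_AWAY")


def sample_indices(n_records, rng, lo=5, hi=15):
    """Yield increasing indices whose gaps are uniform in [lo, hi] (mean 10)."""
    i = -1
    while True:
        i += rng.randint(lo, hi)  # randint is inclusive on both ends
        if i >= n_records:
            return
        yield i


def reformat_date(value):
    """Convert YYYYMMDD to YYYY-MM-DD, rejecting malformed input."""
    if len(value) != 8 or not value.isdigit():
        raise ValueError(f"expected YYYYMMDD, got {value!r}")
    return datetime.strptime(value, "%Y%m%d").date().isoformat()


def normalize_flag(value):
    """Map Y/N to YES/NO; any other value becomes UNKN (an assumption)."""
    return {"Y": "YES", "N": "NO"}.get((value or "").strip().upper(), "UNKN")


def transform(record):
    """Apply the fixed-rule transformations to one record; other fields pass through."""
    out = dict(record)
    out["REPORT_DATE"] = reformat_date(record["REPORT_DATE"])
    for field in BINARY_FIELDS:
        out[field] = normalize_flag(record.get(field))
    raw = (record.get("WEATHER_CONDITION_ID") or "").strip()
    out["WEATHER_CONDITION_ID"] = WEATHER_CODES.get(raw, "ENV_UNKN")
    return out


def to_jsonl(records):
    """Serialize each record as one JSON object per line."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records)


def proportions(records, field):
    """Relative frequency of each value of `field`, for distribution checks."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


if __name__ == "__main__":
    # Synthetic rows for demonstration only; not real MCMIS data.
    rows = [
        {"CRASH_ID": str(i), "REPORT_DATE": "20200131", "REPORT_STATE": "TX",
         "HAZMAT_RELEASED": "N", "TOW_AWAY": "", "WEATHER_CONDITION_ID": "1"}
        for i in range(100)
    ]
    rng = random.Random(0)  # a fixed seed makes the sample repeatable
    sampled = [transform(rows[i]) for i in sample_indices(len(rows), rng)]
    print(to_jsonl(sampled[:2]), end="")
    print(proportions(sampled, "REPORT_STATE"))
```

Passing a seeded `random.Random` instance keeps the sample repeatable across runs. For step 7, `proportions` can be applied to the same field in the source and sampled data, and the two resulting dictionaries compared.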