Data and Codes for "Analyzing Price Efficiency Using Machine Learning Generated Price Indices: the Case of the Chilean Used Car Market"

Published: 15 July 2025 | Version 1 | DOI: 10.17632/jxby8pkww5.1
Contributors:
Fernando Diaz

Description

This dataset contains the data and code needed to replicate the main findings of the article, which examines how new car import prices affect the valuation of used vehicles in Chile's secondary car market. The study uses event study and difference-in-differences (DiD) methodologies to evaluate price efficiency.
The replication package includes:
- Used car price indexes dataset: filtered and pre-processed to include key vehicle attributes (model, year, mileage, transmission, fuel type, seller type, region, etc.).
- New car import dataset: built from official customs records covering shipment dates, CIF prices, and units imported per model and version.
- R and Python scripts for estimation and analysis:
  - "Unit Root Test.R": tests for stationarity using Im, Pesaran, and Shin (IPS) panel tests.
  - "Event Study CARS.py": performs event study estimation using cumulative average abnormal returns (CAARs), including subsample analyses by vintage and vehicle segment.
  - "DiD Event Studies.R": estimates difference-in-differences (DiD) regressions using staggered treatment timing and fixed effects, and calculates CAARs from fitted values.
Each script includes comments and references to the corresponding tables and figures in the paper.
Key findings that can be replicated using these files include:
- Evidence of prompt and statistically significant price responses in the used car market following increases in new import prices.
- Stronger responses among newer and high-end used cars.
- Responses that occur before the public release of import data, suggesting high informational efficiency.
- Results that are robust across different methodological approaches and sample partitions.
This replication package allows for the independent verification of results. All datasets are anonymized and formatted for reproducibility. The scripts are compatible with R 4.2+ and Python 3.8+ environments.

Files

Steps to reproduce

Data Collection and Reproducibility
This replication package provides all the necessary information to reproduce the main results of the article. Due to privacy, licensing, and file size limitations, the raw dataset of more than 2.7 million used car advertisements is not included. Instead, the package includes fully processed and aggregated data outputs, such as synthetic price indices by model and vintage, matched new car import records, and all panel datasets used for econometric analysis.
The used car data were originally collected from Chilean online marketplaces between June 2020 and June 2023. Automated scripts were used to download ads weekly. Each listing contained information such as brand, model, year, mileage, price, transmission type, fuel type, seller category, and location. The raw listings were cleaned using a reproducible workflow that involved removing observations with missing or inconsistent values, standardizing text fields (e.g., model names), trimming outliers (top and bottom 5% by price and mileage), and filtering to retain only post-2010 vehicles with sufficient representation. These cleaned listings were then used to train price prediction models.
Synthetic price indices were constructed using Random Forest regressors trained on the cleaned data. The models were tuned using 10-fold cross-validation and hyperparameter optimization. For each model-vintage combination, synthetic vehicles with representative attributes were created, and weekly predicted prices were generated. These prices serve as the basis for the price indices included in the replication files.
New car import records were obtained from the Chilean National Customs Agency, cleaned, and matched to used car models. The resulting data include shipment dates, CIF unit prices, and quantities imported for each model and version. String harmonization algorithms and manual validation were used to match the data and ensure consistency.
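The outlier-trimming step in the cleaning workflow can be illustrated with a minimal sketch. It is not taken from the replication scripts: the function name trim_outliers, the field names, the toy data, and the choice to apply the 5% cutoffs to each field independently using inclusive percentiles are all assumptions made for illustration.

```python
from statistics import quantiles


def trim_outliers(listings, fields=("price", "mileage")):
    """Drop listings below the 5th or above the 95th percentile of any field.

    Assumption: cutoffs are computed per field on the full sample; the
    original workflow may compute them differently (e.g. within model groups).
    """
    bounds = {}
    for field in fields:
        # n=20 yields 19 cut points; the first and last are the
        # 5th and 95th percentiles of the field.
        cuts = quantiles([row[field] for row in listings], n=20, method="inclusive")
        bounds[field] = (cuts[0], cuts[-1])
    return [
        row
        for row in listings
        if all(bounds[f][0] <= row[f] <= bounds[f][1] for f in fields)
    ]


# Hypothetical toy data: 101 listings, mileage falling as price rises.
listings = [{"price": p, "mileage": 1000 * (102 - p)} for p in range(1, 102)]
trimmed = trim_outliers(listings)
print(len(listings), "->", len(trimmed))
```

Because the raw advertisements are not distributed, a sketch like this only shows the logic of the step; the processed files in the package already reflect the authors' own cleaning.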
To replicate the econometric results, the package includes panel datasets with abnormal return series, market indices, treatment flags, and event definitions. Two main estimation strategies were implemented: an event study approach based on cumulative average abnormal returns (CAARs) and a difference-in-differences (DiD) regression with fixed effects and staggered treatment timing. The analyses are conducted in R and Python, and all estimation scripts are included: "Event Study CARS.py" conducts the event study analysis, "DiD Event Studies.R" performs the DiD regressions, and "UnitRootTest.R" checks for stationarity in the price indices. Each script is documented and linked to the corresponding output in the manuscript. Reproducing the full analysis requires both R (version 4.2 or later) and Python (version 3.8 or later), since the scripts are split across the two languages. This approach ensures transparency and full reproducibility of the main results discussed in the article.
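As a rough illustration of the event study calculation, the sketch below computes CAARs from per-event abnormal returns aligned in event time. It is not the code in "Event Study CARS.py": the function names, the toy numbers, and the market-adjusted benchmark (asset return minus market index return) are assumptions, and the paper's script may use a different expected-return model.

```python
from itertools import accumulate


def market_adjusted_ar(asset_returns, market_returns):
    """Abnormal return per period: asset return minus market return.

    Assumption: a simple market-adjusted benchmark, chosen for brevity.
    """
    return [r - m for r, m in zip(asset_returns, market_returns)]


def caar(event_ars):
    """Cumulative average abnormal returns over the event window.

    event_ars holds one list of abnormal returns per event, all aligned
    in event time and of equal length.
    """
    n_events = len(event_ars)
    # Average abnormal return (AAR) across events at each event-time step.
    aar = [sum(step) / n_events for step in zip(*event_ars)]
    # Running sum of the AARs gives the CAAR path.
    return list(accumulate(aar))


# Hypothetical toy inputs: two events, three weeks in the event window.
market = [0.01, 0.01, 0.01]
events = [
    market_adjusted_ar([0.02, 0.03, 0.00], market),
    market_adjusted_ar([0.04, 0.01, 0.02], market),
]
caar_path = caar(events)
print(caar_path)
```

In practice the abnormal return series shipped in the panel datasets would replace the toy inputs, and inference on the CAARs would require the standard errors computed in the authors' scripts.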

Institutions

Universidad Tecnica Federico Santa Maria

Categories

Automobile, Relation between Information and Market Efficiency

Licence