Benchmark data for five high-dimensional regression estimators under AR(1) Toeplitz designs

Published: 2 March 2026 | Version 1 | DOI: 10.17632/n8ffhnx97t.1
Contributor:
Fred Torres-Cruz

Description

This dataset contains the Monte Carlo simulation results used in the study "Estimation Accuracy vs. Inferential Validity in High-Dimensional Regression: A Monte Carlo Benchmark of Five Estimators". It benchmarks five high-dimensional linear regression methods (LASSO, Ridge, Elastic Net, SCAD, and Debiased LASSO via ridge projection) across 36 design scenarios, formed by crossing three sample sizes (50, 100, and 200), two predictor dimensions (200 and 500 variables), two signal-to-noise ratios (1 and 3), and three random seeds (42, 123, and 456). All simulations use an AR(1) Toeplitz covariance structure with correlation parameter 0.5 and a sparsity level of 10 nonzero coefficients. The unified CSV file contains 180 rows (36 scenarios, each evaluated with five methods) and 19 variables. It reports coefficient estimation error, prediction error, variable-selection performance (including F1 score and support recovery), empirical 95% confidence-interval coverage and width, and computational runtime. The data are entirely synthetic, generated from sparse Gaussian linear models, and contain no personal or sensitive information.
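
As a quick check on the counts quoted above, the scenario grid can be enumerated directly. This is a minimal sketch in Python using only the standard library. The variable names are illustrative and are not taken from the dataset or from the study's own scripts.

```python
from itertools import product

# Design factors as listed in the description.
sample_sizes = (50, 100, 200)
dimensions = (200, 500)
snrs = (1, 3)
seeds = (42, 123, 456)
methods = ("LASSO", "Ridge", "Elastic Net", "SCAD", "Debiased LASSO")

# Full factorial crossing of the four design factors.
scenarios = list(product(sample_sizes, dimensions, snrs, seeds))

# One result row per (scenario, method) pair.
rows = [(n, p, snr, seed, m) for (n, p, snr, seed) in scenarios for m in methods]

print(len(scenarios), len(rows))  # prints "36 180"
```

The crossing gives 3 x 2 x 2 x 3 = 36 scenarios and 36 x 5 = 180 rows, which matches the row count of the CSV file.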

Files

Steps to reproduce

1. Set the random seed to one of the following values: 42, 123, or 456.
2. Generate predictor variables from a multivariate normal distribution with an AR(1) Toeplitz covariance structure and correlation parameter 0.5.
3. Set the number of predictors to either 200 or 500 and the sample size to 50, 100, or 200.
4. Specify 10 nonzero regression coefficients drawn from a standard normal distribution and set all remaining coefficients to zero.
5. Generate Gaussian noise corresponding to signal-to-noise ratios of 1 or 3.
6. Fit the following models using standard R implementations:
   • LASSO, Ridge, and Elastic Net (glmnet)
   • SCAD (ncvreg)
   • Debiased LASSO via ridge projection (hdi package)
7. Compute performance metrics including coefficient mean squared error, prediction mean squared error, F1 score, confidence interval coverage, confidence interval width, support recovery, and runtime.
8. Repeat for all 36 design scenarios and combine results into a unified dataset containing 180 rows.
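
The data-generating part of the procedure (steps 2 to 5) and two of the metrics from step 7 can be sketched as follows. This is an illustration in plain Python, not the study's R code, and it rests on several assumptions that the record does not state: the signal-to-noise ratio is taken to be Var(x'beta) / sigma^2, the 10 nonzero coefficients are placed at random positions, coefficient MSE is averaged over all p coefficients, and the function names are hypothetical. Results will not reproduce the published numbers, since the random number streams differ from R's.

```python
import math
import random


def ar1_row(p, rho, rng):
    """Draw one row x ~ N(0, Sigma) with Sigma[i][j] = rho ** abs(i - j)."""
    scale = math.sqrt(1.0 - rho * rho)
    row = [rng.gauss(0.0, 1.0)]
    for _ in range(p - 1):
        # Stationary AR(1) recursion: every entry keeps unit variance.
        row.append(rho * row[-1] + scale * rng.gauss(0.0, 1.0))
    return row


def simulate(n, p, snr, seed, rho=0.5, k=10):
    """Generate (X, y, beta, sigma) for one scenario."""
    rng = random.Random(seed)
    X = [ar1_row(p, rho, rng) for _ in range(n)]
    # Assumption: support positions are chosen uniformly at random.
    support = sorted(rng.sample(range(p), k))
    beta = [0.0] * p
    for j in support:
        beta[j] = rng.gauss(0.0, 1.0)
    # Signal variance beta' Sigma beta; only support entries contribute.
    signal_var = sum(
        beta[i] * beta[j] * rho ** abs(i - j) for i in support for j in support
    )
    # Assumption: SNR = signal variance / noise variance.
    sigma = math.sqrt(signal_var / snr)
    y = [
        sum(row[j] * beta[j] for j in support) + sigma * rng.gauss(0.0, 1.0)
        for row in X
    ]
    return X, y, beta, sigma


def coef_mse(beta_hat, beta):
    """Mean squared error over all coefficients (assumed definition)."""
    return sum((a - b) ** 2 for a, b in zip(beta_hat, beta)) / len(beta)


def f1_score(true_support, selected):
    """F1 score of a selected variable set against the true support."""
    tp = len(set(true_support) & set(selected))
    if tp == 0:
        return 0.0
    precision = tp / len(set(selected))
    recall = tp / len(set(true_support))
    return 2 * precision * recall / (precision + recall)


X, y, beta, sigma = simulate(n=50, p=200, snr=3, seed=42)
```

The AR(1) recursion avoids building and factorising the full p x p covariance matrix, which keeps the sketch free of external dependencies. The model fits in step 6 would still be run in R with glmnet, ncvreg, and hdi as the record describes.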

Categories

Statistics, Data Science, Machine Learning, High-Dimensional Data Analysis

Licence