The Emigration Frontier: Income Gaps, Population Scale, and Bounded Migration Adjustment

Published: 1 June 2026 | Version 1 | DOI: 10.17632/pyb56yg9s7.1
Contributor:
Robert Mullings

Description

This dataset provides the full replication package for "The Emigration Frontier: Income Gaps, Population Scale, and Bounded Migration Adjustment" (Mullings, 2026). Drawing on a global panel of 183 countries since 1970, the paper asks why some countries relieve income gaps through large-scale emigration while others do not, despite facing similar incentives to migrate. The central object of analysis is an upper envelope on cross-country emigration outcomes, the "emigration frontier", populated mainly by small, socially globalized economies with concentrated diaspora networks. Average emigration responses to income gaps are weak and unstable, but upper-tail responses near the frontier are large, persistent, and economically meaningful. Reductions in mobility barriers shift this frontier outward asymmetrically, with treatment effects increasing monotonically across quantiles. The empirical strategy combines unconditional quantile regression with a quantile difference-in-differences design exploiting the 1998 opening of European Union accession negotiations with the A8 countries.

The package contains the processed analysis dataset (merged_df_final_revision.csv, 9,667 country-year observations covering 1970–2022) and all code required to reproduce every table and figure in the manuscript and appendices. Code is provided in three languages corresponding to the paper's analytical pipeline: R (Manual_Data_Analysis.R) for upstream data assembly from seven raw public-domain sources; Stata (estimate_final_v3.do) for the main quantile regressions reported in Tables 1–4; and Python scripts 01–05 for the Section 4 distributional figures, the Section 7 difference-in-differences and event-study results (Tables 5–6, Figures 7–10), and Appendices B, D, and E. A driver script (run_all.py) executes the Python pipeline end-to-end. See README.md for software requirements, expected runtimes, and the precise mapping from each output file to its location in the manuscript.

Seven raw input files are not redistributed because they remain subject to the licences of their original providers, but all are openly accessible: Standaert-Rayp bilateral migration (Mendeley Data, DOI 10.17632/cpt3nh6jct.2), Penn World Tables v11.0, KOF Globalisation Index 2025, Barro-Lee education v3, UCDP/PRIO Armed Conflict v25.1, UN World Population Prospects age data, and Quality of Government Standard TS (January 2026). Direct download URLs and expected filenames are documented in README.md. Replicators who wish to regenerate the processed CSV from raw sources should place these seven files in the working directory before running the R script; replicators content with the processed CSV can skip directly to the Stata and Python analyses.
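
Before running any analysis, it can help to confirm that the downloaded processed CSV has the documented number of observations. The sketch below is not part of the replication package. It assumes the file has a single header row and UTF-8 encoding, neither of which is stated above, and the helper name count_data_rows is hypothetical.

```python
import csv
from pathlib import Path

EXPECTED_ROWS = 9667  # country-year observations stated in the description


def count_data_rows(csv_path):
    """Count non-empty data rows in a CSV file, excluding one header row.

    Assumption: the file has exactly one header line and is UTF-8 encoded.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the assumed header row
        return sum(1 for row in reader if row)


if __name__ == "__main__":
    path = Path("merged_df_final_revision.csv")
    if path.exists():
        n = count_data_rows(path)
        status = "matches" if n == EXPECTED_ROWS else "differs from"
        print(f"{n} data rows; this {status} the documented {EXPECTED_ROWS}")
    else:
        print(f"{path} not found in the working directory")
```

A mismatch would most likely point to an incomplete download or to a file layout that differs from the assumptions above, rather than to a problem with the analysis scripts.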

Steps to reproduce

1. Download and extract the replication package. All scripts assume that the working directory is the unzipped folder containing merged_df_final_revision.csv, the code files, and README.md.

2. Software requirements. Python 3.10 or later, R 4.3 or later (only needed if regenerating the processed CSV from raw sources; see step 6 below), and Stata 17 or later. From the working directory, install the Python dependencies with pip install -r requirements.txt. Required packages are pandas≥1.5, numpy≥1.23, statsmodels≥0.13, matplotlib≥3.5, and scipy≥1.9.

3. Reproduce the Python pipeline by running python run_all.py from the working directory. This driver executes the five analysis scripts in order:
   01_figure2_section4.py: Section 4 distributional figures, including Figures 2 and 3 and the frontier visualisations.
   02_appendix_B_undesa.py: Appendix B UN DESA cross-validation.
   03_section7_did.py: Section 7 quantile difference-in-differences, producing Tables 5 and 6, Figures 7 through 10, the event-study coefficients, and the joint pre-trend Wald tests.
   04_appendix_D_oos.py: Appendix D out-of-sample validation, Tables D.1 through D.3.
   05_appendix_E_augmented_robustness.py: Appendix E augmented-specification robustness.
   All outputs are written to output/. Total runtime is approximately 20–30 minutes on a modern laptop, dominated by the 500-permutation randomization-inference loop in script 03.

4. Reproduce the main quantile regressions (Tables 1, 2, 3, and 4) by running estimate_final_v3.do in Stata from the working directory. The script reads merged_df_final_revision.csv directly, fits the headline specifications, and writes a console log. Expected runtime is approximately 5–10 minutes.

5. Verify reproduction by comparing the values in output/ and the Stata log against the corresponding tables and figures in the manuscript. The README.md provides a precise mapping of each output file to its location in the paper. The joint pre-trend tests (output/pretrend_wald_tests.csv) should give χ²(4) = 7.04 (p = 0.134) at τ = 0.50 and χ²(4) = 94.02 (p < 0.001) at τ = 0.90, matching the values reported in Section 7.3.

6. (Optional) Regenerate the processed dataset from raw sources. The included merged_df_final_revision.csv is sufficient for full replication of every table and figure in the manuscript. To rebuild it from raw inputs, place the seven external data files (listed in README.md with download URLs) in the working directory and run Rscript Manual_Data_Analysis.R. Expected runtime is approximately 10–15 minutes. The output will be identical to the provided CSV.
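
The p-values quoted for the joint pre-trend tests can be checked by hand from the reported χ² statistics. The sketch below is an illustration, not code from the package. It uses the closed-form upper-tail probability of a chi-square distribution with an even number of degrees of freedom, so it needs only the standard library, and the function name chi2_sf_even_df is hypothetical. With scipy installed, scipy.stats.chi2.sf(x, df) is the general-purpose equivalent.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2k the survival function has the closed form
    exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    terms = sum(half**j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * terms


# Statistics reported for the joint pre-trend Wald tests (4 degrees of freedom).
for tau, stat in [(0.50, 7.04), (0.90, 94.02)]:
    p = chi2_sf_even_df(stat, df=4)
    print(f"tau={tau:.2f}: chi2(4)={stat:.2f}, p={p:.3g}")
```

For the τ = 0.50 statistic this evaluates to roughly 0.134, and for τ = 0.90 the result is far below 0.001, consistent with the values quoted in the verification step.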

Categories

Geography, Economics, Econometrics, Demography, Development Planning
