From Expansion to Elimination, DATA
Description
This project performs a Bayesian hierarchical analysis to investigate the factors influencing energy cost burden across different ZIP codes and years. Using panel data from multiple Excel files spanning several years (2012-2022), the project models the relationship between energy cost burden and several predictors: tax_returns, uptake (presumably related to program participation or energy efficiency measures), and percent_white.
The core of the analysis involves:
Data Loading and Preprocessing: Combining data from multiple years, handling missing values, and standardizing predictor variables.
Hierarchical Modeling: Building a Bayesian hierarchical model in PyMC that accounts for variation across both ZIP codes and years through random effects.
Inference: Estimating the posterior distributions of the model parameters with both variational inference (ADVI) and Markov Chain Monte Carlo (MCMC), specifically the No-U-Turn Sampler (NUTS).
Diagnostics and Comparison: Analyzing the convergence diagnostics (R-hat, ESS, divergences) for the MCMC samples and comparing the ADVI and NUTS results to understand how reliable each inference method is for this model and dataset.
Exploratory Analysis: Basic data exploration, including summary statistics, correlation analysis, and time trends of key variables.
The project highlights the importance of using robust MCMC methods like NUTS for complex models, especially when simpler approximations like ADVI might yield conflicting conclusions, and includes steps to improve sampler performance and assess convergence.
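The preprocessing and model structure described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names (standardize, encode_groups, build_model, fit_both), the Normal likelihood, the prior scales, the non-centered random effects, and the settings n=20_000 and target_accept=0.95 are all assumptions made for the example. Only the use of PyMC, ADVI, NUTS, crossed ZIP and year random effects, and the seed 42 come from the description.

```python
import numpy as np


def standardize(x):
    """Z-score a 1-D array, ignoring NaNs when computing mean and std."""
    x = np.asarray(x, dtype=float)
    return (x - np.nanmean(x)) / np.nanstd(x)


def encode_groups(labels):
    """Map group labels (ZIP codes or years) to integer indices 0..K-1."""
    uniques, idx = np.unique(np.asarray(labels), return_inverse=True)
    return uniques, idx


def build_model(y, X, zip_idx, year_idx, n_zip, n_year):
    """Hypothetical model: fixed effects plus crossed ZIP and year intercepts."""
    import pymc as pm  # imported lazily so the helpers above work without PyMC

    with pm.Model() as model:
        intercept = pm.Normal("intercept", mu=0.0, sigma=1.0)
        # One coefficient per standardized predictor column in X
        beta = pm.Normal("beta", mu=0.0, sigma=1.0, shape=X.shape[1])

        # Non-centered random effects (an assumption; this form often
        # reduces divergences in hierarchical models)
        sigma_zip = pm.HalfNormal("sigma_zip", sigma=1.0)
        sigma_year = pm.HalfNormal("sigma_year", sigma=1.0)
        z_zip = pm.Normal("z_zip", mu=0.0, sigma=1.0, shape=n_zip)
        z_year = pm.Normal("z_year", mu=0.0, sigma=1.0, shape=n_year)

        mu = (
            intercept
            + pm.math.dot(X, beta)
            + sigma_zip * z_zip[zip_idx]
            + sigma_year * z_year[year_idx]
        )
        sigma = pm.HalfNormal("sigma", sigma=1.0)
        pm.Normal("burden", mu=mu, sigma=sigma, observed=y)
    return model


def fit_both(model, seed=42):
    """Fit with ADVI and NUTS, then collect the NUTS diagnostics."""
    import arviz as az
    import pymc as pm

    with model:
        approx = pm.fit(n=20_000, method="advi", random_seed=seed)
        advi_idata = approx.sample(1_000, random_seed=seed)
        nuts_idata = pm.sample(target_accept=0.95, random_seed=seed)

    summary = az.summary(nuts_idata)  # includes r_hat, ess_bulk, ess_tail
    n_divergences = int(nuts_idata.sample_stats["diverging"].sum())
    return advi_idata, nuts_idata, summary, n_divergences
```

In use, each predictor column would be passed through standardize, the ZIP and year columns through encode_groups, and the resulting arrays handed to build_model and then fit_both. Comparing the ADVI and NUTS posterior summaries for beta is one way to see whether the two methods agree.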
Files
Steps to reproduce
To reproduce this analysis in a Google Colab environment:
Upload Data: Upload the following Excel files to the /content/ directory:
Correct 2013_Diagnostics.xlsx
Correct2011_Diagnostics.xlsx
Correct2016_Diagnostics.xlsx
Correct2019_Diagnostics.xlsx
Run Notebook: Execute all code cells sequentially within this Google Colab notebook.
Verify Environment: Ensure the Colab runtime uses Python 3.12.11 and that the specified library versions (PyMC 5.25.1, ArviZ 0.22.0, NumPy 2.0.2, Pandas 2.2.2) are installed (the first code cell handles installation).
Check Random Seeds: Confirm that the random seeds (random_state=42 or random_seed=42) are maintained in the relevant data preprocessing and sampling steps.
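The environment check in the steps above could be automated with a short script such as the one below. It is a sketch using only the Python standard library. The function name check_environment is hypothetical, and the script assumes the libraries are installed under their usual PyPI distribution names (pymc, arviz, numpy, pandas).

```python
import sys
from importlib import metadata

# Versions listed in the reproduction steps
EXPECTED_PYTHON = "3.12.11"
EXPECTED_PACKAGES = {
    "pymc": "5.25.1",
    "arviz": "0.22.0",
    "numpy": "2.0.2",
    "pandas": "2.2.2",
}


def check_environment(expected=EXPECTED_PACKAGES, expected_python=EXPECTED_PYTHON):
    """Return a list of mismatch messages; an empty list means everything matches."""
    problems = []

    actual_python = ".".join(str(part) for part in sys.version_info[:3])
    if actual_python != expected_python:
        problems.append(f"python: found {actual_python}, expected {expected_python}")

    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed, expected {wanted}")
            continue
        if found != wanted:
            problems.append(f"{name}: found {found}, expected {wanted}")

    return problems


if __name__ == "__main__":
    issues = check_environment()
    if issues:
        print("Environment differs from the documented setup:")
        for issue in issues:
            print("  " + issue)
    else:
        print("Environment matches the documented versions.")
```

Running this in a Colab cell after the installation cell gives a quick signal of whether the runtime has drifted from the documented versions, which can change sampler output even when the seeds are fixed.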