Machine Learning Modelling of Groundwater Water Quality Index (WQI)

Published: 27 January 2026 | Version 1 | DOI: 10.17632/xjdd8jyjz5.1
Contributors:
BISWANATH MAHANTY

Description

The dataset comprises groundwater quality data collected in the Jajpur district, Odisha, India, during a post-monsoon sampling campaign. Twelve hydrochemical parameters were used as predictors for Water Quality Index (WQI) modelling, with the WQI as the response variable. The dataset was analysed using Bayesian hyperparameter–optimized artificial neural network (ANN), random forest (RF), and multiple linear regression (MLR) models. Model evaluation included cross-validation, resilience testing, and predictor importance analysis. This submission includes the compiled groundwater dataset, the optimized model architectures, and the source code used for analysis and graphical representation.
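As an illustration of the cross-validation step for an MLR model, the following is a minimal Python sketch, not the repository's MATLAB code. It rests on several assumptions: synthetic data with three predictors stand in for the twelve hydrochemical parameters, the split is 5-fold, and the score is the root-mean-square error (RMSE). The function names (solve_linear_system, fit_mlr, predict_mlr, kfold_rmse) are hypothetical.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular; no special handling otherwise.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_mlr(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    Z = [[1.0] + list(row) for row in X]
    p = len(Z[0])
    ztz = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    zty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return solve_linear_system(ztz, zty)


def predict_mlr(beta, X):
    """Apply fitted coefficients (intercept first) to new rows."""
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], row)) for row in X]


def kfold_rmse(X, y, k=5, seed=0):
    """Return the held-out RMSE of each of k shuffled folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        beta = fit_mlr([X[i] for i in train_idx], [y[i] for i in train_idx])
        pred = predict_mlr(beta, [X[i] for i in test_idx])
        mse = sum((p - y[i]) ** 2 for p, i in zip(pred, test_idx)) / len(test_idx)
        scores.append(mse ** 0.5)
    return scores


# Synthetic stand-in data (assumption): 60 samples, 3 predictors, small noise.
rng = random.Random(42)
X = [[rng.uniform(0, 10) for _ in range(3)] for _ in range(60)]
y = [2.0 + 1.5 * a - 0.7 * b + 0.3 * c + rng.gauss(0, 0.1) for a, b, c in X]

scores = kfold_rmse(X, y, k=5)
print("RMSE per fold:", [round(s, 3) for s in scores])
```

Each fold is held out once while the model is fitted on the remaining samples, so the spread of the per-fold RMSE values gives a rough indication of how stable the fit is across subsets of the data.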

Files

Steps to reproduce

The repository is organized into four main folders.

The main code (WQA_analysis_Jajpur.m) performs the complete water quality assessment on the original post-monsoon groundwater dataset (a0_Postmonsoon_JAJAPUR.mat). This script calls functions from the "common_codes" folder to run descriptive statistical analyses (minimum, maximum, mean, standard deviation, skewness, kurtosis, data distribution, and correlation heatmaps) and to build the standard water quality models whose result is used as the response for the machine-learning approaches, namely multiple linear regression (MLR), artificial neural networks (ANN), and random forest (RF).

The "variable_selection" folder contains the input variable datasets and the corresponding model responses (b0_X_GQ.mat). Its main script (b1_variable_selection_main.m) invokes additional code within this folder to implement both filter- and wrapper-based feature selection methods, including Bayesian selection, MRMR, PPE, sequential, and stepwise approaches, along with scripts for figure generation.

The "Bootstrap_new" folder includes scripts for bootstrap resampling and repeated Bayesian variable selection, used to evaluate model robustness and feature stability.

Finally, the "Map Codes" folder contains the spatial data files (myshape1.shp and ODISHA_SUBDISTRICT_BDY.dbf) and the main script (mapping_exercise.m) used to generate all spatial distribution maps presented in the study.
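The descriptive statistics listed above (minimum, maximum, mean, standard deviation, skewness, kurtosis) can be sketched for a single parameter as follows. This is an illustrative Python example, not the repository's MATLAB code. The describe function and the sample readings are hypothetical, and the conventions are assumptions: sample standard deviation (n - 1 denominator), moment-based skewness, and non-excess kurtosis (a normal distribution gives 3). The repository scripts may use different conventions.

```python
import math


def describe(values):
    """Summary statistics for one parameter.

    Requires at least two distinct values, otherwise the variance is zero
    and skewness/kurtosis are undefined.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Central moments with an n denominator (assumed convention).
    m2 = sum(d ** 2 for d in dev) / n
    m3 = sum(d ** 3 for d in dev) / n
    m4 = sum(d ** 4 for d in dev) / n
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "std": math.sqrt(m2 * n / (n - 1)),  # sample standard deviation
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,            # non-excess kurtosis
    }


# Hypothetical readings for one hydrochemical parameter (not from the dataset).
readings = [6.8, 7.1, 7.4, 6.9, 7.9, 7.2, 7.0, 8.3]
summary = describe(readings)
for name, value in summary.items():
    print(f"{name:>8}: {value:.3f}")
```

Applying the same function to each of the twelve predictor columns in turn would yield a per-parameter summary table.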

Institutions

Karunya University, Siksha O Anusandhan University

Categories

Artificial Neural Network, Correlation Analysis, Multiple Linear Regression, Drinking Water Quality, Random Decision Forest

Licence