A Multi-Source Data Warehouse of Academic Student Risk Prediction (Education)

Published: 21 April 2026 | Version 1 | DOI: 10.17632/8tvbwh3gvb.1
Contributors:
Omar Elzeki
Description

The ARPS Integrated Dataset is a unified educational dataset developed as the final output of an Academic Risk Prediction System (ARPS) data warehouse. It contains 11,522 student records and 40 feature columns, created by integrating six publicly available datasets collected from Kaggle and the UCI Machine Learning Repository. These datasets represent diverse educational contexts, including learning management system behavioral logs, high school GPA records, secondary school grade datasets, standardized exam scores, and multi-factor student performance data.

The integration was performed through a structured ETL pipeline implemented in Microsoft SQL Server 2022 Express. The data are organized as a tabular dataset within a star schema in a data warehouse environment. Each row represents a single student and provides a comprehensive profile combining demographic, academic, behavioral, and socioeconomic attributes. The dataset contains a mix of numerical features, such as grades, exam scores, study hours, and absences, and categorical features, such as gender, educational stage, parental education level, and other contextual indicators. A dedicated “source” field traces each record back to its original dataset. After preprocessing and data cleaning, the dataset contains no missing values in any feature.

The 40 features are organized into seven main groups: identifiers, demographic attributes, academic performance indicators, behavioral metrics, family and social background, study-related characteristics, and environmental factors. Together, these groups provide a holistic view of student conditions, covering both academic outcomes and contextual influences such as family support, access to resources, and lifestyle-related behaviors.

The dataset is stored and distributed in multiple formats to ensure compatibility with different analytical environments.
It is primarily maintained within SQL Server tables and a unified analytical view, and can also be exported as a UTF-8 encoded CSV file compatible with common tools such as Python, R, and spreadsheet applications. Additionally, it can be accessed programmatically through standard database connectivity interfaces. All underlying data sources are fully anonymized and publicly available. The dataset does not contain any personally identifiable information such as names, IDs, or contact details, ensuring compliance with ethical standards for academic research.
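As a minimal sketch of working with the exported CSV, the snippet below reads a UTF-8 file with Python's standard-library csv module and uses the “source” field to trace records back to their origin. The column names and source labels shown here are illustrative, not the dataset's actual schema; the same file could equally be loaded with pandas or queried through a database connectivity interface such as pyodbc.

```python
import csv
import io

# Hypothetical two-row excerpt of the exported Gold Layer CSV
# (column names and source labels are illustrative).
sample = io.StringIO(
    "fact_id,gender,final_score,absences,source\n"
    "1,F,88.5,2,kaggle_lms_logs\n"
    "2,M,71.0,5,uci_secondary_grades\n"
)

# DictReader maps each row onto the header columns.
rows = list(csv.DictReader(sample))

# The "source" field traces every record to its original dataset.
sources = {r["source"] for r in rows}
print(sorted(sources))  # → ['kaggle_lms_logs', 'uci_secondary_grades']
```

Note that the csv module returns all values as strings; numeric columns such as final_score would need an explicit cast before analysis.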

Files

Steps to reproduce

The Gold Layer is the final analytical dataset produced after integrating and refining multiple educational data sources through a structured ETL pipeline in SQL Server.

The process begins with the ingestion of six heterogeneous datasets (Kaggle and UCI) into the Bronze Layer using SQL Server Import tools. Each dataset is stored in its raw format without modification to preserve original structure and traceability.

In the Silver Layer, data cleaning and transformation are performed. This includes standardizing column names across datasets, aligning data types (e.g., converting grades and scores to consistent numeric formats), and resolving schema mismatches. Missing values are handled through appropriate imputation techniques, ensuring that all features are complete. Categorical variables such as gender, grade letters, and parental education levels are normalized into consistent representations. Additionally, new derived features may be computed (e.g., average scores or aggregated behavioral indicators). A “source” column is added or preserved to track the origin of each record.

Next, data integration is executed by merging the cleaned datasets into a unified schema. This involves aligning feature sets across all sources and performing vertical concatenation (union) of records. Feature harmonization ensures that similar attributes from different datasets (e.g., grades, scores, attendance) are mapped into common columns. Redundant or conflicting attributes are resolved, and consistent naming conventions are enforced.

In the Gold Layer, the final analytical dataset is constructed as a denormalized flat table or a SQL view (e.g., gold_vw_ML_Dataset). This layer consolidates all transformed data into a single, analysis-ready structure. Key steps include selecting the final set of 40 features, ensuring zero missing values, and validating data consistency. Identifier fields (e.g., fact_id) are generated for uniqueness, and all records are indexed for efficient querying.
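The harmonization and vertical-union steps above can be sketched in miniature. In the actual pipeline this is done with SQL in the Silver Layer; the Python sketch below only illustrates the idea: each source's raw column names are mapped onto a common schema, the origin is recorded in a “source” field, and the harmonized records are concatenated. All column names, rename maps, and values here are hypothetical.

```python
# Per-source rename maps from raw column names to the common schema
# (both the source names and the raw column names are illustrative).
RENAME = {
    "kaggle_exam_scores": {"sex": "gender", "exam_score": "final_score",
                           "days_absent": "absences"},
    "uci_secondary_grades": {"gender": "gender", "G3": "final_score",
                             "absences": "absences"},
}

def harmonize(record, source):
    """Map a raw record onto the common schema and tag its origin."""
    row = {common: record[raw] for raw, common in RENAME[source].items()}
    row["source"] = source  # preserve traceability
    return row

# Tiny hypothetical extracts from two cleaned Silver Layer sources.
raw_a = [{"sex": "F", "exam_score": 88.5, "days_absent": 2}]
raw_b = [{"gender": "M", "G3": 14.0, "absences": 5}]

# Vertical concatenation (union) of the harmonized records.
unified = ([harmonize(r, "kaggle_exam_scores") for r in raw_a]
           + [harmonize(r, "uci_secondary_grades") for r in raw_b])
print(len(unified), unified[1]["source"])  # → 2 uci_secondary_grades
```

A real pipeline would additionally rescale scores recorded on different grading scales (e.g., 0-20 vs. 0-100) into a consistent numeric format, as described above.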
Finally, the dataset is exported or exposed for downstream use. It can be accessed directly through SQL Server tables or views, or exported as a UTF-8 CSV file compatible with Python, R, and other analytics tools. The resulting Gold Layer serves as a clean, unified, and high-quality dataset ready for analysis, reporting, and machine learning workflows.
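The “zero missing values” validation mentioned above can be expressed as a simple completeness check. In practice this would run as a COUNT query against the SQL view; the Python sketch below shows the equivalent logic on exported rows, with hypothetical feature names.

```python
def missing_value_report(rows, features):
    """Count empty or null cells per feature across all records."""
    return {f: sum(1 for r in rows if r.get(f) in (None, ""))
            for f in features}

# Hypothetical exported records (feature names are illustrative).
rows = [
    {"fact_id": 1, "final_score": 88.5, "absences": 2},
    {"fact_id": 2, "final_score": 71.0, "absences": 5},
]

report = missing_value_report(rows, ["fact_id", "final_score", "absences"])
complete = all(count == 0 for count in report.values())
print("complete" if complete else "incomplete")  # → complete
```

The same report doubles as a quick sanity check after loading the CSV into any downstream environment.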

Institutions

Categories

Education, Data Mining, Machine Learning, Feature Selection, Multi-Classifiers, Predictive Modeling

Licence