A Multi-Source Integrated Benchmark for Student Academic Risk Prediction (Education)
Description
The ARPS Integrated Dataset serves as a high-fidelity, multidimensional benchmark for Educational Data Mining (EDM), consolidating 11,523 records across 40 features. Engineered through a SQL Server 2022 ETL pipeline, it synthesizes six heterogeneous sources—including LMS behavioral logs (xAPI-Edu-Data), secondary-school performance records (UCI), and multi-factor performance datasets (Kaggle) into a unified, cohesive analytical repository. Architected under a Star Schema, the dataset ensures a 0% null rate and high structural integrity. It offers a holistic student profile across seven thematic domains: Identifiers, Demographics, Academic Performance, Behavioral Engagement, Family/Social Background, Study Characteristics, and Environmental Factors. This integration enables deep correlation analysis between digital behaviors, socioeconomic status, and academic outcomes. By providing a unified view of these disparate factors, the repository supports the development of predictive frameworks to categorize students into tiered risk levels (Low, Medium, and High). The dataset includes mixed data types: numerical (grades, GPA, absences), categorical (gender, parental education), and identifier fields. It is distributed in multiple formats, including SQL Server tables and UTF-8 encoded CSV files, ensuring compatibility with Python (pandas), R, and standard statistical tools.
Files
Steps to reproduce
The Gold Layer is the final analytical dataset produced after integrating and refining multiple educational data sources through a structured ETL pipeline in SQL Server. The process begins with the ingestion of six heterogeneous datasets (Kaggle and UCI) into the Bronze Layer using SQL Server Import tools. Each dataset is stored in its raw format without modification to preserve original structure and traceability. In the Silver Layer, data cleaning and transformation are performed. This includes standardizing column names across datasets, aligning data types (e.g., converting grades and scores to consistent numeric formats), and resolving schema mismatches. Missing values are handled through appropriate imputation techniques, ensuring that all features are complete. Categorical variables such as gender, grade letters, and parental education levels are normalized into consistent representations. Additionally, new derived features may be computed (e.g., average scores or aggregated behavioral indicators). A “source” column is added or preserved to track the origin of each record. Next, data integration is executed by merging the cleaned datasets into a unified schema. This involves aligning feature sets across all sources and performing vertical concatenation (union) of records. Feature harmonization ensures that similar attributes from different datasets (e.g., grades, scores, attendance) are mapped into common columns. Redundant or conflicting attributes are resolved, and consistent naming conventions are enforced. In the Gold Layer, the final analytical dataset is constructed as a denormalized flat table or a SQL view (e.g., gold_vw_ML_Dataset). This layer consolidates all transformed data into a single, analysis-ready structure. Key steps include selecting the final set of 40 features, ensuring zero missing values, and validating data consistency. Identifier fields (e.g., fact_id) are generated for uniqueness, and all records are indexed for efficient querying. Finally, the dataset is exported or exposed for downstream use. It can be accessed directly through SQL Server tables or views, or exported as a UTF-8 CSV file compatible with Python, R, and other analytics tools. The resulting Gold Layer serves as a clean, unified, and high-quality dataset ready for analysis, reporting, and machine learning workflows.
Institutions
- New Mansoura UniversityDakahlia, Al Mansurah
- Mansoura UniversityDakahlia, Al Mansurah