Multi-Source Educational Analytics Dataset for Student Performance and Contextual Analysis

Published: 22 April 2026 | Version 1 | DOI: 10.17632/rkrptyy994.1
Contributors:
Omar Elzeki

Description

This dataset contains 32,593 records, each representing one student’s participation in a specific course presentation. It integrates three sources: academic activity data from the Open University Learning Analytics Dataset (OULAD), regional broadband speed statistics from Ofcom, and unemployment rates from the UK Office for National Statistics (ONS). The result combines educational, behavioral, and socioeconomic perspectives in a single analytical structure.
The dataset is the final Gold layer of a Medallion data architecture and is organized as a structured table with 20 features of mixed numerical and categorical types. Each record contains demographic attributes (gender, age group, education level, and region) alongside academic information (course module, year, term, number of previous attempts, and credits studied). Behavioral data is represented through interaction metrics from the Virtual Learning Environment (VLE), such as total clicks and accessed learning materials, and through engagement indicators such as submitted assessments.
In addition to internal academic data, the dataset incorporates external contextual variables: regional unemployment rates and average broadband speeds. This enrichment lets the dataset reflect not only individual academic behavior but also the broader environmental conditions that may influence student performance. Academic outcomes are represented by a binary variable indicating whether a student successfully completed the course or instead failed or withdrew.
The dataset is provided in CSV format, which is compatible with common data analysis environments such as Python, SQL-based systems, and standard machine learning frameworks.
All data sources are publicly available and fully anonymized. Personal identifiers are not included, and external indicators are aggregated at the regional level, ensuring that individual privacy is preserved. This integrated dataset supports comprehensive analysis of student engagement, performance, and contextual influences, making it suitable for educational analytics, data mining, and institutional research applications.
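The properties stated above (record count, 20 features, no empty cells, a binary outcome) can be checked after download. The following is a minimal sketch using only the Python standard library. The column names, the sample rows, and the 0/1 encoding of the outcome are assumptions made for illustration; they are not taken from the published file, whose actual schema should be inspected first.

```python
import csv
import io


def validate_gold_table(csv_text, expected_rows, expected_columns, outcome_column):
    """Return a list of problems found; an empty list means all checks passed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fieldnames = reader.fieldnames or []
    problems = []

    if len(fieldnames) != expected_columns:
        problems.append(f"expected {expected_columns} columns, found {len(fieldnames)}")
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")

    for number, row in enumerate(rows, start=1):
        # DictReader stores surplus fields under the key None.
        if None in row:
            problems.append(f"row {number}: more fields than header columns")
        # Short rows yield None values; blank cells yield empty strings.
        empty = [k for k, v in row.items()
                 if k is not None and (v is None or not v.strip())]
        if empty:
            problems.append(f"row {number}: empty values in {empty}")
        # Assumption: the binary outcome is encoded as "0" or "1".
        outcome = row.get(outcome_column)
        if outcome is not None and outcome.strip() and outcome.strip() not in {"0", "1"}:
            problems.append(f"row {number}: non-binary outcome {outcome!r}")
    return problems


# Toy stand-in for the real file; every name and value here is hypothetical.
SAMPLE_CSV = """gender,region,total_clicks,final_outcome
F,Region A,120,1
M,Region B,15,0
F,Region A,300,1
"""

print(validate_gold_table(SAMPLE_CSV, expected_rows=3, expected_columns=4,
                          outcome_column="final_outcome"))
```

For the published file, the same function would be called with the file's text, `expected_rows=32593`, `expected_columns=20`, and the actual name of the outcome column.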

Files

Steps to reproduce

The Gold Layer is produced through a structured Medallion architecture pipeline consisting of Bronze, Silver, and Gold stages.
The process begins in the Bronze Layer, where raw datasets are ingested from multiple sources without modification. These include academic interaction data from OULAD, broadband infrastructure data from Ofcom, and regional unemployment statistics from ONS. Each dataset is stored in its original format to preserve data lineage and traceability.
In the Silver Layer, data preprocessing and harmonization are applied. This stage includes cleaning missing or inconsistent values, standardizing data types (e.g., converting categorical and numeric formats), and aligning schemas across datasets. Feature names are unified, and redundant or irrelevant attributes are removed. Behavioral variables such as VLE interactions are aggregated (e.g., total clicks, number of accessed materials), and academic performance indicators are computed (e.g., weighted scores, number of submitted assessments). External datasets are transformed to match the temporal and spatial granularity of the academic data, ensuring consistency across regions and academic periods.
Next, data integration is performed by joining datasets on common dimensions such as region, academic year, and course presentation period. This step resolves temporal and spatial mismatches between internal academic data and external socioeconomic indicators. Aggregation functions are applied to ensure that external variables (e.g., unemployment rate, broadband speed) are correctly mapped to each student record based on location and time.
In the Gold Layer, the final dataset is constructed as a denormalized analytical table. All features are consolidated into a single unified schema with consistent representation. Records are validated to ensure completeness, resulting in a dataset with no missing values.
Final feature selection is applied to retain the most relevant 20 attributes, covering demographic, academic, behavioral, and environmental dimensions. The resulting dataset is stored in CSV format and made available for downstream analytics. It provides a clean, integrated, and analysis-ready structure suitable for data mining, reporting, and advanced analytical applications.
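The Silver-stage aggregation and the subsequent join on region and year can be illustrated with a small sketch. It is not the authors' pipeline: the field names, the join key `(region, year)`, the choice to assign zero activity to students without VLE events, and all values are assumptions introduced for illustration only.

```python
from collections import defaultdict

# Hypothetical Silver-layer inputs; names and numbers are placeholders,
# not real OULAD, Ofcom, or ONS values.
vle_events = [
    {"student": "s1", "presentation": "2013J", "material": "m1", "clicks": 4},
    {"student": "s1", "presentation": "2013J", "material": "m2", "clicks": 1},
    {"student": "s1", "presentation": "2013J", "material": "m1", "clicks": 2},
    {"student": "s2", "presentation": "2013J", "material": "m1", "clicks": 3},
]
students = [
    {"student": "s1", "presentation": "2013J", "region": "Region A", "year": 2013},
    {"student": "s2", "presentation": "2013J", "region": "Region B", "year": 2013},
    {"student": "s3", "presentation": "2013J", "region": "Region A", "year": 2013},
]
regional = {
    ("Region A", 2013): {"unemployment_rate": 5.0, "broadband_mbps": 20.0},
    ("Region B", 2013): {"unemployment_rate": 6.0, "broadband_mbps": 30.0},
}


def aggregate_vle(events):
    """Sum clicks and count distinct materials per (student, presentation)."""
    totals = defaultdict(lambda: {"total_clicks": 0, "materials": set()})
    for event in events:
        key = (event["student"], event["presentation"])
        totals[key]["total_clicks"] += event["clicks"]
        totals[key]["materials"].add(event["material"])
    return {
        key: {"total_clicks": v["total_clicks"],
              "materials_accessed": len(v["materials"])}
        for key, v in totals.items()
    }


def build_gold(students, events, regional):
    """Join each student record with its VLE aggregates and regional context."""
    vle = aggregate_vle(events)
    gold = []
    for s in students:
        context = regional.get((s["region"], s["year"]))
        if context is None:
            # Fail loudly rather than emit a row with missing context.
            raise KeyError(f"no regional data for {(s['region'], s['year'])}")
        # Assumption: students with no recorded VLE events get zero activity.
        activity = vle.get((s["student"], s["presentation"]),
                           {"total_clicks": 0, "materials_accessed": 0})
        gold.append({**s, **activity, **context})
    return gold


gold_rows = build_gold(students, vle_events, regional)
for row in gold_rows:
    print(row)
```

Raising an error on an unmatched region and year, instead of leaving the field empty, is one way to keep the guarantee that the final table contains no missing values; other strategies (imputation, dropping the record) are equally plausible and the source does not say which was used.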

Categories

Economics, Education, Data Mining, Data Analysis, Classification System, Analysis of Education

Licence