Independent edge evaluation for directed acyclic graphs in probabilistic sequence modeling

Published: 18 May 2026 | Version 1 | DOI: 10.17632/sn5r9g45rr.1
Contributors:

Description

This repository contains the official replication package and data artifacts for the Step-Level Diagnostic Engine (SLDE) framework and the Independent Edge Evaluation (IEE) methodology, so that the empirical results presented in the manuscript can be reproduced in full. The dataset consists of three primary components structured to support process-oriented assessment space evaluation.

The first component is the ASSISTments 2009-2010 Dataset (stored in skill_builder_data.csv). This file contains the publicly available raw sequential tracking data from the benchmark ASSISTments 2009-2010 skill-builder corpus. To isolate multi-step cognitive structures, the replication script applies strict preprocessing rules directly to this raw file. For main tasks (original=1), records with null skill names or non-binary correctness are discarded. Sub-step scaffold rows (original=0) are retained and programmatically inherit latent dependency labels from the parent main task via the assistment_id. The final filtering step yields a restricted cohort of exactly M = 669 unique learners, structured to enforce zero temporal data leakage during downstream sequence modeling.

The second component is the Prerequisite Graph Topology (stored in assistments_prereq_edges.sql). This file provides the explicit graph-theoretic formalization of the assessment space used to evaluate cross-skill transfer. The scripts parse it directly with regular expressions. It contains a mapping of 110 unique Knowledge Components (KCs), directed prerequisite edges representing cognitive transitions, and a set of 857 pre-enumerated valid pedagogical paths ordered longest-first, which act as the ground-truth Directed Acyclic Graph (DAG) templates.

The third component is the Complete Source Code Package (stored in IEE_DAG_Final_Submit.ipynb). This Python replication notebook contains the complete implementation pipeline. It includes an environment setup step that pins sympy==1.13.1 to resolve compatibility conflicts with PyTorch 2.x optimizers. It provides the exact mathematical architecture for the proposed IEE-BKT model (implementing localized evidence vectors and K-step sequential Bayesian updates) along with re-implemented baselines: Standard BKT, LSTM-based Deep Knowledge Tracing (DKT), and Self-Attentive Knowledge Tracing (SAKT). Finally, it executes a strict student-level 10-fold cross-validation routine to control for intra-student tracking bias.

Researchers can use these files to replicate the evaluation metrics, including paired t-test distributions, RMSE improvements, and ROC/AUC plots, where IEE-BKT reaches a baseline performance of AUC = 0.5514.
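The preprocessing rules for the first component can be illustrated with a minimal sketch. It is not the notebook's code: the function names are hypothetical, the column names (original, skill_name, correct, assistment_id) are assumed to match the headers of skill_builder_data.csv, and keeping a scaffold row unchanged when its parent main task is missing is an assumption, since the description does not cover that case.

```python
def _to_number(value):
    """Return value as a float, or None if it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def preprocess_rows(rows):
    """Apply the main-task / scaffold rules to an iterable of row dicts.

    Hypothetical helper: rows are dicts of strings, e.g. from csv.DictReader.
    """
    rows = [dict(r) for r in rows]  # copy so the caller's rows are not mutated

    def is_valid_main(r):
        # A main task needs a non-empty skill name and a 0/1 correctness value.
        has_skill = bool((r.get("skill_name") or "").strip())
        return has_skill and _to_number(r.get("correct")) in (0.0, 1.0)

    # Pass 1: record the skill label of each valid main task by assistment_id.
    parent_skill = {}
    for r in rows:
        if _to_number(r.get("original")) == 1 and is_valid_main(r):
            parent_skill.setdefault(r["assistment_id"], r["skill_name"])

    # Pass 2: filter main tasks and let scaffold rows inherit the parent label.
    cleaned = []
    for r in rows:
        flag = _to_number(r.get("original"))
        if flag == 1:
            if is_valid_main(r):
                cleaned.append(r)
        elif flag == 0:
            inherited = parent_skill.get(r["assistment_id"])
            if inherited is not None:
                r["skill_name"] = inherited
            # Assumption: a scaffold row without a valid parent is kept as-is.
            cleaned.append(r)
    return cleaned
```

The rows could be read with csv.DictReader from the standard library. Counting the distinct learner identifiers in the result would then show whether the sketch reproduces the reported cohort size, which is not guaranteed, because the notebook may apply further filters.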

Files

Steps to reproduce

1. Environment preparation: Ensure a standard Python 3.x environment is available, preferably via Google Colab or a local Jupyter Notebook server. Install the exact external library dependencies by executing the following commands. Pinning SymPy is required to avoid known initialization conflicts between older PyTorch 2.x optimizers and newer SymPy versions:

    pip install pyBKT --quiet
    pip install sympy==1.13.1 --quiet

2. Data and graph alignment: Download the raw public dataset 'skill_builder_data.csv' (ASSISTments 2009-2010 skill-builder corpus) and the prerequisite graph topology file 'assistments_prereq_edges.sql' provided in this Mendeley Data repository. Place both files into your active working directory or upload them to your personal Google Drive storage.

3. Path configuration within the notebook: Open the replication notebook 'IEE_DAG_Final_Submit.ipynb'. Navigate directly to "Section 1 - Data Loading and Prerequisite Graph Parsing". Locate the path variable block and modify the hardcoded strings to match your runtime environment. For example, if using local Jupyter or direct Colab file uploads, update to:

    DATA_PATH = "skill_builder_data.csv"
    SQL_PATH = "assistments_prereq_edges.sql"

   If using Google Drive mounting, authenticate the runtime and adjust the folder directory paths accordingly.

4. Execution and verification: Execute all notebook cells sequentially from top to bottom (Cell > Run All). Section 1 will automatically filter the raw corpus under strict rules (cleaning main tasks where original=1 and retaining scaffold rows where original=0) to construct the restricted cohort of exactly M = 669 unique learners. The script will parse the 110 Knowledge Components and 857 valid pedagogical paths from the SQL file via regular expressions. The pipeline will run a student-level 10-fold cross-validation protocol to eliminate temporal data leakage, training BKT, IEE-BKT, DKT, and SAKT head-to-head. Monitor the final summary outputs to verify and reproduce the paired t-test distributions, RMSE improvements, and the baseline performance of AUC = 0.5514 displayed in Table 5 of the manuscript.
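For readers who want to check the fold logic outside the notebook, the following is a minimal sketch of a student-level 10-fold split. It is an illustration under stated assumptions, not the notebook's implementation: the function names, the user_id key, the fixed seed, and the round-robin assignment are all hypothetical choices. The property it demonstrates is that every interaction of a given learner falls in the same fold, so no learner appears in both the training and the test partition.

```python
import random


def student_level_folds(student_ids, n_folds=10, seed=0):
    """Map each unique student id to a fold index in [0, n_folds).

    Hypothetical helper: the seed and round-robin dealing are assumptions.
    """
    unique_ids = sorted(set(student_ids))  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    # Deal students round-robin so fold sizes differ by at most one.
    return {sid: i % n_folds for i, sid in enumerate(unique_ids)}


def split_rows(rows, fold_of, test_fold, id_key="user_id"):
    """Split interaction rows into (train, test) by each row's student fold."""
    train = [r for r in rows if fold_of[r[id_key]] != test_fold]
    test = [r for r in rows if fold_of[r[id_key]] == test_fold]
    return train, test


if __name__ == "__main__":
    # Toy data: 25 synthetic students with 3 interactions each.
    rows = [{"user_id": f"s{i}", "correct": "1"} for i in range(25) for _ in range(3)]
    fold_of = student_level_folds(r["user_id"] for r in rows)
    for k in range(10):
        train, test = split_rows(rows, fold_of, test_fold=k)
        train_ids = {r["user_id"] for r in train}
        test_ids = {r["user_id"] for r in test}
        assert train_ids.isdisjoint(test_ids)  # no learner on both sides
```

Splitting on learners rather than on individual rows is what keeps one learner's later responses from informing predictions about that learner's earlier ones. scikit-learn's GroupKFold offers a comparable grouping guarantee if that library is available.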

Categories

Computer Science, Artificial Intelligence, Education

Licence