SSC-BanglaTutor: A Curriculum-Aligned Bengali Dataset for Intelligent Tutoring Systems

Published: 27 October 2025 | Version 2 | DOI: 10.17632/krn9bzypsn.2
Contributors:
[...], Dipta Gomes

Description

This dataset comprises a Bengali-language educational corpus specifically curated to support the fine-tuning and evaluation of AI-driven, hint-based tutoring systems aligned with the Secondary School Certificate (SSC) science curriculum of Bangladesh. It contains a total of 11,286 structured question–answer–hint entries, distributed across three core science subjects:

- Biology: 4,859 entries (14 chapters)
- Chemistry: 3,034 entries (12 chapters)
- Physics: 3,393 entries (14 chapters)

Each entry includes:

- A question written in Bengali
- Five progressively ranked hints guiding learners from general to specific concepts
- A convergence metric estimating the probability of a correct response at each hint
- Correct and distractor answers based on common student misconceptions
- Curriculum-aligned topic tags mapped to the SSC syllabus

All data are encoded in UTF-8 JSON Lines (.jsonl) format, ensuring compatibility with Bengali NLP tools and large-scale AI training pipelines. The dataset’s structured design supports personalized feedback, enabling adaptive learning, retrieval-augmented generation (RAG), and fine-tuning of large language models (LLMs) for education in low-resource languages.
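As a rough illustration of how entries in this format might be read and sanity-checked, the Python sketch below parses JSON Lines records and verifies the properties stated in the description (five hints, one convergence probability per hint). The field names (`subject`, `question`, `hints`, `convergence`, `correct_answer`, `distractors`, `topic`) and the two sample records are assumptions made up for this example; they are not taken from the dataset files, so the actual keys should be checked before use.

```python
import json
from collections import Counter

# ASSUMPTION: all field names and both sample records are hypothetical
# placeholders for illustration; the real dataset schema may differ.
SAMPLE_JSONL = """\
{"subject": "Physics", "question": "বলের একক কী?", "hints": ["hint 1", "hint 2", "hint 3", "hint 4", "hint 5"], "convergence": [0.2, 0.35, 0.5, 0.7, 0.9], "correct_answer": "নিউটন", "distractors": ["জুল", "ওয়াট"], "topic": "Force"}
{"subject": "Biology", "question": "কোষের শক্তিঘর কোনটি?", "hints": ["hint 1", "hint 2", "hint 3", "hint 4", "hint 5"], "convergence": [0.1, 0.3, 0.55, 0.75, 0.95], "correct_answer": "মাইটোকন্ড্রিয়া", "distractors": ["রাইবোজোম", "নিউক্লিয়াস"], "topic": "Cell"}
"""

EXPECTED_HINTS = 5  # the description states five ranked hints per entry


def load_entries(lines):
    """Parse an iterable of JSON Lines strings into validated dicts."""
    entries = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        entry = json.loads(line)
        hints = entry["hints"]
        convergence = entry["convergence"]
        if len(hints) != EXPECTED_HINTS:
            raise ValueError(f"line {line_no}: expected {EXPECTED_HINTS} hints")
        if len(convergence) != len(hints):
            raise ValueError(f"line {line_no}: one convergence value per hint")
        if not all(0.0 <= p <= 1.0 for p in convergence):
            raise ValueError(f"line {line_no}: convergence must be in [0, 1]")
        entries.append(entry)
    return entries


entries = load_entries(SAMPLE_JSONL.splitlines())
per_subject = Counter(entry["subject"] for entry in entries)
print(len(entries), dict(per_subject))
```

Because `load_entries` accepts any iterable of lines, the same function could be applied to a file object opened with `encoding="utf-8"`, which matters for preserving the Bengali text.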

Categories

Computer Science, Artificial Intelligence, Education, Natural Language Processing, Intelligent Tutoring System, Large Language Model

Licence