A Fully Synthetic Textual Dataset of Student Learning Habits and Preferences Generated Using a Large Language Model

Name: A Fully Synthetic Textual Dataset of Student Learning Habits and Preferences Generated Using a Large Language Model
Creator: Mehedi Hasan
Published: 2026-01-15T21:02:38.276Z
Keywords: Computer Science, Education

Hasan, Mehedi

doi:10.17632/fysyzdknsk.3

Published: 15 January 2026| Version 3 | DOI: 10.17632/fysyzdknsk.3

Contributor:

Mehedi Hasan

Description

This repository contains a fully synthetic textual dataset generated using a large language model (GPT-4 family). The dataset simulates fictional student learning habits, preferences, challenges, and opinions related to online education. No real individuals, surveys, or proprietary data sources were used to create it. It is intended for research benchmarking, natural language processing (NLP), educational data mining, survey analysis, and machine learning experimentation.

Steps to reproduce

The dataset was generated entirely with a ChatGPT-style large language model. First, a survey schema and controlled vocabularies were defined for all eight columns: respondent_id, education_level, study_hours_per_day, preferred_learning_method, main_learning_challenge, motivation_level, online_learning_opinion, and device_used_for_study. A structured prompt template then instructed the LLM to generate realistic but entirely fictional student profiles. Categorical fields were randomly sampled from the predefined vocabularies, numerical values were constrained to set ranges, and textual opinions were varied for diversity.

Validation checked for unique respondent IDs, no duplicate combinations, and consistent data formats. The resulting 10,000 synthetic records were exported as a CSV file (synthetic_student_learning_dataset_10000.csv) and accompanied by a README and data dictionary.

Other researchers can use the same schema, prompt strategy, and controlled vocabularies to regenerate an equivalent dataset without any real human data.
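The sampling and validation steps described above can be sketched in Python as follows. This is a minimal illustration, not the author's generation script: the column names come from the schema above, but the vocabulary values, the study-hours range, the ID format, and the canned opinion strings (standing in for LLM-generated text) are all assumptions made for the example. The actual values are defined in the dataset's README and data dictionary.

```python
import csv
import io
import random

# Column names are taken from the dataset description.
COLUMNS = [
    "respondent_id", "education_level", "study_hours_per_day",
    "preferred_learning_method", "main_learning_challenge",
    "motivation_level", "online_learning_opinion", "device_used_for_study",
]

# ASSUMPTION: hypothetical controlled vocabularies, not the dataset's real ones.
VOCAB = {
    "education_level": ["High School", "Undergraduate", "Postgraduate"],
    "preferred_learning_method": ["Video", "Reading", "Practice"],
    "main_learning_challenge": ["Time management", "Distraction", "Connectivity"],
    "motivation_level": ["Low", "Medium", "High"],
    "device_used_for_study": ["Laptop", "Smartphone", "Tablet"],
}
# ASSUMPTION: placeholder strings standing in for LLM-generated opinions.
OPINIONS = [
    "Online classes are flexible but easy to fall behind in.",
    "I learn better when I can replay recorded lectures.",
    "I miss face-to-face discussion with classmates.",
]
# ASSUMPTION: illustrative numeric range for study_hours_per_day.
HOURS_RANGE = (0.5, 10.0)


def generate_records(n, seed=0):
    """Sample n records, skipping any repeated combination of field values."""
    rng = random.Random(seed)
    seen, records = set(), []
    # Note: n must not exceed the number of possible combinations.
    while len(records) < n:
        row = {col: rng.choice(values) for col, values in VOCAB.items()}
        row["study_hours_per_day"] = round(rng.uniform(*HOURS_RANGE), 1)
        row["online_learning_opinion"] = rng.choice(OPINIONS)
        key = tuple(row[c] for c in COLUMNS[1:])
        if key in seen:
            continue  # reject duplicate combinations
        seen.add(key)
        row["respondent_id"] = f"R{len(records) + 1:05d}"  # hypothetical ID format
        records.append(row)
    return records


def validate(records):
    """Check unique IDs, no duplicate combinations, and allowed values."""
    ids = [r["respondent_id"] for r in records]
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate respondent_id")
    combos = {tuple(r[c] for c in COLUMNS[1:]) for r in records}
    if len(combos) != len(records):
        raise ValueError("duplicate combination of field values")
    for r in records:
        for col, allowed in VOCAB.items():
            if r[col] not in allowed:
                raise ValueError(f"{col} outside controlled vocabulary")
        if not HOURS_RANGE[0] <= r["study_hours_per_day"] <= HOURS_RANGE[1]:
            raise ValueError("study_hours_per_day out of range")
    return True


def to_csv(records):
    """Serialize records to CSV text with the schema's column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Calling `validate(generate_records(100))` and then `to_csv(...)` yields a small file with the same column layout as the published CSV. Fixing the seed makes the categorical and numeric sampling repeatable; free-text opinions produced by an LLM would generally not be repeatable in the same way.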

Institutions

Patuakhali Science and Technology University
