A Fully Synthetic Textual Dataset of Student Learning Habits and Preferences Generated Using a Large Language Model
Description
This repository contains a fully synthetic textual dataset generated with a large language model (GPT-4 family). It simulates fictional students' learning habits, preferences, challenges, and opinions about online education. No real individuals, surveys, or proprietary data sources were used to create it. The dataset is intended for research benchmarking, natural language processing (NLP), educational data mining, survey analysis, and machine learning experimentation.
Files
Steps to reproduce
The dataset was generated entirely with a ChatGPT-style large language model. First, the survey schema and controlled vocabularies were defined for all eight columns: respondent_id, education_level, study_hours_per_day, preferred_learning_method, main_learning_challenge, motivation_level, online_learning_opinion, and device_used_for_study. A structured prompt template then instructed the LLM to generate realistic but entirely fictional student profiles. Categorical fields were randomly sampled from the predefined vocabularies, numerical values were constrained to set ranges, and the wording of textual opinions was varied for diversity. Validation checked that respondent IDs were unique, that no combinations were duplicated, and that data formats were consistent. The resulting 10,000 synthetic records were exported as a CSV file (synthetic_student_learning_dataset_10000.csv), accompanied by a README and a data dictionary. With the same schema, prompt strategy, and controlled vocabularies, other researchers can reproduce the dataset without using any real human data.
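The sampling and validation steps can be illustrated with a minimal Python sketch. It is not the authors' generation script. The column names come from the schema above, but the vocabulary values, the study-hours range, the `R00001`-style ID format, and the helper names (`make_record`, `validate`, `to_csv_text`) are all assumptions made for illustration, and the free-text opinion that an LLM would write is left as an empty placeholder.

```python
import csv
import io
import random

# Column order taken from the schema described above.
COLUMNS = [
    "respondent_id", "education_level", "study_hours_per_day",
    "preferred_learning_method", "main_learning_challenge",
    "motivation_level", "online_learning_opinion", "device_used_for_study",
]

# ASSUMPTION: the source does not list the allowed values, so these
# vocabularies are illustrative placeholders only.
VOCAB = {
    "education_level": ["High School", "Undergraduate", "Postgraduate"],
    "preferred_learning_method": ["Video lectures", "Reading", "Practice problems"],
    "main_learning_challenge": ["Time management", "Distractions", "Lack of resources"],
    "motivation_level": ["Low", "Medium", "High"],
    "device_used_for_study": ["Laptop", "Smartphone", "Tablet"],
}

# ASSUMPTION: the actual numeric bounds are not given in the source.
STUDY_HOURS_RANGE = (0.5, 10.0)


def make_record(index, rng):
    """Build one fictional profile by sampling each constrained field."""
    record = {"respondent_id": f"R{index:05d}"}  # hypothetical ID format
    for column, choices in VOCAB.items():
        record[column] = rng.choice(choices)
    record["study_hours_per_day"] = round(rng.uniform(*STUDY_HOURS_RANGE), 1)
    # Placeholder: in the described procedure an LLM writes this text.
    record["online_learning_opinion"] = ""
    return record


def validate(records):
    """Return a list of problems found; an empty list means all checks passed."""
    errors = []
    seen_ids = set()
    low, high = STUDY_HOURS_RANGE
    for record in records:
        rid = record["respondent_id"]
        if rid in seen_ids:
            errors.append(f"duplicate respondent_id: {rid}")
        seen_ids.add(rid)
        for column, choices in VOCAB.items():
            if record[column] not in choices:
                errors.append(f"{rid}: {column} outside vocabulary")
        if not low <= record["study_hours_per_day"] <= high:
            errors.append(f"{rid}: study_hours_per_day out of range")
    return errors


def to_csv_text(records):
    """Serialize records to CSV text with the schema's column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the sampled fields are repeatable
    sample = [make_record(i, rng) for i in range(1, 11)]
    print(validate(sample))
    print(to_csv_text(sample))
```

The sketch checks ID uniqueness, vocabulary membership, and the numeric range. It does not implement the "no duplicate combinations" check, because the description does not say which columns define a combination.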
Institutions
- Patuakhali Science and Technology University