Sleep and Awake

Published: 5 August 2025 | Version 1 | DOI: 10.17632/x9hk67h4g6.1
Contributor:
rashug Rashid

Description

Research Hypothesis

The study hypothesizes that the distribution of data across training, validation, and test sets, categorized by awake and sleep states, is balanced enough to support robust model performance across physiological conditions. It assumes the training set will dominate, with proportional allocations to the validation and test sets, and that the awake and sleep categories will reflect their natural or experimental occurrence.

Data Description

The dataset totals 2522 samples, split into training (2090 samples), validation (216 samples), and test (216 samples) sets. Each split is categorized into awake and sleep states, with the following percentages of the full dataset:

  • Train - Awake: 46.2%
  • Train - Sleep: 36.6%
  • Validation - Sleep: 4.6%
  • Validation - Awake: 3.97%
  • Test - Sleep: 4.6%
  • Test - Awake: 3.97%

Data Collection

The data likely originates from universe.roboflow.com and is labelled for classifying awake and sleep states. The dataset was divided with the training set at roughly 83% (2090/2522) and the validation and test sets each at 8.6% (216/2522), following a typical 80-10-10 split adjusted to the sample size.

Notable Findings

  • Training Set Dominance: The training set (46.2% awake, 36.6% sleep) comprises 82.9% of the data (2090/2522), emphasizing model training. The higher awake proportion suggests a potential overrepresentation of awake samples.
  • Balanced Validation and Test Sets: Both sets have identical class distributions (4.6% sleep, 3.97% awake each), supporting consistent performance evaluation.
  • Sleep vs. Awake Imbalance: Awake states total 54.14% (46.2% + 3.97% + 3.97%), while sleep states total 45.8% (36.6% + 4.6% + 4.6%), indicating a natural or intentional bias toward awake data.

Interpretation and Use

  • Model Training: The large training set, with more awake data, suggests optimization for awake-state predictions. Researchers should address potential overfitting to the awake class by augmenting sleep data or adjusting class weights.
  • Validation and Testing: The identical validation and test set distributions support reliable model tuning and evaluation, though their small size (8.6% each) may limit edge-case detection.
  • Implications: The awake-sleep imbalance may call for additional sleep data collection, unless it reflects real-world conditions.
  • Practical Use: This distribution suits training machine learning models (e.g., for sleep-state classification) on the training set, tuning hyperparameters on the validation set, and evaluating on the test set. Rebalancing techniques may be needed if sleep accuracy is critical.

This analysis highlights the dataset's structure, guiding its use in predictive modeling while addressing potential biases.
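The split proportions quoted above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch; the per-class counts are back-calculated from the quoted percentages and are therefore approximate, not values published with the dataset:

```python
# Sanity-check the stated split sizes against the quoted percentages.
TOTAL = 2522
splits = {"train": 2090, "valid": 216, "test": 216}

# The three splits should account for every sample.
assert sum(splits.values()) == TOTAL

for name, n in splits.items():
    # train -> 82.9%, valid and test -> 8.6% each
    print(f"{name}: {n} samples = {100 * n / TOTAL:.1f}% of the dataset")

# Approximate per-class training counts implied by the quoted percentages
# (rounding explains the small mismatch with the 2090-sample training set).
train_awake = round(0.462 * TOTAL)  # ~1165
train_sleep = round(0.366 * TOTAL)  # ~923
print(train_awake, train_sleep, train_awake + train_sleep)
```

The ~2-sample gap between 1165 + 923 and 2090 is consistent with the percentages being rounded to one or two decimal places.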
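If sleep accuracy is critical, one common rebalancing option is inverse-frequency class weighting. The sketch below uses the "balanced" heuristic (weight_c = n_samples / (n_classes * n_c), as popularized by scikit-learn's class_weight="balanced"); the per-class counts are back-calculated from the quoted overall percentages (54.14% awake, 45.8% sleep), so the exact numbers are assumptions for illustration:

```python
# Inverse-frequency class weights to counter the awake/sleep imbalance.
# Counts are approximate, derived from the quoted overall percentages.
total = 2522
counts = {"awake": round(0.5414 * total), "sleep": round(0.458 * total)}

n_classes = len(counts)
n_samples = sum(counts.values())

# weight_c = n_samples / (n_classes * n_c): the minority class (sleep)
# gets a weight above 1, the majority class (awake) below 1.
weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}
for c, w in weights.items():
    print(f"{c}: count={counts[c]}, weight={w:.3f}")
```

These weights would typically be passed to a loss function (e.g. a weighted cross-entropy) so that misclassified sleep samples cost proportionally more during training.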

Files

Institutions

  • Soroti University

Categories

Neuroscience, Biomedical Engineering, Machine Learning

Licence