Balanced and Augmented Version of the HAM10000 Skin Lesion Dataset (Derived & Corrected)
Description
Overview
The HAM10000 (“Human Against Machine with 10,000 training images”) dataset is one of the most widely used collections for skin lesion classification. It contains 10,015 dermatoscopic images categorized into seven diagnostic classes: Melanocytic Nevi (NV), Melanoma (MEL), Benign Keratosis (BKL), Basal Cell Carcinoma (BCC), Actinic Keratoses (AKIEC), Vascular Lesions (VASC), and Dermatofibroma (DF).
A major challenge in the original HAM10000 dataset is its highly imbalanced class distribution. The NV class alone makes up about 67% of all samples, while the minority classes DF and VASC together represent less than 3%. This imbalance leads to biased models that perform well on common classes but poorly on rare lesions. Many researchers have tried to fix this by heavily upsampling the small classes (for example, expanding a 150-image class to more than 1,000 samples), which often leads to overfitting and to unrealistic training data.
This derived version was created to offer a balanced and scientifically responsible alternative. It combines undersampling of the large classes with controlled augmentation of the small ones. A target of roughly 500–650 training samples per class was selected to maintain fairness while preserving data diversity. Larger classes such as NV, MEL, and BKL were undersampled to around 500 training samples each to prevent majority dominance. Smaller classes such as AKIEC, DF, and VASC were augmented carefully using realistic transformations: random rotations (±30°), horizontal and vertical flips, scaling (0.8–1.2×), brightness and contrast adjustments (±20%), and mild Gaussian noise. The aim was to avoid inflating or distorting any class beyond what its real images can reasonably support.
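The balancing strategy described above can be sketched in a few lines of Python. This is a minimal illustration, not the script used to build the dataset: the function names `plan_resampling` and `sample_augmentation_params` are hypothetical, the 50% flip probability and the noise level are assumptions (the description only says "mild" noise), and the class sizes in the usage lines are made-up inputs. Only the parameter ranges (±30° rotation, 0.8–1.2× scaling, ±20% brightness and contrast) come from the description.

```python
import random

def plan_resampling(n_real, target):
    """Return (real images kept, augmented images to generate) for one class."""
    if n_real >= target:
        # Large class: undersample by keeping a random subset of real images.
        return target, 0
    # Small class: keep every real image and synthesize the shortfall.
    return n_real, target - n_real

def sample_augmentation_params(rng):
    """Draw one random set of transform parameters within the stated ranges."""
    return {
        "rotation_deg": rng.uniform(-30.0, 30.0),   # random rotation, +/-30 degrees
        "hflip": rng.random() < 0.5,                # assumed 50% flip probability
        "vflip": rng.random() < 0.5,                # assumed 50% flip probability
        "scale": rng.uniform(0.8, 1.2),             # scaling factor 0.8-1.2x
        "brightness": rng.uniform(0.8, 1.2),        # +/-20% brightness
        "contrast": rng.uniform(0.8, 1.2),          # +/-20% contrast
        "noise_sigma": rng.uniform(0.0, 0.02),      # assumed "mild" noise level
    }

rng = random.Random(0)  # fixed seed so the sampled parameters are reproducible
print(plan_resampling(6000, 500))   # (500, 0): undersample a large class
print(plan_resampling(120, 500))    # (120, 380): augment a small class
print(sample_augmentation_params(rng))
```

The sampled dictionary would then be passed to whichever image library performs the actual transforms. Keeping the plan separate from the image operations makes it easy to check how many synthetic images each class receives before any are generated.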
The final dataset structure is as follows:
AKIEC: Train 654, Test 150
BCC: Train 500, Test 150
BKL: Train 500, Test 150
DF: Train 537, Test 115
MEL: Train 500, Test 150
NV: Train 500, Test 150
VASC: Train 568, Test 142
These numbers give a nearly uniform class distribution without losing important image diversity. The slight variations between class sizes (500 to 654 training samples) are intentional and defensible: they preserve genuine data from minority classes while preventing excessive synthetic augmentation. Forcing all classes to an identical count would either remove valuable real samples or add too many artificial images, which would reduce model generalization.
This version is intended as a fair, realistic, and reproducible dataset for training skin lesion classification models. It is designed to reduce overfitting, improve class-level balance, and support better generalization. Researchers can use it to evaluate fairness and robustness in medical image classification tasks.
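As a quick sanity check, the per-class counts listed above can be summed and compared in Python. This is only a sketch: the dictionaries copy the counts from the list, and the helper name `imbalance_ratio` is hypothetical.

```python
# Per-class counts copied from the dataset structure listed above.
train = {"AKIEC": 654, "BCC": 500, "BKL": 500, "DF": 537,
         "MEL": 500, "NV": 500, "VASC": 568}
test = {"AKIEC": 150, "BCC": 150, "BKL": 150, "DF": 115,
        "MEL": 150, "NV": 150, "VASC": 142}

def imbalance_ratio(counts):
    """Largest class size divided by smallest class size (1.0 = perfectly balanced)."""
    return max(counts.values()) / min(counts.values())

total_train = sum(train.values())
total_test = sum(test.values())

print(total_train, total_test)              # 3759 1007
print(round(imbalance_ratio(train), 3))     # 1.308

# Share of the training set held by each class, as a percentage.
for name, n in train.items():
    print(f"{name}: {100 * n / total_train:.1f}%")
```

A ratio close to 1 between the largest and smallest training class is what "nearly uniform" means here; in the original HAM10000 distribution, NV alone accounts for about 67% of all images.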
Steps to reproduce
Attribution: This dataset is a derived work based on the original HAM10000 dataset: Tschandl, P., Rosendahl, C., & Kittler, H. (2018). The HAM10000 dataset: A large collection of multi-source dermatoscopic images of common pigmented skin lesions. Harvard Dataverse. DOI: 10.7910/DVN/DBW86T. This version is released under the same CC BY-NC 4.0 license, and any use must credit the original authors.