A raw and CIELAB-enhanced dataset of sputum smear microscopy images for Mycobacterium tuberculosis detection

Published: 8 April 2026 | Version 3 | DOI: 10.17632/34gymtj5yc.3
Contributors:
,
,
,
, Edio da Costa

Description

This dataset comprises 1,338 sputum smear microscopy images with 11,248 ground truth bounding box labels for the detection of Mycobacterium tuberculosis. The clinical specimens were sourced from Dr. Mohamad Soewandhi Regional General Hospital and Surabaya Pulmonary Hospital, Indonesia, and were imaged using a standard optical microscope equipped with a Hayear digital microscope camera. To address common challenges in microscopic imaging, such as uneven background illumination, dust artifacts, and spatial noise, the dataset is provided in two versions to support different experimental setups.
Raw Dataset: Contains the original images captured directly from the microscope, preserving the raw illumination and color characteristics of the stained slides.
Processed CIELAB (Enhanced) Dataset: Contains images that have undergone a computational enhancement pipeline. The pipeline comprises spatial noise reduction with a median filter (3x3 kernel), illumination equalization via Contrast Limited Adaptive Histogram Equalization (CLAHE) applied to the 'L' channel, and a channel substitution in which the original 'Blue' channel is replaced by the 'a' channel from the CIELAB color space. This substitution is intended to increase the visual separation between the bacilli pigments and the background.
Data Structure & Format: Both the raw and enhanced datasets are divided into train and val (validation) subfolders so that they can be used for model training without further splitting. All annotations are provided as text files (.txt) in the standard YOLO bounding box format, one object per line with normalized coordinates: class_id x_center y_center width height. The object class ID for Mycobacterium tuberculosis is 0.
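As an illustration of the annotation format, the following minimal sketch converts YOLO label lines (class_id x_center y_center width height, all normalized to the image size) into pixel-space corner coordinates. The names PixelBox, yolo_line_to_pixels, and read_yolo_labels are hypothetical helpers introduced here, not part of the dataset, and the image width and height must be supplied by the caller because the image resolution is not stated on this page.

```python
from typing import NamedTuple


class PixelBox(NamedTuple):
    """A bounding box in pixel coordinates (hypothetical helper type)."""
    class_id: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def yolo_line_to_pixels(line, img_w, img_h):
    """Convert one YOLO label line to pixel-space corners.

    The line holds: class_id x_center y_center width height,
    with the four coordinates normalized to the range 0..1.
    """
    class_id, xc, yc, w, h = line.split()
    # Scale normalized values by the image size supplied by the caller.
    xc, w = float(xc) * img_w, float(w) * img_w
    yc, h = float(yc) * img_h, float(h) * img_h
    return PixelBox(int(class_id), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)


def read_yolo_labels(text, img_w, img_h):
    """Parse the full contents of a label file, skipping blank lines."""
    return [
        yolo_line_to_pixels(ln, img_w, img_h)
        for ln in text.splitlines()
        if ln.strip()
    ]
```

For example, for an assumed 800 x 600 image, yolo_line_to_pixels("0 0.5 0.5 0.25 0.5", 800, 600) yields class 0 with corners (300.0, 150.0) and (500.0, 450.0). In this dataset every box is expected to carry class ID 0.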
Potential Use Cases: Researchers and developers in computer vision and healthcare diagnostics can utilize this dual-version dataset to build, benchmark, and improve object detection algorithms (such as the YOLO family) for automated tuberculosis screening. Furthermore, it serves as a ready-to-use resource for evaluating how color space transformations affect model robustness against common microscopic imaging artifacts.

Steps to reproduce

1. Extract the provided ZIP files (Hayear_Raw_Dataset.zip and Processed_CIELAB_Dataset.zip).
2. The images are already organized into 'train' and 'val' splits with synchronized YOLO format annotations.
3. To replicate the exact enhancement process applied to the raw images, run the included Python script 'preprocessing_cielab.py' in the same directory as the raw dataset.

Categories

Computer Science
