A Comprehensive Raw Dataset of Ziehl-Neelsen Stained Sputum Smear Microscopy Images for Mycobacterium Tuberculosis Detection

Published: 27 April 2026 | Version 5 | DOI: 10.17632/34gymtj5yc.5
Contributors:
,
,
,
, Edio da Costa

Description

This dataset comprises 1,338 raw sputum smear microscopy images with 11,248 manually annotated ground-truth bounding boxes for the detection of Mycobacterium tuberculosis. The clinical specimens were sourced from Dr. Mohamad Soewandhi Regional General Hospital and Surabaya Pulmonary Hospital, Indonesia. To capture natural variability and reduce bias toward a single sensor, the images were acquired using standard optical microscopes equipped with two distinct digital camera systems: Hayear and Optilab.

Microscopic imaging commonly suffers from uneven background illumination, dust artifacts, and spatial noise. Rather than correcting these computationally, the dataset intentionally preserves the raw, unmodified characteristics of the stained slides so that it represents true clinical conditions. A metadata.csv file is provided instead. It categorizes each image by its dominant background color condition (e.g., Greenish, Bluish, Purplish/Pinkish, Yellowish), which results from natural staining thickness and differing camera sensor responses.

Data Structure & Format: The dataset is partitioned into 'train', 'val' (validation), and 'test' subdirectories so that it can be used for machine learning model training without further splitting. All annotations are provided as plain text files (.txt) in the standard YOLO bounding box format, with normalized coordinates: class_id x_center y_center width height. The object class ID for Mycobacterium tuberculosis is 0.

Potential Use Cases: Researchers and developers in computer vision and healthcare diagnostics can use this dataset to build, benchmark, and improve object detection algorithms (such as the YOLO family) for automated tuberculosis screening. Furthermore, the inclusion of two camera sources (Hayear and Optilab) and color metadata makes the dataset a resource for evaluating model robustness and generalizability across varying real-world microscopic imaging conditions.
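
As an illustration of the YOLO label format described above, the following minimal Python sketch converts the normalized values (class_id x_center y_center width height) into pixel-space corner coordinates. The names Box and parse_yolo_labels are hypothetical helpers introduced here and are not part of the dataset; the image width and height must be supplied by the caller, since the label files do not store them.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Pixel-space bounding box (hypothetical helper type, not part of the dataset)."""
    class_id: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def parse_yolo_labels(text: str, img_width: int, img_height: int) -> List[Box]:
    """Convert the contents of one YOLO .txt label file into pixel-space boxes.

    Each non-empty line is expected to hold five whitespace-separated fields:
    class_id x_center y_center width height, with coordinates normalized to [0, 1].
    """
    boxes: List[Box] = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # tolerate blank lines
        if len(parts) != 5:
            raise ValueError(f"line {line_number}: expected 5 fields, got {len(parts)}")
        class_id = int(parts[0])
        x_center, y_center, width, height = (float(p) for p in parts[1:])
        # Scale from normalized units to pixels and move from center to corners.
        boxes.append(
            Box(
                class_id,
                (x_center - width / 2) * img_width,
                (y_center - height / 2) * img_height,
                (x_center + width / 2) * img_width,
                (y_center + height / 2) * img_height,
            )
        )
    return boxes
```

For example, the label line "0 0.5 0.5 0.25 0.5" on an 800 x 600 pixel image corresponds to a class-0 box spanning x from 300 to 500 and y from 150 to 450.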

Steps to reproduce

1. Extract the provided primary ZIP file (e.g., Raw_Sputum_Microscopy_Dataset.zip) containing the raw microscopy images and their annotations.
2. The dataset is organized into 'train', 'val' (validation), and 'test' directories. Within each split, the raw microscopic images (.jpg/.png) are located in the images folder, and their corresponding YOLO-format bounding box annotations (.txt) are located in the labels folder.
3. Refer to the included metadata.csv file to identify the specific imaging conditions for each file. Researchers can match the Image_ID in the CSV to categorize the data by its natural background illumination and color variation (e.g., Greenish, Bluish, Purplish/Pinkish, Yellowish). This allows for stratified model evaluation or for testing custom preprocessing algorithms across different camera sensors (Hayear and Optilab) and staining characteristics.
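
The steps above can be sketched in Python as follows. This is a minimal example under several assumptions that the description does not confirm: each label file is assumed to share its image's base name (the usual YOLO convention), the color column name passed to group_by_column (for instance "Background_Color") is hypothetical and should be replaced with the actual header in metadata.csv, and whether Image_ID includes a file extension is not specified. Only the Image_ID column name is taken from the description.

```python
import csv
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pair_images_with_labels(split_dir: Path) -> List[Tuple[Path, Path]]:
    """Pair each image in <split>/images with the label file in <split>/labels.

    Assumption: the label file has the same base name as its image.
    Images without a label file are skipped here.
    """
    images_dir = split_dir / "images"
    labels_dir = split_dir / "labels"
    pairs: List[Tuple[Path, Path]] = []
    for image_path in sorted(images_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # ignore non-image files
        label_path = labels_dir / (image_path.stem + ".txt")
        if label_path.is_file():
            pairs.append((image_path, label_path))
    return pairs


def group_by_column(
    metadata_csv: Path, column: str, id_column: str = "Image_ID"
) -> Dict[str, List[str]]:
    """Group image IDs by the value of one metadata column.

    `column` is whichever header holds the background color category;
    its real name must be read from metadata.csv (assumption).
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    with metadata_csv.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            groups[row[column]].append(row[id_column])
    return dict(groups)
```

Calling pair_images_with_labels on the extracted 'test' directory and then filtering the pairs by the ID lists returned from group_by_column would give one evaluation subset per background color condition, which is one way to set up the stratified evaluation mentioned in step 3.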

Categories

Computer Science
