A Comprehensive Raw Dataset of Ziehl-Neelsen-Stained Sputum Smear Microscopy Images for Mycobacterium tuberculosis Detection

Published: 23 May 2026 | Version 9 | DOI: 10.17632/34gymtj5yc.9
Contributors:
,
,
,
, Edio da Costa

Description

This dataset provides a comprehensive collection of raw Ziehl-Neelsen-stained sputum smear microscopy images for Mycobacterium tuberculosis bacilli detection. The dataset contains 1,438 raw microscopy images and 11,447 manually annotated bounding-box labels in YOLO format. Images were acquired using two digital microscope camera systems, Hayear and Optilab, to introduce multi-sensor variability and better represent real clinical microscopy conditions.

The dataset intentionally preserves the original raw image characteristics, including natural illumination variation, staining differences, background color shifts, sensor response, and microscopic artifacts. No synthetic image enhancement, color normalization, contrast adjustment, or artificial augmentation was applied to the released images. This makes the dataset suitable for evaluating object detection models under realistic clinical imaging conditions.

A metadata.csv file is provided to support stratified analysis. The metadata includes Image_ID, Data_Split, Background_Color, and Camera_System information. The Background_Color field categorizes images into Yellowish, Purplish/Pinkish, Bluish, and Greenish profiles, while the Camera_System field identifies whether each image belongs to the Hayear or Optilab camera group. These metadata fields enable researchers to analyze model performance across different staining appearances, illumination conditions, and camera-system characteristics.
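For readers unfamiliar with YOLO-format labels, the sketch below shows how one annotation line could be converted to pixel coordinates. It assumes the common YOLO text layout, in which each line holds a class index followed by the box center x, center y, width, and height, all normalized to the range 0 to 1. The function name, the sample line, and the 1000 x 800 image size are hypothetical and are not taken from the dataset.

```python
from typing import NamedTuple


class PixelBox(NamedTuple):
    class_id: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def yolo_line_to_pixel_box(line: str, img_w: int, img_h: int) -> PixelBox:
    """Convert one YOLO label line to pixel corner coordinates.

    Assumed layout: "<class> <x_center> <y_center> <width> <height>",
    with the four box values normalized by image width and height.
    """
    parts = line.split()
    if len(parts) < 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    class_id = int(parts[0])
    xc, yc, w, h = (float(v) for v in parts[1:5])
    return PixelBox(
        class_id,
        (xc - w / 2) * img_w,  # left edge in pixels
        (yc - h / 2) * img_h,  # top edge in pixels
        (xc + w / 2) * img_w,  # right edge in pixels
        (yc + h / 2) * img_h,  # bottom edge in pixels
    )


# Hypothetical label line and image size, for illustration only.
box = yolo_line_to_pixel_box("0 0.5 0.5 0.25 0.5", img_w=1000, img_h=800)
print(box)
```

The actual image dimensions should be read from each image file, since the description does not state a fixed resolution.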

Files

Steps to reproduce

1. Extract the provided primary ZIP file, Raw_Sputum_Microscopy_Dataset.zip, which contains the raw sputum smear microscopy images, YOLO-format annotations, and metadata.csv file.
2. The dataset is organized into training, validation, and testing subsets using the following directory structure: images/train, images/val, images/test, labels/train, labels/val, and labels/test. The raw microscopic images (.jpg/.png) are stored in the images folders, while the corresponding YOLO-format bounding-box annotation files (.txt) are stored in the labels folders. Each annotation file has the same base filename as its corresponding image file, ensuring direct synchronization between images and labels.
3. Open the metadata.csv file to cross-reference each image using the Image_ID field. The metadata file provides the Data_Split, Background_Color, and Camera_System information for each image. The Background_Color field categorizes natural background illumination and staining variation into Yellowish, Purplish/Pinkish, Bluish, and Greenish profiles, while the Camera_System field identifies the corresponding Hayear or Optilab camera-system group.
4. Researchers may use the metadata.csv file to perform stratified model evaluation, hard-example mining, robustness analysis, or custom preprocessing experiments across different background-color profiles and camera-system groups. This allows model performance to be assessed under realistic variations in staining appearance, illumination, and sensor characteristics.
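The steps above can be sketched in Python as follows. This is a minimal illustration, not an official loader: it assumes the archive has already been extracted, that the images and labels folders sit directly under a root directory chosen by the user, and that metadata.csv uses exactly the column names listed above. The function names and the location of metadata.csv within the archive are assumptions.

```python
from __future__ import annotations

import csv
from collections import Counter
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".png"}
SPLITS = ("train", "val", "test")


def pair_images_with_labels(root: Path) -> dict[str, list[tuple[Path, Path | None]]]:
    """Match each image to the label file that shares its base filename.

    Returns, per split, a list of (image_path, label_path) tuples;
    label_path is None when no matching .txt file is found.
    """
    pairs: dict[str, list[tuple[Path, Path | None]]] = {}
    for split in SPLITS:
        image_dir = root / "images" / split
        label_dir = root / "labels" / split
        split_pairs: list[tuple[Path, Path | None]] = []
        if image_dir.is_dir():
            for image_path in sorted(image_dir.iterdir()):
                if image_path.suffix.lower() not in IMAGE_SUFFIXES:
                    continue
                label_path = label_dir / f"{image_path.stem}.txt"
                split_pairs.append(
                    (image_path, label_path if label_path.exists() else None)
                )
        pairs[split] = split_pairs
    return pairs


def count_by_stratum(metadata_csv: Path) -> Counter:
    """Count images per (Camera_System, Background_Color) combination."""
    # utf-8-sig also tolerates a byte-order mark, should the file have one.
    with metadata_csv.open(newline="", encoding="utf-8-sig") as f:
        reader = csv.DictReader(f)
        return Counter(
            (row["Camera_System"], row["Background_Color"]) for row in reader
        )
```

Calling pair_images_with_labels on the extracted root gives a quick check that every image has an annotation file, and count_by_stratum shows how many images fall into each camera and background-color group before a stratified evaluation is set up.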

Institutions

Categories

Computer Science

Licence