UFSC Multi-Cyto: Multi-Stain Cytopathology Image Dataset

Published: 8 July 2025 | Version 1 | DOI: 10.17632/b8bygdxr7r.1
Contributors:
Luís Otávio Santos

Description

The UFSC Multi-Stain Cytopathology Dataset (UFSC Multi-Cyto) is a large, heterogeneous resource for research in cytological image analysis. It contains 17,422 expert-annotated image patches (1200×1600 or 1600×1200 pixels) with a total of 119,874 instance-level annotations, derived from 28 whole-slide images (WSIs) acquired with an Axio Scan.Z1 microscope. The dataset combines three staining protocols, Papanicolaou (oral cavity), AgNOR (cervix), and Feulgen (cervix), and so covers a broad spectrum of nuclear morphologies from two anatomical regions. Annotations were generated in multiple stages by specialist cytopathologists and reviewed for quality assurance. Data are provided in both COCO and YOLO formats and are split into training (12,502 patches), validation (2,448 patches), and test (2,472 patches) sets. The dataset supports the development and evaluation of segmentation models across different cytological stains and clinical scenarios. All data collection was conducted under protocols approved by the UFSC Research Ethics Committee (23193719.5.0000.0121 and 57423616.3.0000.0121).
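
As a quick sanity check after download, the counts quoted above can be compared against the COCO annotation files. The following is a minimal sketch that assumes the standard COCO layout (top-level `images`, `annotations`, and `categories` lists). The function names and the `EXPECTED_SPLIT_SIZES` constant are illustrative and are not part of the dataset, and the path in the usage note is hypothetical.

```python
import json
from collections import Counter

# Split sizes as stated in the dataset description (image patches per subset).
EXPECTED_SPLIT_SIZES = {"train": 12502, "val": 2448, "test": 2472}


def load_coco(path: str) -> dict:
    """Load a COCO-style JSON annotation file from disk."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def summarize_coco(coco: dict) -> dict:
    """Count images, annotations, and annotations per category name."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco.get("categories", [])}
    per_category = Counter(
        id_to_name.get(ann["category_id"], "unknown")
        for ann in coco.get("annotations", [])
    )
    return {
        "images": len(coco.get("images", [])),
        "annotations": len(coco.get("annotations", [])),
        "per_category": dict(per_category),
    }
```

For example, `summarize_coco(load_coco("annotations/train.json"))["images"]` would be expected to equal `EXPECTED_SPLIT_SIZES["train"]`, provided the training annotations are stored at that (assumed) location.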

Steps to reproduce

To reproduce the dataset, start by obtaining the original OCPap (https://data.mendeley.com/datasets/dr7ydy9xbk/2), CCAgT (https://data.mendeley.com/datasets/wg4bpm33hj/1), and Feulgen (https://arquivos.ufsc.br/d/7e7ac2f498df4cf9aa7d/) collections. Digitize all slides using an Axio Scan.Z1 microscope at 0.111 μm per pixel and divide each whole-slide image into patches of 1200×1600 or 1600×1200 pixels. Perform manual annotation for each patch using LabelMe or Labelbox, including a multi-stage review by expert cytopathologists. Clean the data by removing patches without valid segmentations, correcting or excluding corrupt masks, and converting all polygon coordinates to integer values. Identify and remove duplicate images by computing image hashes and comparing metadata, keeping only unique entries. In cases of conflicting annotations for a single image, retain the most recent or consensus annotation. Standardize all class labels and metadata across the dataset. Merge the cleaned datasets and export annotations in both COCO and YOLO formats. Finally, split the unified dataset by patient and stain, preserving class distribution among training, validation, and test subsets.
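
The cleaning, de-duplication, and splitting steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline. The description does not name the image hash, so SHA-256 over the raw file bytes is used here, which only detects byte-identical duplicates. The split fractions, the record fields (`image`, `patient`, `stain`), and all function names are hypothetical. The split keeps each patient within a single subset per stain but does not attempt to balance class distribution.

```python
import hashlib
import random
from collections import defaultdict


def round_polygon(flat_coords):
    """Round a flat COCO polygon [x1, y1, x2, y2, ...] to integer values."""
    return [int(round(value)) for value in flat_coords]


def is_valid_polygon(flat_coords):
    """Require an even-length list with at least three distinct points."""
    if len(flat_coords) % 2 != 0:
        return False
    points = set(zip(flat_coords[0::2], flat_coords[1::2]))
    return len(points) >= 3


def deduplicate(images):
    """Keep the first name seen for each distinct file content.

    `images` is an iterable of (name, raw_bytes) pairs. SHA-256 is an
    assumption; the source only says "image hashes".
    """
    seen, unique = set(), []
    for name, data in images:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(name)
    return unique


def split_by_patient(records, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole patients, within each stain, to train/val/test.

    `records` is a list of dicts with 'image', 'patient', and 'stain' keys.
    The default fractions are illustrative, not the dataset's actual ratios.
    """
    patients_by_stain = defaultdict(set)
    for rec in records:
        patients_by_stain[rec["stain"]].add(rec["patient"])

    rng = random.Random(seed)
    subset_of = {}
    for stain in sorted(patients_by_stain):
        patients = sorted(patients_by_stain[stain])
        rng.shuffle(patients)
        train_end = round(len(patients) * fractions[0])
        val_end = round(len(patients) * (fractions[0] + fractions[1]))
        for index, patient in enumerate(patients):
            if index < train_end:
                subset_of[(stain, patient)] = "train"
            elif index < val_end:
                subset_of[(stain, patient)] = "val"
            else:
                subset_of[(stain, patient)] = "test"

    splits = {"train": [], "val": [], "test": []}
    for rec in records:
        splits[subset_of[(rec["stain"], rec["patient"])]].append(rec["image"])
    return splits
```

Grouping by patient before assigning subsets prevents patches from the same patient appearing in both training and evaluation data. Preserving class distribution as well would require a stratified variant of this grouping, which is not shown here.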

Institutions

Universidade Federal de Santa Catarina

Categories

Cytopathology, Gynecologic Cytology, Papanicolaou Screening

Licence