APCData cervical cytology cells

Published: 20 June 2024| Version 1 | DOI: 10.17632/ytd568rh3p.1
Virginia Eva Pachiarotti


We present a dataset of cervical cytology images developed in collaboration with the Anatomical Pathology and Cytology laboratory located in Rivera, Uruguay. The set includes 425 high-quality microscope field images, with cells labeled in 6 classes corresponding to the Bethesda System for reporting cervical cytology (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9168399/): NILM (Negative for Intraepithelial Lesion or Malignancy), ASC-US (Atypical Squamous Cells of Undetermined Significance), ASC-H (Atypical Squamous Cells, cannot exclude HSIL), LSIL (Low-grade Squamous Intraepithelial Lesion), HSIL (High-grade Squamous Intraepithelial Lesion), and SCC (Squamous Cell Carcinoma). Each cell is labeled twice: with a bounding box and with a point marking its nucleus.


Steps to reproduce

The set consists of 425 images of 2048 x 1532 pixels, drawn from 73 diagnosed Pap smear studies performed between 2018 and 2021 using liquid-based cytology prepared by cytocentrifugation. The images were obtained using an Olympus CX40RF100 microscope with a 40x objective lens, a 10x eyepiece, and an Olympus LC30 Optical Microscope camera. They were digitally processed using Olympus L.Cmicro software version 2.2 (2017).

Cells were labeled in two stages, using the 6 categories defined by the Bethesda system: NILM, ASC-US, ASC-H, LSIL, HSIL, and SCC.

The first stage used the image annotation tool LabelImg (https://github.com/HumanSignal/labelImg) to draw a bounding box around each cell and save it in the format used by the YOLO convolutional neural network architecture. Each image has a corresponding .txt label file in which each line describes one cell and contains the following data: class, x_center, y_center, width, and height. The data are available in the APCData_YOLO folder, organized as follows: images (APCData_YOLO\images), labels (APCData_YOLO\labels). Additionally, a .txt file with the description of classes is available at APCData_YOLO\classes.txt.

The second stage used the web platform CRIC (https://playground.database.cric.com.br/), where for each cell annotated in the previous stage, a point was marked at the center of the corresponding nucleus. The system generates two files for each image, a .csv and a .json. The relevant fields are: .csv file {bethesda_system, nucleus_x, nucleus_y}, and .json file {"bethesda_system": " ", "nucleus_x": , "nucleus_y": }. The data are available in the APCData_points folder, organized as follows: images (APCData_points\images), .csv labels (APCData_points\labels\csv), and .json labels (APCData_points\labels\json).

A total of 3,619 squamous cells were labeled: 2,114 NILM, 333 ASC-US, 444 LSIL, 182 ASC-H, 421 HSIL, and 125 SCC.
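The two label formats described above can be read with a few lines of Python. The following is a minimal sketch, not code shipped with the dataset. It assumes that the YOLO coordinates are normalized to the range 0 to 1 relative to the image size (the usual convention for YOLO labels written by LabelImg), that the first field of each line is a zero-based index into classes.txt, and that each .csv file has a header row naming the bethesda_system, nucleus_x, and nucleus_y columns. The function names yolo_to_pixel_box and read_nucleus_points are hypothetical.

```python
import csv
import io

# Image size stated in the dataset description (width x height, in pixels).
IMG_W, IMG_H = 2048, 1532


def yolo_to_pixel_box(line, img_w=IMG_W, img_h=IMG_H):
    """Convert one YOLO label line to (class_id, x_min, y_min, x_max, y_max).

    Assumes the line is "class x_center y_center width height" with the
    four coordinates normalized to [0, 1] relative to the image size.
    """
    cls, xc, yc, w, h = line.split()
    xc, w = float(xc) * img_w, float(w) * img_w
    yc, h = float(yc) * img_h, float(h) * img_h
    return int(cls), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2


def read_nucleus_points(csv_text):
    """Return a list of (bethesda_system, nucleus_x, nucleus_y) tuples.

    Assumes comma-separated text with a header row that contains the
    three column names; any other columns are ignored.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["bethesda_system"], float(row["nucleus_x"]), float(row["nucleus_y"]))
        for row in reader
    ]
```

To use it on the real files, read each file in APCData_YOLO\labels line by line and pass the non-empty lines to yolo_to_pixel_box, and pass the full text of each file in APCData_points\labels\csv to read_nucleus_points. If the label files turn out to store pixel coordinates rather than normalized ones, the scaling by image width and height should be dropped.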


Artificial Intelligence, Biomedical Imaging, Gynecologic Cytology