UFSC OCPap: Papanicolaou Stained Oral Cytology Dataset (v4)
The UFSC OCPap dataset comprises 9,797 labeled images of 1200x1600 pixels acquired from slides of oral brush samples: 5 from patients diagnosed with cancer and 3 from healthy patients, all from distinct patients. The slides were provided by the Hospital Dental Center of the University Hospital Professor Polydoro Ernani de São Thiago of the Federal University of Santa Catarina (HU-UFSC), and this research was approved by the UFSC Research Ethics Committee (CEPSH), protocol number 23193719.5.0000.0121. All patients were approached beforehand and informed about the study objectives. Those who agreed to participate signed an Informed Consent Form.
Steps to reproduce
The slides were prepared and stained using the conventional Papanicolaou technique and captured with an Axio Scan.Z1 microscope and a Hitachi HV-F202SCL camera. Each slide yielded an image of 214,000x161,000 pixels (0.111 μm x 0.111 μm per pixel), which we divided into the tiles (or patches) that make up the dataset. Five specialists labeled these images using the LabelMe and LabelBox tools, and an experienced pathologist reviewed 35% of the annotations. We split the dataset into three subsets, attempting to maintain the dataset class proportions: ~70% for training (4,745 images), ~15% for validation (976 images), and ~15% for test (1,032 images). The labeling process generated a mask for each image in each subset, with pixels labeled as "background", "abnormal epithelial nucleus", "healthy epithelial nucleus", "out of focus nucleus", "blood cell nucleus", "reactive cell nucleus", or "dividing nucleus". For binary classification, we converted the masks into two classes, keeping "background" as "background" and merging all other classes into "nucleus". For each marked nucleus, we extracted one 256x256-pixel image centered on the nucleus, with its corresponding label, to test classification models, and a set of labeled bounding boxes to test object detection models.
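The binary conversion and the centered crop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to build the dataset: the numeric class indices, the function names to_binary_mask and centered_crop_box, the reading of 1200x1600 as height x width, and the policy of shifting the window inward at image borders are all hypothetical.

```python
# Hypothetical class indices: the description above lists the class names,
# but this numeric ordering is an assumption made for illustration.
CLASSES = [
    "background",
    "abnormal epithelial nucleus",
    "healthy epithelial nucleus",
    "out of focus nucleus",
    "blood cell nucleus",
    "reactive cell nucleus",
    "dividing nucleus",
]
BACKGROUND = CLASSES.index("background")


def to_binary_mask(mask):
    """Map a multiclass mask (rows of class indices) to 0 = background, 1 = nucleus."""
    return [[0 if label == BACKGROUND else 1 for label in row] for row in mask]


def centered_crop_box(cx, cy, image_w, image_h, size=256):
    """Return (left, top, right, bottom) of a size x size window centered on (cx, cy).

    The window is shifted inward when it would cross the image border; this
    border policy is an assumption, not something stated in the description.
    """
    if size > image_w or size > image_h:
        raise ValueError("crop size exceeds image dimensions")
    left = min(max(cx - size // 2, 0), image_w - size)
    top = min(max(cy - size // 2, 0), image_h - size)
    return left, top, left + size, top + size
```

For a tile assumed to be 1600 pixels wide and 1200 pixels high, centered_crop_box(800, 600, 1600, 1200) gives a window that can be passed to an image library's crop routine, and to_binary_mask can be applied to the matching mask region to obtain the two-class labels.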