HistoSet-5×14: A Collection of Balanced Multi-Organ Histopathology Datasets
Description
HistoSet-5×14 is a secondary, derived histopathological image collection created by aggregating, standardizing, and rebalancing images from multiple publicly available, peer-reviewed histopathology datasets. The collection covers five organs (breast, colon, lung, oral cavity, and ovary) and comprises fourteen cancerous and non-cancerous tissue classes. It is intended to support multi-class and multi-organ machine learning research in computational pathology.

The images in HistoSet-5×14 originate from four established sources: the Lung and Colon Cancer Histopathological Image Dataset (LC25000) introduced by Borkowski et al. (2019), the Breast Cancer Histopathological Image Classification dataset by Spanhol et al. (2015), the Oral Cancer Histopathological Imaging Database by Rahman et al. (2020), and the Ovarian Cancer Histopathology dataset proposed by Kasture et al. (2021). All source datasets consist of Hematoxylin and Eosin (H&E) stained histopathological images that are de-identified and publicly available for research use.

Because the original datasets exhibit substantial class imbalance and heterogeneous sample sizes, HistoSet-5×14 applies a standardized preprocessing pipeline: controlled data augmentation for under-represented classes and down-sampling of over-represented classes. Each class was normalized to 2,000 images, yielding a total of 28,000 images (14 classes × 2,000). Augmentation was performed conservatively to preserve diagnostically relevant morphological patterns while improving class balance and model robustness.

Source Dataset References:
1. Borkowski, A.A., et al. (2019). Lung and colon cancer histopathological image dataset (LC25000). arXiv:1912.12142.
2. Spanhol, F.A., et al. (2015). A dataset for breast cancer histopathological image classification. IEEE Transactions on Biomedical Engineering, 63(7), 1455–1462.
3. Rahman, T.Y., et al. (2020). Histopathological imaging database for oral cancer analysis. Data in Brief, 29, 105114.
4. Kasture, K.R., et al. (2021). A new deep learning method for automatic ovarian cancer prediction & subtype classification. Turkish Journal of Computer and Mathematics Education, 12(12), 1233–1242.
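
The per-class balancing described above (down-sampling over-represented classes and augmenting under-represented ones to reach 2,000 images each) can be illustrated with a minimal Python sketch. This is not the pipeline used to build HistoSet-5×14: the function name `balance_class`, the random seed, the even cycling over source images, and the example class sizes are all assumptions made for illustration, and the actual augmentation transforms are not specified in the description.

```python
import random

TARGET_PER_CLASS = 2000  # per-class size stated in the description
NUM_CLASSES = 14         # number of tissue classes stated in the description


def balance_class(image_ids, target=TARGET_PER_CLASS, seed=0):
    """Hypothetical helper: plan how one class reaches `target` images.

    Returns (kept_ids, augmentation_sources):
    - over-represented classes are randomly down-sampled to `target`,
      with no augmentation;
    - under-represented classes keep every original image, and
      `augmentation_sources` lists the originals to augment (one new
      image per entry) until the class reaches `target`.
    """
    ids = list(image_ids)
    if not ids:
        raise ValueError("a class needs at least one source image")
    rng = random.Random(seed)
    if len(ids) >= target:
        # Down-sample without replacement.
        return rng.sample(ids, target), []
    deficit = target - len(ids)
    # Shuffle once, then cycle so each original is reused about equally often.
    order = ids[:]
    rng.shuffle(order)
    sources = [order[i % len(order)] for i in range(deficit)]
    return ids, sources


if __name__ == "__main__":
    # Hypothetical class sizes, not taken from the source datasets.
    large_class = [f"large_{i}.png" for i in range(5000)]
    small_class = [f"small_{i}.png" for i in range(600)]

    kept, sources = balance_class(large_class)
    print(len(kept), len(sources))   # 2000 0

    kept, sources = balance_class(small_class)
    print(len(kept), len(sources))   # 600 1400

    print(TARGET_PER_CLASS * NUM_CLASSES)  # 28000
```

Cycling evenly over the originals is one way to keep augmentation conservative, since no single source image dominates the augmented portion of a class; other sampling schemes would satisfy the description equally well.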