Multi-Label Retinal Diseases (MuReD) Dataset
Abstract: Early detection of retinal diseases is one of the most important means of preventing partial or permanent blindness in patients. A major obstacle to manual retinal examination is the shortage of qualified medical personnel per capita to diagnose diseases. Computer-aided diagnosis (CAD) systems have proven very effective in helping physicians reduce the time taken to make a diagnosis and minimize variability in image interpretation. Still, they are not flexible enough to accommodate the simultaneous presence of multiple retinal diseases, which is a common situation in real-world applications. In recent years, a few datasets have been proposed that focus on classifying multiple retinal pathologies present at the same time, i.e., multi-label classification, but they share several problems, such as a narrow range of pathologies to classify, a high level of class imbalance, a small number of samples for the underrepresented labels, and no assurance of image quality, among others. These problems hinder the performance of any model trained on these datasets, leading to poor robustness, lack of generalization, and reduced trustworthiness of its predictions. To address these problems, we constructed the Multi-Label Retinal Diseases (MuReD) dataset, using images collected from three different state-of-the-art sources, i.e., the ARIA, STARE, and RFMiD datasets, and performing a sequence of post-processing steps to ensure the quality of the images, a wide range of diseases to classify, and a sufficient number of samples per disease label. The MuReD dataset consists of 2208 images with 20 different labels. Image quality and resolution vary, but the post-processing ensures a minimal degree of quality in the data and a sufficient number of samples per label.
To the best of our knowledge, the MuReD dataset is the only publicly available dataset that applies a sequence of post-processing steps to ensure the quality of the images, the variety of pathologies, and the number of samples per label, resulting in increased data quality and a significant reduction of the class imbalance present in other publicly available datasets. It is envisaged that the MuReD dataset will enable the creation of more robust, general, and trustworthy models for the automatic detection and classification of retinal diseases.
Files Description:
1. The file "train_data.csv" lists the images in the training set, along with their annotations for the 20 different labels.
2. The file "val_data.csv" lists the images in the validation set, along with their annotations for the 20 different labels.
3. The folder "images" contains all the images that compose the MuReD dataset. The images come in two different formats, i.e., .tiff and .png. There is no single image resolution: because the images come from different sources, resolution can vary from 520x520 to 3400x2800 depending on the source of the image.
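As an illustration of how the files above might be used, the following minimal Python sketch counts the positive samples per label in one of the CSV files and looks up an image in either of the two formats. The CSV layout is an assumption, not something stated in this description: the sketch assumes the first column holds the image identifier and every remaining column is a 0/1 indicator for one label. The function names and the label names in the usage note are hypothetical.

```python
import csv
import io
from pathlib import Path


def count_labels(csv_text):
    """Count positive samples per label in CSV text.

    ASSUMPTION (not confirmed by the dataset description): the first
    column is the image identifier and each remaining column is a 0/1
    indicator for one label.
    Returns (counts_by_label, number_of_rows).
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    label_names = header[1:]
    counts = {name: 0 for name in label_names}
    n_rows = 0
    for row in reader:
        if not row:  # skip blank lines
            continue
        n_rows += 1
        for name, value in zip(label_names, row[1:]):
            counts[name] += int(float(value))
    return counts, n_rows


def find_image(images_dir, image_id):
    """Return the path of an image, trying .png and then .tiff.

    Returns None if neither file exists. Naming files as
    "<image_id><extension>" is also an assumption.
    """
    for ext in (".png", ".tiff"):
        candidate = Path(images_dir) / f"{image_id}{ext}"
        if candidate.exists():
            return candidate
    return None
```

With the real files, `count_labels(Path("train_data.csv").read_text())` would give a quick view of the per-label sample counts, and therefore of the remaining class imbalance, provided the layout assumption holds. A made-up input such as `"ID,LABEL_A,LABEL_B\nimg1,1,0\nimg2,1,1\n"` shows the expected shape of the data.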
Steps to reproduce
All images were collected from the sources below and then subjected to different post-processing cleaning steps:
1. ARIA dataset (http://www.damianjjfarnell.com/?page_id=276)
2. STARE dataset (https://cecas.clemson.edu/~ahoover/stare/)
3. RFMiD dataset (https://dx.doi.org/10.21227/s3g7-st65)