Curated Dataset for COVID-19 Posterior-Anterior Chest Radiography Images (X-Rays).

Published: 11-05-2021 | Version 3 | DOI: 10.17632/9xkhgts2s6.3
Gokul Lal KV,
Sunny Prakash Prajapati,
Rahul Bhaumik,
Sanjana Shivakumar,
Kriti Bhalla


This is a combined, curated dataset of COVID-19 chest X-ray images obtained by collating 15 publicly available datasets, which are listed under the references section. The present dataset contains 1281 COVID-19 X-Rays, 3270 normal X-Rays, 1656 viral pneumonia X-Rays, and 3001 bacterial pneumonia X-Rays.

The collected datasets, as cited by this dataset, were combined to form an integrated repository containing a total of 4558 COVID-19 X-Rays, 5403 normal X-Rays, 4497 viral pneumonia X-Rays, and 5768 bacterial pneumonia X-Rays. Of these, 1379 COVID-19 X-Rays, 1476 normal X-Rays, 2690 viral pneumonia X-Rays, and 2588 bacterial pneumonia X-Rays were found to be duplicates, based on image similarity, and were removed.

The Inception V3 architecture was used to obtain image embeddings, followed by unsupervised learning algorithms based on cosine similarity distances. These distances were clustered and then visualized to find the following categories of image defects:

1. Noise
2. Pixelated
3. Compressed
4. Medical implants
5. Washed-out image
6. Side view
7. CT (sliced) image
8. Aspect ratio distortion / cropped / zoomed
9. Rotated images
10. Images with annotations

The clusters of defective images were removed during the curation process, and the resulting refined dataset is available for download.
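
The similarity-based duplicate check described above can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes the image embeddings have already been computed (for example by an Inception V3 model) and are available as plain lists of floats, and it replaces the clustering step with a simple pairwise threshold. The function names, the toy vectors, and the threshold of 0.98 are all hypothetical choices made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as similar to nothing
    return dot / (norm_a * norm_b)


def find_duplicates(embeddings, threshold=0.98):
    """Split image indices into (kept, duplicates).

    An image is flagged as a duplicate when its embedding is at least
    `threshold` similar to an image that was already kept. The threshold
    is an assumed value, not one taken from the dataset description.
    """
    kept, duplicates = [], []
    for i, emb in enumerate(embeddings):
        if any(cosine_similarity(emb, embeddings[k]) >= threshold for k in kept):
            duplicates.append(i)
        else:
            kept.append(i)
    return kept, duplicates


# Toy 3-dimensional "embeddings"; real Inception V3 features are much longer.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.999, 0.01, 0.0],  # near-copy of image 0
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.99, 0.02],   # near-copy of image 2
]

kept, duplicates = find_duplicates(toy_embeddings)
print("kept:", kept)              # kept: [0, 2, 3]
print("duplicates:", duplicates)  # duplicates: [1, 4]
```

The pairwise comparison is quadratic in the number of images, so for a repository of this size a vectorized or approximate nearest-neighbour search would likely be preferable in practice.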