Chest X-ray dataset for lung segmentation

Published: 31 January 2022 | Version 1 | DOI: 10.17632/8gf9vpkhgy.1


This dataset combines three popular lung segmentation datasets: Darwin, Montgomery, and Shenzhen. It gives researchers and clinicians access to a good-quality dataset, a large proportion of which has been manually annotated. The combined dataset consists of 6,810 images with corresponding binary lung masks, distributed across the three source datasets as follows:

• 6,106 images from the Darwin dataset;
• 138 images from the Montgomery dataset;
• 566 images from the Shenzhen dataset.

The Darwin dataset [1, 2] images include most of the heart, revealing lung opacities behind the heart, which may be relevant for assessing the severity of viral pneumonia. The lower-most part of the lungs, where visible, is defined by the extent of the diaphragm. Where the diaphragm is present and does not hinder distinguishing the lungs, it is included up to the lower-most visible part of the lungs. A key property of this dataset is that image resolutions, sources, and orientations vary, with the smallest image being 156x156 pixels and the largest being 5600x4700 pixels. The dataset also includes portable X-ray images, which are of significantly lower quality than standard X-rays. A key limitation of the Darwin dataset is that it does not contain lateral X-ray lung segmentations. Lung segmentations were performed by human annotators using Darwin's Auto-Annotate AI and were then adjusted and reviewed by expert radiologists.

Both the Montgomery and Shenzhen datasets [3] were published by the United States National Library of Medicine and consist of posteroanterior chest X-ray images. These images are available to foster research in computer-aided diagnosis of pulmonary diseases, with a special focus on pulmonary tuberculosis.
The Montgomery and Shenzhen datasets were acquired from the Department of Health and Human Services (Maryland, USA) and Shenzhen No. 3 People's Hospital (Shenzhen, China), respectively. Both datasets contain normal and abnormal chest X-ray images with manifestations of tuberculosis and include associated radiologist readings.

References:
1. Darwin's Auto-Annotate AI. Available:
2. COVID-19 X-ray dataset. Available:
3. Jaeger S, Candemir S, Antani S, Wáng Y-XJ, Lu P-X, Thoma G. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant Imaging Med Surg. 2014;4: 475–477. doi:10.3978/j.issn.2223-4292.2014.11.20
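
Usage note on the variable image resolutions described above: because the images range from 156x156 to 5600x4700 pixels, most segmentation pipelines will need to bring each image and its mask to a common size. The sketch below is not part of the dataset and is only an illustration. It shows nearest-neighbour resizing of a binary mask in pure Python, which copies existing pixel values and so keeps the mask strictly binary, whereas interpolating filters can introduce intermediate values. The function name `resize_nearest`, the list-of-lists mask representation, and the toy mask are all hypothetical. In practice an imaging library would do this work.

```python
from typing import List

Mask = List[List[int]]  # rows of 0/1 pixel values (hypothetical representation)


def resize_nearest(mask: Mask, out_h: int, out_w: int) -> Mask:
    """Resize a 2-D mask with nearest-neighbour sampling.

    Each output pixel copies exactly one input pixel, so no new values
    are created and a binary mask stays binary.
    """
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Toy 4x4 mask (not taken from the dataset): right half is lung (1),
# left half is background (0).
toy_mask = [[0, 0, 1, 1] for _ in range(4)]

small = resize_nearest(toy_mask, 2, 2)  # downscale
large = resize_nearest(toy_mask, 8, 8)  # upscale

# Every value in the resized masks is still 0 or 1.
still_binary = all(v in (0, 1) for row in small + large for v in row)
```

The same reasoning applies when using a library: the X-ray image itself can be resized with a smoothing filter, but the mask should be resized with a nearest-neighbour mode or re-thresholded afterwards.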



National Research Tomsk Polytechnic University, Beth Israel Deaconess Medical Center, Georgia State University


Image Segmentation, Machine Learning, Lung, Pneumonia, X-Ray, Tuberculosis, Deep Learning, COVID-19