Lung Cancer CT scan image for Federated Learning
Description
This dataset contains 3,200 high-quality slice-level lung CT images categorized into three diagnostic classes. It is designed to support machine learning, deep learning, and federated learning research on the automatic detection and classification of lung cancer, with a focus on privacy-preserving medical AI, multi-institutional collaboration, and explainable diagnostic systems.

Dataset Composition:
The dataset is organized into three classes based on standard radiological features and pathological definitions:
⦁ Malignant (1,310 images): Confirmed carcinomas, such as Adenocarcinoma, typically characterized by irregular margins, nodular densities, and surrounding parenchymal distortion.
⦁ Benign (905 images): Non-cancerous nodules, such as Hamartomas or Granulomas, characterized by circumscribed structures, smooth margins, or calcification.
⦁ Normal (985 images): Scans exhibiting clear lung parenchyma with no evidence of nodules or pathological focal points.

Data Preprocessing and Structuring:
⦁ Images are formatted as single-channel (grayscale) CT slices.
⦁ All images were resized to 224×224 pixels in a standardized preprocessing pipeline to ensure uniformity across samples.
⦁ Scans were enhanced with Contrast Limited Adaptive Histogram Equalization (CLAHE) to reduce differences in brightness and contrast, and pixel intensities were normalized to the range 0 to 1 using a shared per-channel mean and variance.

Applications:
This dataset can be used for:
⦁ Medical image classification and computer-aided oncology diagnostics.
⦁ Development and benchmarking of deep learning architectures (e.g., CNNs, Vision Transformers, and Hybrid models).
⦁ Simulating decentralized, multi-institutional Federated Learning (FL) environments, specifically for testing Non-IID (non-independent and identically distributed) data partitioning with quantity skew and label heterogeneity.
⦁ Evaluating Explainable AI (XAI) frameworks, such as validating Grad-CAM saliency maps and quantitative trustworthiness metrics like Deletion AUC.

File Information:
⦁ Total Images: 3,200
⦁ Color Space: Grayscale (Single-channel)
⦁ Resolution: 224×224 pixels
⦁ Folder Structure: Each class is stored in a separate labeled directory (Malignant, Benign, Normal).
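The preprocessing described above could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the CLAHE settings (clip_limit, tile_grid), the function names, and the use of OpenCV and NumPy are all assumptions. The description mentions both a 0 to 1 range and a shared mean and variance; the sketch reads this as scaling to [0, 1] first and then optionally standardizing, after which values no longer stay inside [0, 1]. If the published files are already CLAHE-enhanced, that step may be unnecessary.

```python
import numpy as np

TARGET_SIZE = (224, 224)  # resolution stated in the dataset description


def load_enhanced_slice(path, clip_limit=2.0, tile_grid=(8, 8)):
    """Read one slice as grayscale, resize it, and apply CLAHE (returns uint8)."""
    import cv2  # OpenCV; imported here so the NumPy-only helpers run without it

    img = cv2.imread(str(path), cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {path}")
    if img.shape != TARGET_SIZE:
        img = cv2.resize(img, TARGET_SIZE, interpolation=cv2.INTER_AREA)
    # clip_limit and tile_grid are assumed values, not the authors' settings.
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    return clahe.apply(img)


def to_unit_range(img_u8):
    """Scale 8-bit pixel values to float32 in [0, 1]."""
    return img_u8.astype(np.float32) / 255.0


def shared_stats(images):
    """Compute one mean and standard deviation over a list of same-sized images."""
    stack = np.stack(images).astype(np.float64)
    return float(stack.mean()), float(stack.std())


def standardize(img, mean, std):
    """Subtract the shared mean and divide by the shared standard deviation."""
    return (img - mean) / max(std, 1e-8)
```

In a typical workflow the shared statistics would be computed on the training split only and then reused for the validation and test splits, so that no information leaks from held-out data.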
Steps to reproduce
1. Download the dataset:
Access the dataset files from this Mendeley Data repository and extract the contents into your working directory.

2. Dataset structure:
The dataset is organized into three directories, one per diagnostic class, which allows loading with standard image data generators:

LungCT_Dataset/
├── Malignant/
│   ├── malignant_001.png
│   ├── ...
├── Benign/
│   ├── benign_001.png
│   ├── ...
└── Normal/
    ├── normal_001.png
    ├── ...

3. Data preprocessing:
⦁ Ensure all images are loaded as single-channel (grayscale) CT slices and verify that each image has the harmonized resolution of 224×224 pixels.
⦁ Apply Contrast Limited Adaptive Histogram Equalization (CLAHE) to enhance the structural visibility of the lung parenchyma and nodules.
⦁ Normalize the pixel intensities to a range of [0, 1] using the shared per-channel mean and variance so that intensities are distributed consistently across all samples.
⦁ Split the data into training, validation, and testing sets (e.g., 70/15/15), keeping all slices from the same patient in a single split if patient metadata is available.

4. Model training example:
⦁ Load the data using frameworks like TensorFlow or PyTorch.
⦁ Select a model architecture suitable for medical imaging, such as a standard CNN (e.g., ResNet, DenseNet), a Vision Transformer (ViT), or a Hybrid model.
⦁ To simulate multi-institutional environments, partition the data across several virtual clients using Non-IID strategies (e.g., Dirichlet distribution) to introduce label heterogeneity and quantity skew, then train using a Federated Learning framework like Flower (flwr) or FedML.

5. Evaluation:
⦁ Evaluate the model using standard medical classification metrics, including accuracy, precision, recall (sensitivity), and F1-score across the three classes.
⦁ Assess the diagnostic explainability of your model by generating Grad-CAM saliency maps to verify that the model focuses on relevant pathological features (e.g., irregular margins for Malignant vs. circumscribed structures for Benign).
⦁ Compute quantitative trustworthiness XAI metrics, such as Deletion AUC, to objectively measure the fidelity of the generated explanations.
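The Non-IID client partitioning mentioned in step 4 could look roughly like the sketch below. It is a dependency-free illustration under stated assumptions: the function name dirichlet_partition, the number of clients, the concentration value alpha, and the seed are all hypothetical choices, not settings from the dataset authors. For each class, a Dirichlet sample (built from normalized Gamma draws) decides what share of that class each client receives, which tends to produce both label heterogeneity and unequal client sizes.

```python
import random
from collections import defaultdict


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with Dirichlet(alpha) label skew."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n = len(indices)
        # A Dirichlet sample equals independent Gamma(alpha, 1) draws
        # divided by their sum.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas)
        shares = [g / total for g in gammas]

        # Turn the shares into consecutive slices of the shuffled indices.
        start, cumulative = 0, 0.0
        for client_id, share in enumerate(shares):
            cumulative += share
            if client_id == num_clients - 1:
                end = n  # the last client takes whatever remains
            else:
                end = min(n, round(cumulative * n))
            clients[client_id].extend(indices[start:end])
            start = end
    return clients
```

Smaller alpha values (for example 0.1) concentrate each class on fewer clients, while larger values (for example 10) approach an even split. The resulting index lists can then be handed to whichever federated learning framework is used to build per-client data loaders.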
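The Deletion AUC metric named in step 5 could be approximated as in the sketch below. It assumes the common formulation in which the most salient pixels are removed first and the model's probability for the target class is recorded after each removal step; a lower area under that curve suggests the saliency map ranks influential pixels first. The function name, the flat-list image representation, the baseline value of 0.0, and the predict_prob callable are illustrative assumptions, kept dependency-free for clarity; in practice the inputs would be arrays and predict_prob would wrap a trained model.

```python
def deletion_auc(image, saliency, predict_prob, steps=10, baseline=0.0):
    """Area under the probability curve as salient pixels are deleted.

    image, saliency: flat sequences of equal length.
    predict_prob: callable mapping a pixel list to the target-class probability.
    """
    if len(image) != len(saliency):
        raise ValueError("image and saliency must have the same length")

    n = len(image)
    # Pixel indices ordered from most to least salient.
    order = sorted(range(n), key=lambda i: saliency[i], reverse=True)

    current = list(image)
    probs = [predict_prob(current)]  # probability before any deletion
    for step in range(1, steps + 1):
        lo = (step - 1) * n // steps
        hi = step * n // steps
        for i in order[lo:hi]:
            current[i] = baseline  # "delete" the pixel
        probs.append(predict_prob(current))

    # Trapezoidal rule over the deleted fraction, which runs from 0 to 1.
    dx = 1.0 / steps
    return sum((probs[k] + probs[k + 1]) / 2.0 * dx for k in range(steps))
```

As a sanity check, a saliency map that points at the pixels a model actually relies on should yield a lower score than one that points elsewhere, which is the comparison this metric is meant to support.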
Institutions
- Daffodil International University, Dhaka Division, Dhaka