Predicting Invasiveness of Lung Adenocarcinoma from Chest CT with Few-shot Vision-Language Ternary Classification Model
Description
This dataset contains the research data used in the study “Predicting Invasiveness of Lung Adenocarcinoma from Chest CT with Few-shot Vision–Language Ternary Classification Model.” It covers 848 patients with pathologically confirmed lung adenocarcinoma, collected across four medical centers, and supports a study evaluating the GPT-4o vision–language model for ternary classification of pure ground-glass nodules (pGGNs).
The model inputs are provided in MP4 format and organized into three folders by pathological subtype: preinvasive lesions (MP4_PRE; n = 333), minimally invasive adenocarcinomas (MP4_MIA; n = 376), and invasive adenocarcinomas (MP4_IAC; n = 139).
To promote transparency and reproducibility, the dataset also includes two supplementary scripts, "dicm_to_nii.py" and "nii_to_mp4.py", which document the anonymization and data conversion steps used in this study. They show the step-by-step transformation from the original DICOM-format CT images to anonymized NIfTI (.nii.gz) files, and then to the MP4-format videos used as model inputs. This workflow gives researchers a clear reference for protecting patient privacy when applying online vision–language models to medical imaging data.
Because Mendeley Data limits storage to 10 GB, we uploaded only the video data used as inputs for the vision–language models (GPT-4o, Google Gemini 2.5 Pro, and Molmo), which occupy 9.94 GB in total. Accordingly, the only imaging data in this dataset are the anonymized videos used for model analysis. The CT images of all patients (81.1 GB in total) will be released through another repository with sufficient capacity.
If you use this dataset, please cite the following publication: “Predicting Invasiveness of Lung Adenocarcinoma from Chest CT with Few-shot Vision–Language Ternary Classification Model.”
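After downloading, the folder layout described above can be checked against the stated case counts. The following is a minimal sketch, not part of the published scripts. It assumes each subtype folder holds one .mp4 file per patient directly inside it (no nested subfolders); the function names and the root path passed in are hypothetical.

```python
from pathlib import Path

# Expected videos per subtype folder, taken from the dataset description.
EXPECTED_COUNTS = {"MP4_PRE": 333, "MP4_MIA": 376, "MP4_IAC": 139}


def count_videos(root):
    """Count .mp4 files found directly inside each expected subtype folder."""
    root = Path(root)
    counts = {}
    for folder in EXPECTED_COUNTS:
        folder_path = root / folder
        if folder_path.is_dir():
            counts[folder] = sum(
                1
                for p in folder_path.iterdir()
                if p.is_file() and p.suffix.lower() == ".mp4"
            )
        else:
            # A missing folder is reported as zero videos.
            counts[folder] = 0
    return counts


def find_mismatches(counts):
    """Return {folder: (found, expected)} for folders whose count differs."""
    return {
        name: (counts.get(name, 0), expected)
        for name, expected in EXPECTED_COUNTS.items()
        if counts.get(name, 0) != expected
    }
```

Calling find_mismatches(count_videos(root)) on the download directory should return an empty dictionary if the download is complete and the one-file-per-patient assumption holds.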
Steps to reproduce
Each anonymized video (.mp4) file contains the CT image sequence of one patient, converted into ordered frames for direct input into vision–language models such as GPT-4o. These MP4 files are the standardized input format used for model training and evaluation. For detailed steps, please refer to the demonstration video in the supplementary materials attached to this study.
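Some vision–language model interfaces limit how many images can be sent per request. The sketch below shows one way to choose evenly spaced frame positions from a video in that case. The description does not state that the study subsampled frames, so this is an illustrative assumption; the function name is hypothetical, and decoding the frames themselves would need a video library that is not shown here.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick up to num_samples evenly spaced frame indices, in ascending order."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Fewer frames than requested: keep every frame.
        return list(range(total_frames))
    if num_samples == 1:
        # A single sample is taken from the middle of the sequence.
        return [total_frames // 2]
    # Spread samples so the first and last frames are always included.
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]
```

Keeping the indices in ascending order preserves the slice order of the original CT sequence.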
Institutions
- China Medical University