TCGA-Reports: A Machine-Readable Pathology Report Resource for Benchmarking Text-Based AI Models. Kefeli et al.
This dataset contains 9,523 machine-readable pathology reports originally published by The Cancer Genome Atlas (TCGA) in PDF format. Thirty-two tissues or cancer types are represented in the final dataset, and the reports were generated at a wide variety of institutions. The reports have undergone data selection, OCR translation, TCGA artifact removal, form removal, and site-specific section header removal. The final dataset consists of clean, high-quality text with high information content. Please see Kefeli et al. (2024) for additional details.
Steps to reproduce
PDFs were downloaded from the Genomic Data Commons (GDC) portal for TCGA data. PDFs that were of low scan quality or that consisted of placeholder "missing report" sheets were removed. The remaining reports were decomposed into high-quality page images and OCR-translated via Amazon Web Services Textract. Multiple-choice forms were not translated consistently and were therefore identified and removed. The Textract output was then post-processed to remove TCGA-generated artifacts: TCGA-inserted quality control tables, TCGA-generated handwritten annotations, and TCGA-imposed redaction bars, all of which can interrupt the true report text and add noise to the final text. In addition, site-specific section headers were identified manually and then removed automatically using regular expressions. The final dataset consists of clean, high-quality, high-information-content text, which can be combined with other TCGA data, such as genomic and whole slide image data, to build more accurate clinical models.
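The header-removal step can be illustrated with a minimal Python sketch. It is not the authors' code: the header list, the function names, and the exact regular expression are assumptions made for illustration, since the actual site-specific headers are not listed here. The only external fact relied on is that a Textract text-detection response contains a "Blocks" list whose "LINE" blocks carry a "Text" field.

```python
import re

# Hypothetical examples of section headers; the real, manually curated
# site-specific list is not given in this description.
SECTION_HEADERS = ["FINAL DIAGNOSIS", "GROSS DESCRIPTION", "CLINICAL HISTORY"]

# Match a known header at the start of a line, with an optional colon.
HEADER_RE = re.compile(
    r"^[ \t]*(?:" + "|".join(re.escape(h) for h in SECTION_HEADERS) + r")[ \t]*:?[ \t]*",
    flags=re.IGNORECASE | re.MULTILINE,
)


def lines_from_textract(response):
    """Collect the text of every LINE block in a Textract-style response dict."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def strip_headers(text):
    """Remove known section headers and drop lines left empty afterwards."""
    cleaned = HEADER_RE.sub("", text)
    return "\n".join(line.strip() for line in cleaned.splitlines() if line.strip())


if __name__ == "__main__":
    # A hand-written stand-in for an OCR response, not real TCGA data.
    response = {
        "Blocks": [
            {"BlockType": "PAGE"},
            {"BlockType": "LINE", "Text": "FINAL DIAGNOSIS:"},
            {"BlockType": "LINE", "Text": "Invasive ductal carcinoma."},
            {"BlockType": "LINE", "Text": "Gross Description: Specimen received in formalin."},
        ]
    }
    print(strip_headers("\n".join(lines_from_textract(response))))
```

Anchoring the pattern at the start of a line keeps the same words from being removed when they occur mid-sentence in the report body; a curated per-site header list would replace the placeholder entries above.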