Unified cancer and non-cancer transcriptomic data; cancer signatures identified by the TAVNIT software
Description
This dataset contains unified cancer and non-cancer transcriptomic data for the entire transcriptome and for the surfaceome (Supp 1). The data can be used as input for the TAVNIT software, a novel tool for cancer subtyping and subsequent extraction of druggable cancer signatures. The deposition also includes the output of TAVNIT runs: cell-surface cancer signatures. A cancer signature is a record of transcription values for a set of genes that are largely sufficient to distinguish cancer samples from non-cancer samples in a given dataset. Because the deposited signatures are restricted to the cell surface, their respective proteins are potential targets for CAR-T cells. The signatures were extracted for all cancer samples (Supp 2) and for prostate adenocarcinoma samples (Supp 3).
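To make the notion of a signature concrete, here is a minimal Python sketch. The representation is an assumption made for illustration only and is not TAVNIT's actual format: each hypothetical term pairs a gene with a direction and a threshold, and a sample counts as covered when every term holds. The gene names, values, and the `covers` and `coverage` helpers are invented for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical signature term: (gene, direction, threshold).
# This layout is an assumption for illustration, not TAVNIT's file format.
Term = Tuple[str, str, float]


def covers(signature: List[Term], sample: Dict[str, float]) -> bool:
    """Return True when the sample satisfies every term of the signature."""
    for gene, direction, threshold in signature:
        value = sample[gene]
        if direction == ">" and not value > threshold:
            return False
        if direction == "<" and not value < threshold:
            return False
    return True


def coverage(signature: List[Term], samples: List[Dict[str, float]]) -> int:
    """Count how many samples the signature covers."""
    return sum(covers(signature, s) for s in samples)


# Toy data with placeholder gene names and made-up expression values.
signature = [("GENE_A", ">", 5.0), ("GENE_B", "<", 1.0)]
cancer = [{"GENE_A": 7.2, "GENE_B": 0.3}, {"GENE_A": 6.1, "GENE_B": 0.8}]
non_cancer = [{"GENE_A": 2.0, "GENE_B": 0.5}, {"GENE_A": 6.0, "GENE_B": 3.4}]

print(coverage(signature, cancer), coverage(signature, non_cancer))
```

In this toy case the signature covers both cancer samples and neither non-cancer sample, which is the kind of separation a signature is meant to achieve.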
Steps to reproduce
The input data were prepared as follows. The initial transcriptomic data were sourced from [1]. The initial list of human cell-surface polypeptides was acquired from [2]. Bausch-Fluck et al., 2018 [2] outlined the human surfaceome, which consists of 2,886 polypeptides. However, not all of them could be associated with a specific gene, so an intermediate list of 2,801 genes was produced. In the data of Wang et al., 2018 [1], not all genes had expression values for all samples. To make the data comparable, two gene lists were compiled, surfaceomic (2,158 genes) and entire (18,155 genes), each of which is identical across all samples (Supp 1).

TAVNIT uses clustering with constraints to group cancer and non-cancer samples separately and uses Ant Colony Optimization for signature extraction. Here, clustering was carried out using the Pearson distance metric with the pre-defined division into cancer and non-cancer samples. Afterward, signature extraction was executed ten times using the entire input data with the following settings:

maximum number of terms in signature: 5
ant colony size (the number of ants): 50
number of iterations yielding the same best signature for identifying signature convergence: 10
minimum number of samples covered by a signature: 10
maximum number of uncovered samples to terminate the process: 500
inclusion threshold for cannot clusters: 0.1

For signature extraction, all non-cancer clusters were marked as cannot-clusters (i.e., clusters that should not be covered by extracted signatures). Signature extraction was performed solely for cell-surface proteins.
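The gene-list unification and the distance metric described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the TAVNIT code: it assumes that a unified gene list keeps only genes with an expression value in every sample, and that the Pearson distance is taken as 1 minus the Pearson correlation coefficient, a common convention that TAVNIT may implement differently. The sample names, gene names, values, and helper functions are hypothetical.

```python
import math
from typing import Dict, List


def common_genes(samples: Dict[str, Dict[str, float]]) -> List[str]:
    """Genes that have an expression value in every sample (assumed rule)."""
    gene_sets = [set(values) for values in samples.values()]
    return sorted(set.intersection(*gene_sets))


def pearson_distance(x: List[float], y: List[float]) -> float:
    """Pearson distance, taken here as 1 - Pearson correlation (assumed form)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)


# Toy data: hypothetical samples with partly missing genes.
samples = {
    "s1": {"G1": 1.0, "G2": 2.0, "G3": 3.0, "G4": 9.0},
    "s2": {"G1": 2.0, "G2": 4.0, "G3": 6.0},
    "s3": {"G1": 3.0, "G2": 2.0, "G3": 1.0, "G5": 4.0},
}

genes = common_genes(samples)
# Restrict every sample to the shared gene list, in the same gene order.
vectors = {name: [values[g] for g in genes] for name, values in samples.items()}

print(genes)
print(pearson_distance(vectors["s1"], vectors["s2"]))
print(pearson_distance(vectors["s1"], vectors["s3"]))
```

With this convention, perfectly correlated profiles are at a distance close to 0 and perfectly anti-correlated profiles at a distance close to 2, so the metric compares the shape of expression profiles rather than their absolute levels.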
Thereafter, the signature extraction script was run ten more times with the following settings:

maximum number of terms in signature: 5
ant colony size (the number of ants): 50
number of iterations yielding the same best signature for identifying signature convergence: 10
minimum number of samples covered by a signature: 10
maximum number of uncovered samples to terminate the process: 10
inclusion threshold for cannot clusters: 0.1

For these runs, only prostate adenocarcinoma samples and all non-cancer samples were included as input data. As before, all non-cancer clusters were designated as cannot-clusters, and signature extraction was restricted to cell-surface proteins.

The TAVNIT software is accessible at https://github.com/meshuga-git/TAVNIT

1. Wang Q, Armenia J, Zhang C, Penson AV, Reznik E, Zhang L, Minet T, Ochoa A, Gross BE, Iacobuzio-Donahue CA, Betel D, Taylor BS, Gao J, Schultz N. Unifying cancer and normal RNA sequencing data from different sources. Sci Data. 2018 Apr 17;5:180061.
2. Bausch-Fluck D, Goldmann U, Müller S, van Oostrum M, Müller M, Schubert OT, Wollscheid B. The in silico human surfaceome. Proc Natl Acad Sci U S A. 2018 Nov 13;115(46):E10988-E10997.