A transcriptomic based deconvolution framework for assessing differentiation stages and drug responses of AML
Count table for LUMC data. We downloaded the non-normalized count matrices (htseq-counts) and the metadata files of the four discovery cohorts (TCGA-LAML, BEAT-AML, TARGET-AML and TARGET-ALL) from https://portal.gdc.cancer.gov. For LEUCEGENE, count data were downloaded from the dedicated site (https://data.leucegene.iric.ca/) along with the metadata provided there. All metadata and count data were pre-processed using R (v4.1.0). For the metadata, genomic aberration labels were mapped to the main WHO 2016 AML classes, non-AML samples were removed from the downstream analyses, and ELN classes were relabeled according to the ELN 2017 recommendations. For the count data, ERCC spike-ins and mitochondrial genes were removed. The count matrix was then sorted by per-gene standard deviation so that, among duplicated genes, the copies with less variation (and therefore less information) could be removed. Lastly, the Ensembl gene IDs were converted to gene symbols. Before this conversion, the stemness score for each patient was calculated from counts-per-million (CPM) normalized libraries.
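The count-data steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' R code. The toy matrix, the gene IDs, the ID-to-symbol map, and the function names are all hypothetical. It also assumes that spike-ins are recognized by an "ERCC-" ID prefix, that mitochondrial genes are recognized by an "MT-" symbol prefix, and that CPM is computed on the filtered matrix; the description does not state any of these.

```python
import statistics

# Hypothetical toy count matrix: gene ID -> raw counts for three samples.
counts = {
    "ENSG_A": [10, 200, 30],
    "ENSG_B": [12, 14, 13],
    "ENSG_C": [500, 20, 90],
    "ERCC-00002": [1000, 900, 1100],
    "ENSG_MT": [50, 60, 70],
}

# Hypothetical ID -> symbol map; ENSG_A and ENSG_B share one symbol.
id_to_symbol = {
    "ENSG_A": "GENE1",
    "ENSG_B": "GENE1",
    "ENSG_C": "GENE2",
    "ENSG_MT": "MT-CO1",
}


def filter_genes(counts, id_to_symbol):
    """Drop ERCC spike-ins and mitochondrial genes (prefix rules are assumptions)."""
    kept = {}
    for gene_id, row in counts.items():
        symbol = id_to_symbol.get(gene_id, "")
        if gene_id.startswith("ERCC-") or symbol.startswith("MT-"):
            continue
        kept[gene_id] = row
    return kept


def cpm(counts):
    """Scale each sample (column) to counts per million of its library size."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene_id: [1e6 * c / lib_sizes[j] for j, c in enumerate(row)]
        for gene_id, row in counts.items()
    }


def to_symbols(counts, id_to_symbol):
    """Convert IDs to symbols, keeping the most variable row per symbol."""
    # Sort by standard deviation, highest first, so the first hit per symbol wins.
    ordered = sorted(counts, key=lambda g: statistics.stdev(counts[g]), reverse=True)
    by_symbol = {}
    for gene_id in ordered:
        symbol = id_to_symbol.get(gene_id)
        if symbol is None or symbol in by_symbol:
            continue  # unmapped ID, or a less variable duplicate
        by_symbol[symbol] = counts[gene_id]
    return by_symbol


filtered = filter_genes(counts, id_to_symbol)
cpm_by_id = cpm(filtered)          # still keyed by gene ID, as input for a stemness score
by_symbol = to_symbols(filtered, id_to_symbol)
```

Here `cpm_by_id` stays keyed by the original gene IDs, mirroring the statement that the stemness score was computed before the conversion to symbols, while `by_symbol` holds one row per gene symbol.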
Steps to reproduce
Our 100 AML samples (LUMC) were deposited in EGA under accession number EGAS00001003096 and are accessible upon request. QC benchmark analyses for these samples were done in our previous paper [10]. Therefore, we ran the default HTSeq pipeline (v0.11.2) with the paired-end option, aligning the FASTQ files to hg38, to obtain the count matrix. All of the pre-processing steps mentioned above (filtering, gene name conversion) were also applied to these samples before deconvolution.