Data matrix and metafiles for CRC scRNAseq.rds.gz: A total of 54,103 cell transcriptomes from 5 colorectal cancer (CRC) patients, of which 29,481 cells originated from tumors and 24,622 cells from adjacent non-malignant tissues. The file is the Seurat object list, which includes the expression matrix and the annotation of 58 cell types.
CellRanger_Output: Sequenced reads were mapped to the GRCh38 whole genome using the count function (cellranger count) of 10X Genomics' Cell Ranger 3 software. This compressed file includes the Cell Ranger filtered_feature_bc_matrix output for each sample.
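For readers who want to inspect one sample of the Cell Ranger output without R, the following is a minimal sketch in plain Python. It assumes each sample directory holds the three files that Cell Ranger 3 writes into filtered_feature_bc_matrix (barcodes.tsv.gz, features.tsv.gz, matrix.mtx.gz). The function name read_filtered_matrix and the toy files are hypothetical, and the directory layout inside the compressed file is not described here, so the path you pass in is an assumption to adapt. In R, Seurat's Read10X is the usual route; this sketch is not part of the original workflow.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def read_filtered_matrix(matrix_dir):
    """Read one filtered_feature_bc_matrix directory into plain Python structures.

    Returns (gene_names, barcodes, counts), where counts maps zero-based
    (gene_index, cell_index) pairs to UMI counts; absent pairs are zero.
    """
    matrix_dir = Path(matrix_dir)
    with gzip.open(matrix_dir / "barcodes.tsv.gz", "rt") as fh:
        barcodes = [line.strip() for line in fh if line.strip()]
    with gzip.open(matrix_dir / "features.tsv.gz", "rt") as fh:
        # Columns are: feature ID, feature name, feature type.
        gene_names = [row[1] for row in csv.reader(fh, delimiter="\t") if row]
    counts = {}
    with gzip.open(matrix_dir / "matrix.mtx.gz", "rt") as fh:
        # Skip Matrix Market comment lines (they start with "%") and blanks.
        rows = (ln for ln in fh if ln.strip() and not ln.startswith("%"))
        n_genes, n_cells, _n_entries = map(int, next(rows).split())
        for ln in rows:
            gene_idx, cell_idx, value = ln.split()
            # Matrix Market indices are one-based.
            counts[(int(gene_idx) - 1, int(cell_idx) - 1)] = int(value)
    if n_genes != len(gene_names) or n_cells != len(barcodes):
        raise ValueError("matrix dimensions do not match features/barcodes")
    return gene_names, barcodes, counts


def _write_gz(path, text):
    with gzip.open(path, "wt") as fh:
        fh.write(text)


# Round trip on a tiny made-up matrix (2 genes x 3 cells); not real data.
with tempfile.TemporaryDirectory() as tmp:
    toy = Path(tmp)
    _write_gz(toy / "barcodes.tsv.gz", "AAAC-1\nAAAG-1\nAAAT-1\n")
    _write_gz(
        toy / "features.tsv.gz",
        "ID1\tEPCAM\tGene Expression\nID2\tCD3D\tGene Expression\n",
    )
    _write_gz(
        toy / "matrix.mtx.gz",
        "%%MatrixMarket matrix coordinate integer general\n"
        "2 3 3\n1 1 5\n2 2 4\n1 3 1\n",
    )
    genes, barcodes, counts = read_filtered_matrix(toy)

print(genes, len(barcodes), counts[(0, 0)])
```

The dictionary-of-counts representation keeps the sketch dependency-free; for full samples a sparse matrix library would be the more practical choice.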
Steps to reproduce
1. Raw 10X read alignment, quality control and normalization
Raw sequencing data were converted from BCL to FASTQ files with Illumina bcl2fastq2 Conversion Software v2.20 (https://support.illumina.com/downloads/bcl2fastq-conversion-software-v2-20.html) and quality checked with FastQC software v0.11.9 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Sequence processing and alignment to the GRCh38 genome were done with the standard Cell Ranger pipelines using default parameters (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/).
2. Dimension reduction and clustering analysis
We selected the top 2,000 most variable genes with the FindVariableFeatures function in the R package Seurat v3 and scaled the data on these genes. We used the variable genes for principal component analysis (PCA), used FindNeighbors in Seurat to find nearest neighbors for graph clustering based on the principal components (PCs), used FindClusters in Seurat to obtain cell subtypes, and visualized cells with the uniform manifold approximation and projection (UMAP) algorithm. To eliminate batch effects, we ran the Harmony algorithm from the Harmony R package before clustering analysis, then applied FindNeighbors and FindClusters in Seurat to obtain cell subtypes.
Cells were clustered in two stages: in the first stage, cells were partitioned into epithelial, stromal, myeloid, T, and B cells; in the second stage, cells from multiple samples were clustered into distinct subtypes. For the first stage, the clusters were scored for previously described gene signatures, including epithelial cells (EPCAM, KRT8, KRT18), stromal cells (COL1A1, COL1A2, COL6A1, COL6A2, VWF, PLVAP, CDH5, S100B), myeloid cells (CD68, XCR1, CLEC9A, CLEC10A, CD1C, S100A8, S100A9, TPSAB1, and OSM), T cells (NKG7, KLRC1, CCR7, FOXP3, CTLA4, CD8B, CXCR6, and CD3D), and B cells (MZB1, IGHA1, SELL, CD19, and AICDA).
Signature scores were calculated as the mean log2(LogNormalizedUMI+1) across all genes in the signature. Each cluster was assigned to the compartment with its maximal score, and all cluster assignments were manually checked to ensure accurate partitioning of cells. For the second stage, we ran the Harmony algorithm before clustering analysis to remove batch effects, and applied FindNeighbors and FindClusters in Seurat to obtain cell subtypes. As an auxiliary tool, we defined 58 cell types in CRC based on the gene signatures of each cell type and known lineage markers.
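The first-stage compartment assignment described above can be illustrated with a minimal sketch in plain Python. It takes the formula as stated (mean of log2(x + 1) over the signature genes, then the compartment with the maximal score). Several details are assumptions, not the authors' code: the input is taken to be one log-normalized expression value per gene for a cluster (how per-cell values are summarized per cluster is not specified), genes missing from the input are treated as 0, and the names signature_score and assign_compartment and the toy values are hypothetical.

```python
from math import log2

# Marker genes per compartment, as listed in the description.
SIGNATURES = {
    "epithelial": ["EPCAM", "KRT8", "KRT18"],
    "stromal": ["COL1A1", "COL1A2", "COL6A1", "COL6A2",
                "VWF", "PLVAP", "CDH5", "S100B"],
    "myeloid": ["CD68", "XCR1", "CLEC9A", "CLEC10A", "CD1C",
                "S100A8", "S100A9", "TPSAB1", "OSM"],
    "T": ["NKG7", "KLRC1", "CCR7", "FOXP3", "CTLA4",
          "CD8B", "CXCR6", "CD3D"],
    "B": ["MZB1", "IGHA1", "SELL", "CD19", "AICDA"],
}


def signature_score(cluster_expr, genes):
    """Mean of log2(x + 1) over the signature genes.

    Assumption: a gene absent from cluster_expr contributes x = 0.
    """
    return sum(log2(cluster_expr.get(g, 0.0) + 1.0) for g in genes) / len(genes)


def assign_compartment(cluster_expr, signatures=SIGNATURES):
    """Return the compartment with the maximal score, plus all scores."""
    scores = {name: signature_score(cluster_expr, genes)
              for name, genes in signatures.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Made-up cluster-level values, chosen only to make the arithmetic easy.
toy_cluster = {"EPCAM": 3.0, "KRT8": 7.0, "KRT18": 1.0, "CD3D": 1.0}
compartment, scores = assign_compartment(toy_cluster)
print(compartment, scores["epithelial"], scores["T"])
```

For the toy cluster, the epithelial score is (2 + 3 + 1) / 3 = 2.0 and the T cell score is 1 / 8 = 0.125, so the cluster is assigned to the epithelial compartment. The manual check of assignments mentioned above is not something a sketch like this can replace.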