Single Nucleus RNA Sequencing of Pre-Malignant Liver Reveals Disease-Associated Hepatocyte State with HCC Prognostic Potential
Current approaches to stage chronic liver diseases have limited utility to directly predict liver cancer risk. Here, we employed single nucleus RNA sequencing (snRNA-seq) to characterize the cellular microenvironment of healthy and chronically injured pre-malignant livers using two distinct mouse models. Analysis of 40,748 hepatic nuclei unraveled a previously uncharacterized disease-associated hepatocyte transcriptional state (daHep). These cells were absent in healthy livers, but were increasingly prevalent as chronic liver disease progressed towards hepatocarcinogenesis. Gene expression deconvolution of 1,439 human liver transcriptomes from publicly available datasets revealed that daHep frequencies highly correlate with current histopathological liver disease staging systems. Importantly, we show that high daHep levels precede carcinogenesis in mice and humans and predict a higher risk of hepatocellular carcinoma (HCC) development. This novel transcriptional signature with diagnostic and, more importantly, prognostic significance has the potential to change the way chronic liver disease patients are staged, surveilled and risk-stratified.
Steps to reproduce
Cell Ranger outputs were read into individual Seurat R package v4 objects (Hao et al., 2021) using the functions Read10x, then CreateSeuratObject. For each sample independently, quality control filtering was done based on the number of features (nGene) and the percentage of mitochondrial RNA. Only barcodes with > 500 and < 3000 genes and with less than 5% mitochondrial genes were maintained. Seurat objects corresponding to individual samples were merged into one combined object, then data was normalized, scaled, and the top 2000 variable features identified using the functions NormalizeData, ScaleData and FindVariableFeatures respectively. Next, we implemented a manual supervised approach to remove low quality and doublet barcodes. The approach was based on successive rounds of clustering, identification and removal of clusters corresponding to low quality and doublet nuclei. Low quality clusters likely corresponded to empty droplets that were contaminated with ambient RNA, these were characterized by presenting a low average number of features and expression of highly expressed cell type-specific genes from multiple cell types. Doublets were identified and removed based on high expression of canonical cell type-specific genes from two cell types; these clusters also presented an average number of features above the mean of other clusters in the dataset. The standard Seurat workflow recommends linear dimensional reduction by principal component analysis (PCA), followed by clustering and non-linear dimensional reduction (tSNE and UMAP). When this approach was utilized, clusters were driven by treatment group instead of cell types (Figure S1C). Thus, we implemented Batchelor (Haghverdi et al., 2018), a batch correction approach based on mutual nearest neighbor (MNN), then passed the top 25 components of the MNN output to the FindNeighbors, RunTSNE and RunUMAP functions and calculated the Louvain clusters using the FindClusters function with a resolution of 0.05. This approach resulted in clusters driven by cell type that were contributed by barcodes originating from all treatment groups (Figure S1D). Using the above approach, we obtained a combined dataset with a total of 40,748 nuclei from n=9 mice (three per treatment group) and 28,692 genes detected. Differential expression analysis was conducted using the default Wilcoxon Rank Sum test with the FindAllMarkers function retaining only those genes expressed in at least 25% of the cells in a given cluster and a log-fold change of at least 0.25 compared to all remaining cells. Nine clusters were obtained and annotated based on cell-specific marker expression. Individual clusters corresponding to hepatocytes, mesenchymal, endothelial, biliary epithelial and myeloid lineages were subset in separate objects for re-clustering. Each of these subsets were reanalyzed in isolation similarly to above, however using the FindClusters function with a resolution between 1 and 2.5.