Single Nucleus RNA Sequencing of Pre-Malignant Liver Reveals Disease-Associated Hepatocyte State with HCC Prognostic Potential
Current approaches to stage chronic liver diseases have limited utility to directly predict liver cancer risk. Here, we employed single nucleus RNA sequencing (snRNA-seq) to characterize the cellular microenvironment of healthy and chronically injured pre-malignant livers using two distinct mouse models. Analysis of 40,748 hepatic nuclei unraveled a previously uncharacterized disease-associated hepatocyte transcriptional state (daHep). These cells were absent in healthy livers, but were increasingly prevalent as chronic liver disease progressed towards hepatocarcinogenesis. Ultra-low-pass whole genome sequencing (LP-WGS) followed by copy number variation (CNV) analysis of microdissected mouse tissue demonstrated that daHep enriched regions are riddled with structural variants, suggesting these cells represent a pre-malignant intermediary. Integrated analysis of three recent human snRNA-seq datasets confirmed the presence of a similar phenotype in human chronic liver disease and further supported its enhanced mutational burden compared to normal hepatocytes. Gene expression deconvolution of 1,439 human liver transcriptomes from publicly available datasets revealed that daHep frequencies highly correlate with current histopathological liver disease staging systems. Importantly, we show that high daHep levels precede carcinogenesis in mice and humans and predict a higher risk of hepatocellular carcinoma (HCC) development. This novel transcriptional signature with diagnostic and, more importantly, prognostic significance has the potential to change the way chronic liver disease patients are staged, surveilled and risk-stratified.
Steps to reproduce
For mouse snRNA-seq data, Cell Ranger outputs were read into individual Seurat R package v4 objects (Hao et al., 2021) using the functions Read10x, then CreateSeuratObject. For each sample independently, quality control filtering was done based on the number of features (nGene) and the percentage of mitochondrial RNA. Only barcodes with > 500 and < 3000 genes and with less than 5% mitochondrial genes were maintained. Seurat objects corresponding to individual samples were merged into one combined object, then data was normalized, scaled, and the top 2000 variable features identified using the functions NormalizeData, ScaleData and FindVariableFeatures respectively. Next, we implemented a manual supervised approach to remove low quality and doublet barcodes. The approach was based on successive rounds of clustering, identification and removal of clusters corresponding to low quality and doublet nuclei. Low quality clusters likely corresponded to empty droplets that were contaminated with ambient RNA, these were characterized by presenting a low average number of features and expression of highly expressed cell type-specific genes from multiple cell types. Doublets were identified and removed based on high expression of canonical cell type-specific genes from two cell types; these clusters also presented an average number of features above the mean of other clusters in the dataset. The standard Seurat workflow recommends linear dimensional reduction by principal component analysis (PCA), followed by clustering and non-linear dimensional reduction (tSNE and UMAP). When this approach was utilized, clusters were driven by treatment group instead of cell types (Figure S1C). Thus, we implemented Batchelor (Haghverdi et al., 2018), a batch correction approach based on mutual nearest neighbor (MNN), then passed the top 25 components of the MNN output to the FindNeighbors, RunTSNE and RunUMAP functions and calculated the Louvain clusters using the FindClusters function with a resolution of 0.05. Differential expression analysis was conducted using the default Wilcoxon Rank Sum test with the FindAllMarkers function retaining only those genes expressed in at least 25% of the cells in a given cluster and a log-fold change of at least 0.25 compared to all remaining cells. Nine clusters were obtained and annotated based on cell-specific marker expression. Individual clusters were subset in separate objects for re-clustering. Each of these subsets were reanalyzed in isolation similarly to above, however using the FindClusters function with a resolution between 1 and 2.5. For human snRNA-seq data, Cell Ranger outputs were downloaded from the following Gene Expression Omnibus datasets GSE185477, GSE174748, GSE192742 and GSE212046. A similar pipeline as above was implemented, however QC filtering was based on number of UMIs > 500 and < 15,000.