Single-cell RNA sequencing (scRNA-seq) is one of the most cost-effective and effective technologies for revealing cellular heterogeneity and diversity. Precise identification of cell types establishes a robust foundation for downstream analyses and is a prerequisite for understanding heterogeneous mechanisms. However, the accuracy of existing methods leaves room for improvement, and highly accurate methods often impose stringent equipment requirements. Moreover, most unsupervised learning-based approaches require the number of cell types to be supplied a priori, which limits their widespread application. In this paper, we propose a novel algorithmic framework named WLGG. First, to capture the underlying nonlinear information, we introduce a weighted distance penalty term based on the Gaussian kernel function, which maps data from a low-dimensional nonlinear space to a high-dimensional linear space. We then impose a Lasso constraint on the regularized Gaussian graphical model to enhance its ability to capture linear data characteristics. Additionally, we use the Eigengap strategy to predict the number of cell types and obtain predicted labels via spectral clustering. Experimental results on 14 test datasets demonstrate the superior clustering accuracy of the WLGG algorithm over 16 alternative methods. Furthermore, downstream analyses based on the similarity matrix and predicted labels from the WLGG algorithm, including marker gene identification, pseudotime inference, and functional enrichment analysis, substantiate the reliability of WLGG and offer valuable insights into dynamic biological processes and regulatory mechanisms.
Keywords: cell type identification; graphical model; pseudotime analysis; scRNA-seq data; weighted distance.
© The Author(s) 2024. Published by Oxford University Press.
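To make the Eigengap step mentioned in the abstract concrete, the following is a minimal sketch of how the number of clusters can be estimated from a similarity matrix. It is not the WLGG implementation: a plain Gaussian-kernel similarity stands in for the similarity matrix that WLGG learns, and the function names `gaussian_similarity` and `estimate_num_clusters`, the bandwidth `sigma`, and the cap `max_k` are hypothetical choices made for illustration only.

```python
import numpy as np


def gaussian_similarity(X, sigma=1.0):
    """Pairwise Gaussian-kernel similarity between rows of X (cells x features).

    Assumption: a stand-in for a learned similarity matrix, not WLGG's own.
    """
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def estimate_num_clusters(S, max_k=10):
    """Eigengap heuristic on the symmetric normalized graph Laplacian.

    Returns the k at which the gap between the k-th and (k+1)-th smallest
    eigenvalues is largest, searching k = 1..max_k.
    """
    degree = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    laplacian = np.eye(S.shape[0]) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(laplacian)  # returned in ascending order
    upper = min(max_k, len(eigvals) - 1)
    gaps = np.diff(eigvals[: upper + 1])  # gaps[i] = eigvals[i + 1] - eigvals[i]
    return int(np.argmax(gaps)) + 1


# Toy data: three well-separated groups of 20 synthetic "cells" in 2-D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [20.0, 0.0], [0.0, 20.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])

S = gaussian_similarity(X, sigma=1.0)
k_hat = estimate_num_clusters(S)
print("estimated number of clusters:", k_hat)
```

With an estimate such as `k_hat` in hand, labels could then be obtained by running spectral clustering on the same similarity matrix with that number of clusters, which mirrors the order of steps described in the abstract.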