SSCC: A Novel Computational Framework for Rapid and Accurate Clustering Large-scale Single Cell RNA-seq Data

Xianwen Ren; Liangtao Zheng; Zemin Zhang

doi:10.1016/j.gpb.2018.10.003

Genomics Proteomics Bioinformatics. 2019 Apr;17(2):201-210. doi: 10.1016/j.gpb.2018.10.003. Epub 2019 Jun 13.

Authors

Xianwen Ren¹, Liangtao Zheng², Zemin Zhang³

Affiliations

¹ BIOPIC, Beijing Advanced Innovation Center for Genomics, and School of Life Sciences, Peking University, Beijing 100871, China. Electronic address: renxwise@pku.edu.cn.
² BIOPIC, Beijing Advanced Innovation Center for Genomics, and School of Life Sciences, Peking University, Beijing 100871, China.
³ BIOPIC, Beijing Advanced Innovation Center for Genomics, and School of Life Sciences, Peking University, Beijing 100871, China. Electronic address: zemin@pku.edu.cn.

Abstract

Clustering is a prevalent means of analyzing single cell RNA sequencing (scRNA-seq) data, but the rapidly expanding data volume can make this process computationally challenging. New methods for both accurate and efficient clustering are urgently needed. Here we propose Spearman subsampling-clustering-classification (SSCC), a new clustering framework for large-scale scRNA-seq data based on random projection and feature construction. SSCC greatly improves the clustering accuracy, robustness, and computational efficiency of various state-of-the-art algorithms benchmarked on multiple real datasets. On a dataset of 68,578 human blood cells, SSCC achieved a 20% improvement in clustering accuracy and a 50-fold acceleration while consuming only 66% of the memory used by the widely used software package SC3. Compared to k-means, the accuracy improvement of SSCC can reach 3-fold. An R implementation of SSCC is available at https://github.com/Japrin/sscClust.
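The subsampling-clustering-classification idea named in the abstract can be illustrated with a minimal Python sketch. This is not the sscClust implementation or its API. The function names (`rank_rows`, `spearman_features`, `nearest`, `kmeans`, `sscc_sketch`) are hypothetical. The plain k-means step and the nearest-centroid classifier are stand-ins chosen for brevity, not the algorithms benchmarked in the paper. The sketch assumes a dense cells-by-genes matrix with few tied values.

```python
import numpy as np

def rank_rows(x):
    # Ordinal ranks within each row (cell); ties are broken arbitrarily,
    # which is adequate for a sketch on continuous-valued data.
    return np.argsort(np.argsort(x, axis=1), axis=1).astype(float)

def spearman_features(expr, ref):
    # Spearman correlation of every cell in `expr` with every reference cell
    # in `ref`, computed as the Pearson correlation of per-cell ranks.
    a, b = rank_rows(expr), rank_rows(ref)
    a -= a.mean(axis=1, keepdims=True)
    b -= b.mean(axis=1, keepdims=True)
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T  # shape: (n_cells, n_reference_cells)

def nearest(x, centers):
    # Index of the closest center (squared Euclidean distance) for each row.
    d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(x, k, rng, n_iter=20):
    # Stand-in clustering step: k-means with farthest-first initialization.
    centers = [x[rng.integers(len(x))]]
    for _ in range(k - 1):
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = nearest(x, centers)
        centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return centers

def sscc_sketch(expr, k, n_sub, seed=0):
    # 1. Subsample: draw n_sub cells at random to act as references.
    # 2. Feature construction: describe every cell by its Spearman
    #    correlations with the subsampled cells.
    # 3. Clustering: cluster only the subsampled cells.
    # 4. Classification: assign every cell to the nearest cluster centroid
    #    (a stand-in for a trained classifier).
    rng = np.random.default_rng(seed)
    sub = rng.choice(expr.shape[0], size=n_sub, replace=False)
    feats = spearman_features(expr, expr[sub])
    centers = kmeans(feats[sub], k, rng)
    return nearest(feats, centers)
```

Under these assumptions, calling `sscc_sketch(expr, k=3, n_sub=60)` on a cells-by-genes array returns one integer cluster label per cell. Only the `n_sub` subsampled cells enter the clustering step, which is where the time and memory savings of a subsampling design would come from.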

Keywords: Classification; Clustering; RNA-seq; Single cell; Subsampling.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Animals
Cluster Analysis
Computational Biology / methods*
Databases as Topic
Gene Expression Profiling / methods
Humans
Mice
Sequence Analysis, RNA*
Single-Cell Analysis*
Software*
Statistics, Nonparametric*