scTab: Scaling cross-tissue single-cell annotation models

Felix Fischer; David S Fischer; Roman Mukhin; Andrey Isaev; Evan Biederstedt; Alexandra-Chloé Villani; Fabian J Theis

doi:10.1038/s41467-024-51059-5

Nat Commun. 2024 Aug 4;15(1):6611. doi: 10.1038/s41467-024-51059-5.

Authors

Felix Fischer^{1,2}, David S Fischer^{1,3}, Roman Mukhin⁴, Andrey Isaev⁴, Evan Biederstedt^{5,6,7,8}, Alexandra-Chloé Villani^{6,7,8,9}, Fabian J Theis^{10,11,12}

Affiliations

¹ Department of Computational Health, Institute of Computational Biology, Helmholtz, Munich, Germany.
² School of Computing, Information and Technology, Technical University of Munich, Munich, Germany.
³ Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA.
⁴ eBook Applications LLC, Boston, MA, 02467, USA.
⁵ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, 02115, USA.
⁶ Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA.
⁷ Center for Immunology and Inflammatory Diseases, Massachusetts General Hospital, Charlestown, MA, 02129, USA.
⁸ Krantz Family Center for Cancer Research, Massachusetts General Hospital, Boston, MA, 02114, USA.
⁹ Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA.
¹⁰ Department of Computational Health, Institute of Computational Biology, Helmholtz, Munich, Germany. fabian.theis@helmholtz-munich.de.
¹¹ School of Computing, Information and Technology, Technical University of Munich, Munich, Germany. fabian.theis@helmholtz-munich.de.
¹² TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany. fabian.theis@helmholtz-munich.de.

Abstract

Identifying cellular identities is a key use case in single-cell transcriptomics. While machine learning has been leveraged to automate cell annotation predictions for some time, there has been little progress in scaling neural networks to large datasets and in constructing models that generalize well across diverse tissues. Here, we propose scTab, an automated cell type prediction model specific to tabular data, and train it using a novel data augmentation scheme across a large corpus of single-cell RNA-seq observations (22.2 million cells). In this context, we show that cross-tissue annotation requires nonlinear models and that the performance of scTab scales both in terms of training dataset size and model size. Additionally, we show that the proposed data augmentation scheme improves model generalization. In summary, we introduce a de novo cell type prediction model for single-cell RNA-seq data that can be trained across a large-scale collection of curated datasets and demonstrate the benefits of using deep learning methods in this paradigm.

MeSH terms

Algorithms
Animals
Computational Biology / methods
Deep Learning
Gene Expression Profiling / methods
Humans
Machine Learning
Neural Networks, Computer
RNA-Seq / methods
Sequence Analysis, RNA / methods
Single-Cell Analysis* / methods
Transcriptome

Grants and funding

DP2 CA247831/CA/NCI NIH HHS/United States