SWIFT-scalable clustering for automated identification of rare cell populations in large, high-dimensional flow cytometry datasets, part 1: algorithm design

Iftekhar Naim; Suprakash Datta; Jonathan Rebhahn; James S Cavenaugh; Tim R Mosmann; Gaurav Sharma

doi:10.1002/cyto.a.22446

Cytometry A. 2014 May;85(5):408-21. doi: 10.1002/cyto.a.22446. Epub 2014 Feb 14.

Authors

Iftekhar Naim¹, Suprakash Datta, Jonathan Rebhahn, James S Cavenaugh, Tim R Mosmann, Gaurav Sharma

Affiliation

¹ Department of Computer Science, University of Rochester, Rochester, New York.

Abstract

We present a model-based clustering method, SWIFT (Scalable Weighted Iterative Flow-clustering Technique), for digesting large, high-dimensional datasets obtained via modern flow cytometry into more compact representations that are well suited for further automated or manual analysis. Key attributes of the method include the following: (a) the analysis is conducted in the multidimensional space, retaining the semantics of the data, (b) an iterative weighted sampling procedure is used to maintain modest computational complexity and to retain discrimination of extremely small subpopulations (hundreds of cells from datasets containing tens of millions), and (c) a splitting and merging procedure is incorporated in the algorithm to preserve distinguishability between biologically distinct populations, while still providing a significant compaction relative to the original data. This article presents a detailed algorithmic description of SWIFT, outlining the application-driven motivations for the different design choices, a discussion of the computational complexity of the different steps, and results obtained with SWIFT for synthetic data and relatively simple experimental data that allow validation of the desirable attributes. A companion paper (Part 2) highlights the use of SWIFT, in combination with additional computational tools, for more challenging biological problems.

Keywords: Gaussian mixture models; automated multivariate clustering; ground truth data; rare subpopulation detection; weighted sampling.
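
The abstract does not spell out the sampling scheme, so the following is only a toy sketch of the general idea behind weighted sampling for rare-population detection, not the SWIFT implementation. It assumes a one-dimensional dataset, a single already-fitted dominant Gaussian component (`fixed`), and a broad catch-all component (`background`); these names, the parameter values, and the choice of weighting each cell by the posterior probability that it is not explained by the fixed component are all illustrative assumptions.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def resampling_weight(x, fixed, background):
    """Posterior probability that x is NOT explained by the fixed component.

    `fixed` and `background` are hypothetical (mixing_weight, mean, std) triples.
    """
    p_fixed = fixed[0] * gaussian_pdf(x, fixed[1], fixed[2])
    p_background = background[0] * gaussian_pdf(x, background[1], background[2])
    return p_background / (p_fixed + p_background)


# Synthetic data (assumed for illustration): 10,000 common cells, 100 rare cells.
rng = random.Random(0)
cells = [(rng.gauss(0.0, 1.0), "common") for _ in range(10_000)]
cells += [(rng.gauss(8.0, 0.5), "rare") for _ in range(100)]

# Assumed model state: the dominant population has already been fitted and fixed.
fixed = (0.99, 0.0, 1.0)
background = (0.01, 4.0, 5.0)

weights = [resampling_weight(x, fixed, background) for x, _ in cells]

# Fraction of rare cells under uniform sampling vs. the expected fraction
# under weighted sampling (share of the total weight held by rare cells).
uniform_rare_fraction = sum(label == "rare" for _, label in cells) / len(cells)
rare_weight = sum(w for w, (_, label) in zip(weights, cells) if label == "rare")
weighted_rare_fraction = rare_weight / sum(weights)

# Draw a weighted subsample (with replacement) for the next fitting round.
subsample = rng.choices(cells, weights=weights, k=500)
rare_in_subsample = sum(label == "rare" for _, label in subsample)

print(f"uniform rare fraction:  {uniform_rare_fraction:.4f}")
print(f"weighted rare fraction: {weighted_rare_fraction:.4f}")
print(f"rare cells in weighted subsample of 500: {rare_in_subsample}")
```

Under these assumptions, a uniform subsample would contain rare cells at roughly their original 1% frequency, whereas the weighted subsample is dominated by cells the fixed component explains poorly, which is what lets a subsequent fit on a small subsample still resolve the rare population.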

Publication types

Research Support, U.S. Gov't, P.H.S.

MeSH terms

Algorithms*
Cell Lineage
Cluster Analysis*
Computational Biology
Flow Cytometry / methods*
Models, Theoretical
