Comparison of Methods for Biological Sequence Clustering

Ze-Gang Wei; Xu Chen; Xiao-Dan Zhang; Hao Zhang; Xing-Guo Fan; Hong-Yan Gao; Fei Liu; Yu Qian

doi:10.1109/TCBB.2023.3253138

IEEE/ACM Trans Comput Biol Bioinform. 2023 Sep-Oct;20(5):2874-2888. doi: 10.1109/TCBB.2023.3253138. Epub 2023 Oct 9.

PMID: 37028305

Abstract

Recent advances in sequencing technology have considerably promoted genomics research by making high-throughput sequencing economical, and this progress has produced a huge amount of sequencing data. Clustering analysis is a powerful way to study and probe such large-scale sequence data, and numerous clustering methods have been developed in the last decade. Although many comparison studies have been published, we noticed that they have two main limitations: only traditional alignment-based clustering methods are compared, and the evaluation metrics rely heavily on labeled sequence data. In this study, we present a comprehensive benchmark of sequence clustering methods. Specifically, i) alignment-based clustering algorithms, including classical methods (e.g., CD-HIT, UCLUST, VSEARCH) and recently proposed methods (e.g., MMseqs2, Linclust, edClust), are assessed; ii) two alignment-free methods (LZW-Kernel and Mash) are included for comparison with the alignment-based methods; and iii) evaluation measures based on the true labels (supervised metrics) and on the input data itself (unsupervised metrics) are applied to quantify the clustering results. The aims of this study are to help biological analysts choose a reasonable clustering algorithm for processing their collected sequences and, furthermore, to motivate algorithm designers to develop more efficient sequence clustering approaches.
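
The distinction between supervised and unsupervised evaluation drawn in the abstract can be illustrated with a minimal Python sketch. The abstract does not name the specific metrics used in the study, so the two measures below are assumptions chosen for illustration only: cluster purity stands in for a label-based (supervised) metric, and mean within-cluster k-mer Jaccard similarity stands in for a data-based (unsupervised) one. The function names, the toy sequences, and the choice of k = 3 are likewise hypothetical.

```python
from collections import Counter
from itertools import combinations


def kmer_set(seq, k=3):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def purity(cluster_ids, true_labels):
    """Supervised: fraction of sequences carrying their cluster's majority label."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    majority = sum(max(c.values()) for c in by_cluster.values())
    return majority / len(cluster_ids)


def mean_intra_similarity(seqs, cluster_ids, k=3):
    """Unsupervised: mean k-mer Jaccard similarity over same-cluster pairs."""
    sets = [kmer_set(s, k) for s in seqs]
    sims = [jaccard(sets[i], sets[j])
            for i, j in combinations(range(len(seqs)), 2)
            if cluster_ids[i] == cluster_ids[j]]
    return sum(sims) / len(sims) if sims else 1.0


# Toy data (hypothetical): four short sequences in two clusters.
seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"]
clusters = [0, 0, 1, 1]
labels = ["a", "a", "b", "a"]  # true labels, needed only for purity

print(purity(clusters, labels))
print(mean_intra_similarity(seqs, clusters))
```

Purity needs the true labels, whereas the intra-cluster similarity is computed from the sequences and the cluster assignment alone, which is what makes the second kind of measure usable on unlabeled data.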

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Cluster Analysis
Genomics*
High-Throughput Nucleotide Sequencing