Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree

David Dylus; Adrian Altenhoff; Sina Majidian; Fritz J Sedlazeck; Christophe Dessimoz

doi:10.1038/s41587-023-01753-4

Nat Biotechnol. 2024 Jan;42(1):139-147. doi: 10.1038/s41587-023-01753-4. Epub 2023 Apr 20.

Authors

David Dylus^{1,2,3}, Adrian Altenhoff^{2,4}, Sina Majidian^{1,2}, Fritz J Sedlazeck^{5,6}, Christophe Dessimoz^{7,8,9,10}

Affiliations

¹ Department of Computational Biology, University of Lausanne, Lausanne, Switzerland.
² SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland.
³ F. Hoffmann-La Roche Ltd, Immunology, Infectious Disease, and Ophthalmology (I2O), Roche Pharmaceutical Research and Early Development (pRED), Basel, Switzerland.
⁴ Department of Computer Science, ETH, Zurich, Switzerland.
⁵ Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA. Fritz.Sedlazeck@bcm.edu.
⁶ Department of Computer Science, Rice University, Houston, TX, USA. Fritz.Sedlazeck@bcm.edu.
⁷ Department of Computational Biology, University of Lausanne, Lausanne, Switzerland. Christophe.Dessimoz@unil.ch.
⁸ SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland. Christophe.Dessimoz@unil.ch.
⁹ Department of Computer Science, University College London, London, UK. Christophe.Dessimoz@unil.ch.
¹⁰ Centre for Life's Origins and Evolution, Department of Genetics, Evolution and Environment, University College London, London, UK. Christophe.Dessimoz@unil.ch.

Abstract

Current methods for inference of phylogenetic trees require running complex pipelines at substantial computational and labor costs, with additional constraints in sequencing coverage, assembly and annotation quality, especially for large datasets. To overcome these challenges, we present Read2Tree, which directly processes raw sequencing reads into groups of corresponding genes and bypasses traditional steps in phylogeny inference, such as genome assembly, annotation and all-versus-all sequence comparisons, while retaining accuracy. In a benchmark encompassing a broad variety of datasets, Read2Tree is 10 to 100 times faster than assembly-based approaches and in most cases more accurate, the exception being when sequencing coverage is high and reference species very distant. Here, to illustrate the broad applicability of the tool, we reconstruct a yeast tree of life of 435 species spanning 590 million years of evolution. We also apply Read2Tree to >10,000 Coronaviridae samples, accurately classifying highly diverse animal samples and near-identical severe acute respiratory syndrome coronavirus 2 sequences on a single tree. The speed, accuracy and versatility of Read2Tree enable comparative genomics at scale.

MeSH terms

Animals
Genomics* / methods
Phylogeny
Sequence Analysis
