Petabase-Scale Homology Search for Structure Prediction

Sewon Lee; Gyuri Kim; Eli Levy Karin; Milot Mirdita; Sukhwan Park; Rayan Chikhi; Artem Babaian; Andriy Kryshtafovych; Martin Steinegger

doi:10.1101/cshperspect.a041465

Cold Spring Harb Perspect Biol. 2024 May 2;16(5):a041465. doi: 10.1101/cshperspect.a041465.

Authors

Sewon Lee^#¹, Gyuri Kim^#¹, Eli Levy Karin², Milot Mirdita¹, Sukhwan Park³, Rayan Chikhi⁴, Artem Babaian⁵,⁶, Andriy Kryshtafovych⁷, Martin Steinegger⁸,³,⁹,¹⁰

Affiliations

¹ School of Biological Sciences, Seoul National University, Gwanak-gu, Seoul 08826, South Korea.
² ELKMO, Copenhagen 2720, Denmark.
³ Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, South Korea.
⁴ Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, 75015 Paris, France.
⁵ Department of Molecular Genetics, University of Toronto, Toronto, Ontario M5S 1A8, Canada.
⁶ Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario M5S 3E1, Canada.
⁷ Genome Center, University of California, Davis, California 95616, USA.
⁸ School of Biological Sciences, Seoul National University, Gwanak-gu, Seoul 08826, South Korea. Email: martin.steinegger@snu.ac.kr
⁹ Artificial Intelligence Institute, Seoul National University, Seoul 08826, South Korea.
¹⁰ Institute of Molecular Biology and Genetics, Seoul National University, Seoul 08826, South Korea.

^# Contributed equally.

PMID: 38316555
PMCID: PMC11065157 (available on 2026-05-01)
DOI: 10.1101/cshperspect.a041465

Abstract

The recent CASP15 competition highlighted the critical role of multiple sequence alignments (MSAs) in protein structure prediction, as demonstrated by the success of the top AlphaFold2-based prediction methods. To push the boundaries of MSA utilization, we conducted a petabase-scale search of the Sequence Read Archive (SRA), resulting in gigabytes of aligned homologs for CASP15 targets. These were merged with default MSAs produced by ColabFold-search and provided to ColabFold-predict. By using SRA data, we achieved highly accurate predictions (GDT_TS > 70) for 66% of the non-easy targets, whereas using ColabFold-search default MSAs scored highly in only 52%. Next, we tested the effect of deep homology search and ColabFold's advanced features, such as more recycles, on prediction accuracy. While SRA homologs were most significant for improving ColabFold's CASP15 ranking from 11th to 3rd place, other strategies contributed too. We analyze these in the context of existing strategies to improve prediction.
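
For readers unfamiliar with the GDT_TS threshold quoted above, the following is a minimal sketch of how such a score can be computed. It is not the authors' or CASP's evaluation code. It assumes the model and reference Cα coordinates are already superimposed and matched residue by residue, whereas the official score searches over many superpositions to maximize the count at each cutoff, so this simplified version can only underestimate the true value. The function name `gdt_ts` and the toy coordinates are illustrative.

```python
import math


def gdt_ts(model_ca, reference_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS for pre-superimposed, residue-matched C-alpha coordinates.

    Assumption: no superposition search is performed here; the official
    score optimizes the superposition separately for each distance cutoff.
    """
    if len(model_ca) != len(reference_ca) or not reference_ca:
        raise ValueError("coordinate lists must be non-empty and of equal length")

    # Per-residue deviation between model and reference, in angstroms.
    distances = [math.dist(m, r) for m, r in zip(model_ca, reference_ca)]

    # Fraction of residues within each cutoff (1, 2, 4, and 8 angstroms).
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]

    # GDT_TS is the mean of those fractions, expressed on a 0-100 scale.
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy example: four residues deviating by 0.5, 1.5, 3.0, and 9.0 angstroms.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
model = [(0.0, 0.5, 0.0), (1.0, 1.5, 0.0), (2.0, 3.0, 0.0), (3.0, 9.0, 0.0)]
score = gdt_ts(model, reference)
print(f"GDT_TS = {score:.2f}")
```

Under this definition, a prediction counts as highly accurate in the abstract's sense when its score exceeds 70.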

Publication types

Review

MeSH terms

Algorithms
Computational Biology* / methods
Protein Conformation
Proteins* / chemistry
Sequence Alignment
Sequence Analysis, Protein / methods
Software

Substances

Proteins

Grants and funding

R01 GM100482/GM/NIGMS NIH HHS/United States