Power law tails in phylogenetic systems

Proc Natl Acad Sci U S A. 2018 Jan 23;115(4):690-695. doi: 10.1073/pnas.1711913115. Epub 2018 Jan 8.

Abstract

Covariance analysis of protein sequence alignments uses coevolving pairs of sequence positions to predict features of protein structure and function. However, current methods ignore the phylogenetic relationships between sequences, potentially corrupting the identification of covarying positions. Here, we use random matrix theory to demonstrate the existence of a power law tail that distinguishes the spectrum of covariance caused by phylogeny from that caused by structural interactions. The power law is essentially independent of the phylogenetic tree topology, depending on just two parameters: the sequence length and the average branch length. We demonstrate that these power law tails are ubiquitous in the large protein sequence alignments used to predict contacts in 3D structure, as predicted by our theory. This suggests that to decouple phylogenetic effects from the interactions between sequence-distal sites that control biological function, it is necessary to remove or down-weight the eigenvectors of the covariance matrix with largest eigenvalues. We confirm that truncating these eigenvectors improves contact prediction.
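
The truncation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `truncate_top_eigenmodes` and the parameter `k` (the number of leading modes to drop) are hypothetical, and the abstract does not state how many modes should be removed or how the covariance matrix is constructed from an alignment. The sketch only assumes a symmetric covariance matrix is already available as a NumPy array.

```python
import numpy as np


def truncate_top_eigenmodes(cov, k):
    """Rebuild a symmetric matrix without its k largest-eigenvalue modes.

    `cov` is assumed to be a symmetric covariance matrix; `k` is a
    hypothetical tuning parameter, not a value taken from the paper.
    """
    cov = np.asarray(cov, dtype=float)
    # eigh is meant for symmetric matrices and returns eigenvalues
    # in ascending order, with eigenvectors as matching columns.
    eigvals, eigvecs = np.linalg.eigh(cov)
    if not 0 <= k <= eigvals.size:
        raise ValueError("k must be between 0 and the matrix dimension")
    keep = eigvals.size - k
    kept_vals = eigvals[:keep]
    kept_vecs = eigvecs[:, :keep]
    # Sum of lambda_i * v_i v_i^T over the retained modes only.
    return (kept_vecs * kept_vals) @ kept_vecs.T


if __name__ == "__main__":
    # Synthetic data standing in for per-position features of an alignment.
    rng = np.random.default_rng(0)
    samples = rng.normal(size=(200, 6))
    cov = np.cov(samples, rowvar=False)
    filtered = truncate_top_eigenmodes(cov, k=2)
    print(np.linalg.eigvalsh(cov))
    print(np.linalg.eigvalsh(filtered))
```

Setting the removed eigenvalues to zero is the hardest form of filtering; the "down-weight" alternative mentioned in the abstract would instead rescale the leading eigenvalues before rebuilding the matrix.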

Keywords: phylogeny; power law; protein; sequence covariance; structure prediction.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Evolution, Molecular
  • Models, Theoretical
  • Multivariate Analysis
  • Phylogeny
  • Proteins / chemistry
  • Sequence Alignment / methods*
  • Sequence Alignment / statistics & numerical data
  • Sequence Analysis, Protein / methods*

Substances

  • Proteins