Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

Alexander Rives; Joshua Meier; Tom Sercu; Siddharth Goyal; Zeming Lin; Jason Liu; Demi Guo; Myle Ott; C Lawrence Zitnick; Jerry Ma; Rob Fergus

doi:10.1073/pnas.2016239118

Proc Natl Acad Sci U S A. 2021 Apr 13;118(15):e2016239118. doi: 10.1073/pnas.2016239118.

Authors

Alexander Rives¹,², Joshua Meier³, Tom Sercu³, Siddharth Goyal³, Zeming Lin², Jason Liu³, Demi Guo⁴, Myle Ott³, C Lawrence Zitnick³, Jerry Ma⁵,⁶, Rob Fergus²

Affiliations

¹ Facebook AI Research, New York, NY 10003; arives@cs.nyu.edu.
² Department of Computer Science, New York University, New York, NY 10012.
³ Facebook AI Research, New York, NY 10003.
⁴ Harvard University, Cambridge, MA 02138.
⁵ Booth School of Business, University of Chicago, Chicago, IL 60637.
⁶ Yale Law School, New Haven, CT 06511.

Abstract

In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.
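The claim that structural information "can be identified by linear projections" can be illustrated with a linear probe: a single linear map fit on top of frozen per-residue representations. The sketch below is not the authors' code. It assumes synthetic stand-in features (random vectors with a class-dependent offset) in place of real model representations, three hypothetical secondary-structure classes, and a least-squares fit to one-hot targets as the probe. The names `fit_linear_probe` and `predict_linear_probe` are illustrative only.

```python
import numpy as np


def fit_linear_probe(features, labels, num_classes):
    """Fit a linear projection (plus bias) from features to one-hot labels by least squares."""
    # Append a constant column so the projection includes a bias term.
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    Y = np.eye(num_classes)[labels]  # one-hot targets, shape (n, num_classes)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W


def predict_linear_probe(W, features):
    """Project features with W and return the highest-scoring class per row."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return np.argmax(X @ W, axis=1)


# Synthetic stand-in for per-residue representations (assumption: real
# features would come from the frozen language model, one vector per residue).
rng = np.random.default_rng(0)
num_residues, dim, num_classes = 600, 32, 3  # e.g. helix / strand / other
labels = rng.integers(0, num_classes, size=num_residues)
class_directions = rng.normal(size=(num_classes, dim))
features = class_directions[labels] + 0.5 * rng.normal(size=(num_residues, dim))

# Fit on the first 400 residues, evaluate on the remaining 200.
train, test = slice(0, 400), slice(400, 600)
W = fit_linear_probe(features[train], labels[train], num_classes)
accuracy = float(np.mean(predict_linear_probe(W, features[test]) == labels[test]))
print(f"held-out accuracy: {accuracy:.2f}")
```

Because the probe has no nonlinearity, high held-out accuracy indicates that the class information is linearly accessible in the features themselves rather than computed by the probe. With real representations, `features` would be replaced by the model's per-residue outputs and `labels` by annotated secondary-structure classes.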

Keywords: deep learning; generative biology; protein language model; representation learning; synthetic biology.

Publication types

Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Amino Acids / chemistry
Protein Conformation
Sequence Analysis, Protein / methods*
Sequence Homology, Amino Acid
Unsupervised Machine Learning*

Substances

Amino Acids