TIS Transformer: remapping the human proteome using deep learning

Jim Clauwaert; Zahra McVey; Ramneek Gupta; Gerben Menschaert

doi:10.1093/nargab/lqad021

NAR Genom Bioinform. 2023 Mar 3;5(1):lqad021. doi: 10.1093/nargab/lqad021. eCollection 2023 Mar.

Authors

Jim Clauwaert¹, Zahra McVey², Ramneek Gupta², Gerben Menschaert¹

Affiliations

¹ Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Oost-Vlaanderen 9000, Belgium.
² Novo Nordisk Research Centre Oxford, Novo Nordisk Ltd., Crawley, South East England, RH6 0PA, UK.

Abstract

The correct mapping of the proteome is an important step towards advancing our understanding of biological systems and cellular mechanisms. Methods that provide better mappings can fuel important processes such as drug discovery and disease understanding. Currently, true determination of translation initiation sites is primarily achieved by in vivo experiments. Here, we propose TIS Transformer, a deep learning model for the determination of translation start sites solely utilizing the information embedded in the transcript nucleotide sequence. The method is built upon deep learning techniques first designed for natural language processing. We show this approach to be best suited for learning the semantics of translation, outperforming previous approaches by a large margin. We demonstrate that limitations in the model performance are primarily due to the presence of low-quality annotations against which the model is evaluated. Advantages of the method are its ability to detect key features of the translation process and multiple coding sequences on a transcript. These include micropeptides encoded by short Open Reading Frames, either alongside a canonical coding sequence or within long non-coding RNAs. To demonstrate the use of our methods, we applied TIS Transformer to remap the full human proteome.
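
To make the notion of several coding sequences on one transcript concrete, the following is a minimal sketch of how a candidate start position defines an Open Reading Frame: read codons from the start site until the first in-frame stop codon. This is not the TIS Transformer model and makes no claim about its implementation. It is a naive ATG scan that only illustrates the objects involved; the function name `find_orfs`, its parameters, and the example sequence are hypothetical.

```python
# Illustrative only: a naive ATG scan, NOT the TIS Transformer model.
# All names here (find_orfs, min_codons, the toy sequence) are assumptions.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(transcript, start_codon="ATG", min_codons=1):
    """List ORFs as (start, end, n_codons) tuples.

    start is the 0-based index of the first start-codon nucleotide,
    end is the index just past the stop codon, and n_codons counts
    the codons from the start codon up to, but excluding, the stop.
    Start sites without an in-frame stop codon are skipped.
    """
    # Accept RNA or DNA input, upper or lower case.
    seq = transcript.upper().replace("U", "T")
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != start_codon:
            continue
        # Walk downstream in the reading frame set by this start site.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                n_codons = (pos - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, pos + 3, n_codons))
                break
    return orfs


if __name__ == "__main__":
    # Toy transcript carrying two short ORFs.
    for orf in find_orfs("CCATGAAATAGGGATGCCCTGA"):
        print(orf)
```

A scan like this returns every ATG with an in-frame stop, most of which are not translated in practice. Deciding which candidate positions are genuine translation initiation sites is the prediction task the abstract describes.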