Identification of protein coding regions in RNA transcripts

Shiyuyun Tang; Alexandre Lomsadze; Mark Borodovsky

doi:10.1093/nar/gkv227

Nucleic Acids Res. 2015 Jul 13;43(12):e78. doi: 10.1093/nar/gkv227. Epub 2015 Apr 13.

Authors

Shiyuyun Tang¹, Alexandre Lomsadze², Mark Borodovsky³

Affiliations

¹ School of Biology, Georgia Institute of Technology, Atlanta, GA 30332, USA.
² Joint Georgia Tech and Emory Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA.
³ Joint Georgia Tech and Emory Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA; School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA; Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, GA 30332, USA; Department of Biological and Medical Physics, Moscow Institute of Physics and Technology, Moscow, Russia. borodovsky@gatech.edu

Abstract

Massively parallel sequencing of RNA transcripts by next-generation technology (RNA-Seq) generates critically important data for eukaryotic gene discovery. Gene finding in transcripts can be done by statistical (alignment-free) as well as by alignment-based methods. We describe a new tool, GeneMarkS-T, for ab initio identification of protein-coding regions in RNA transcripts. The algorithm parameters are estimated by unsupervised training, which makes manual curation of training sets unnecessary. We demonstrate that (i) the unsupervised training is robust with respect to the presence of transcript assembly errors and (ii) the accuracy of GeneMarkS-T in identifying protein-coding regions and, particularly, in predicting translation initiation sites in modelled as well as in assembled transcripts compares favourably to other existing methods.
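To make the task in the abstract concrete, the sketch below scans a single transcript for open reading frames on the forward strand. It is a naive baseline for illustration only and is not the GeneMarkS-T algorithm, which relies on statistical models whose parameters are estimated by unsupervised training. The function names (`find_orfs`, `longest_orf`), the `min_codons` threshold, and the choice of ATG as the only start codon are assumptions introduced here, not details taken from the paper.

```python
from typing import List, Optional, Tuple

# Standard-code stop codons and the canonical start codon (assumed here).
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(transcript: str, min_codons: int = 2) -> List[Tuple[int, int, int]]:
    """Return (frame, start, end) for each ATG-to-stop ORF on the forward strand.

    start is the 0-based index of the A in ATG; end is exclusive and
    includes the stop codon. Only complete ORFs are reported.
    """
    seq = transcript.upper().replace("U", "T")  # accept RNA or DNA letters
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first in-frame ATG after the previous stop
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None  # resume the search after this stop
    return orfs


def longest_orf(transcript: str) -> Optional[Tuple[int, int, int]]:
    """Return the longest complete ORF, or None if the transcript has none."""
    orfs = find_orfs(transcript)
    return max(orfs, key=lambda o: o[2] - o[1]) if orfs else None
```

Picking the first in-frame ATG is exactly the kind of heuristic that can misplace a translation initiation site, and this sketch also ignores ORFs truncated by incomplete or misassembled transcripts. Those are the cases where a trained statistical model of the kind the abstract describes would be expected to matter.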

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Algorithms
Animals
Arabidopsis / genetics
Drosophila melanogaster / genetics
Gene Expression Profiling*
Genes
High-Throughput Nucleotide Sequencing / methods*
Mice
Open Reading Frames*
Peptide Chain Initiation, Translational
RNA, Messenger / chemistry
Schizosaccharomyces / genetics
Sequence Analysis, RNA / methods*
Software*

Substances

RNA, Messenger

Grants and funding

HG000783/HG/NHGRI NIH HHS/United States