Learning the language of life

Research output: Contribution to conference, Conference abstract for conference, Research, peer-review

Abstract

Machine learning models trained on protein data tend to underperform because annotated data are scarce. Recent research has shown that language models (LMs) trained on unlabeled protein sequences can improve performance on protein prediction tasks. However, protein LMs have not been studied in depth, and their full capabilities are yet to be explored. A protein LM can be defined as a model that predicts the next amino acid given the amino acids that precede it. In this research, we focus on assembling a high-quality protein dataset suitable for protein language modelling and on training a recurrent neural language model on this dataset. We show that the protein LM learns to predict the next amino acid in a sequence and creates context-dependent amino acid representations. In addition, our protein LM can estimate the probability of a protein sequence, which allows it to discriminate between real and fake proteins. Finally, we show that our model can also generate new protein sequences with features similar to those of real proteins.
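The definition of a protein LM given in the abstract (predict the next amino acid from its preceding context, then score a whole sequence) can be illustrated with a minimal sketch. This is not the recurrent neural model described in the abstract: it swaps in a simple bigram count model that conditions only on the single previous residue, so that the chain-rule scoring is easy to follow. The function names, the start symbol, the smoothing constant, and the toy training strings are all assumptions made for illustration; the strings are made up and are not real protein data.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
START = "^"  # hypothetical start-of-sequence symbol


def train_bigram_lm(sequences, alpha=1.0):
    """Estimate p(next residue | previous residue) with add-alpha smoothing.

    Assumes sequences contain only the 20 standard residues.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        prev = START
        for aa in seq:
            counts[prev][aa] += 1.0
            prev = aa
    probs = {}
    for prev in START + AMINO_ACIDS:
        total = sum(counts[prev][aa] for aa in AMINO_ACIDS) + alpha * len(AMINO_ACIDS)
        probs[prev] = {aa: (counts[prev][aa] + alpha) / total for aa in AMINO_ACIDS}
    return probs


def sequence_log_prob(probs, seq):
    """Chain rule with a one-residue context: log p(x) = sum_i log p(x_i | x_{i-1})."""
    prev, total = START, 0.0
    for aa in seq:
        total += math.log(probs[prev][aa])
        prev = aa
    return total


# Toy, made-up training strings (illustrative only, not real proteins).
toy_training = ["MKTAYIAKQR", "MKVLAAGIVG", "MKTLLLTLVV"]
probs = train_bigram_lm(toy_training)

seen_like = sequence_log_prob(probs, "MKTAYIAKQR")
implausible = sequence_log_prob(probs, "WWWWWWWWWW")
print(seen_like > implausible)  # the sequence resembling the training data scores higher
```

A recurrent model replaces the one-residue context with a hidden state that summarizes the entire preceding sequence, but the scoring step is the same: sum the log-probabilities of each residue given its context, then compare the totals across candidate sequences.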
Original language: English
Publication date: 2019
Number of pages: 1
Publication status: Published - 2019
Event: ISMB/ECCB 27th Conference on Intelligent Systems for Molecular Biology and the 18th European Conference on Computational Biology - Basel, Switzerland
Duration: 21 Jul 2019 to 25 Jul 2019
https://www.iscb.org/ismbeccb2019

Conference

Conference: ISMB/ECCB 27th Conference on Intelligent Systems for Molecular Biology and the 18th European Conference on Computational Biology
Country: Switzerland
City: Basel
Period: 21/07/2019 to 25/07/2019

Cite this

Almagro Armenteros, J. J., Johansen, A. R., Nielsen, H., & Winther, O. (2019). Learning the language of life. Abstract from ISMB/ECCB 27th Conference on Intelligent Systems for Molecular Biology and the 18th European Conference on Computational Biology, Basel, Switzerland.