The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction

Irene van den Bent; Stavros Makrodimitris; Marcel Reinders

doi:10.1177/11769343211062608

Corresponding author: Marcel Reinders

Pattern Recognition and Bioinformatics

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from many unlabeled sequences. Such embeddings incorporate contextual information of amino acids, thereby modeling the underlying principles of protein sequences insensitive to the context of species. We used an existing pre-trained protein embedding method and subjected its molecular function prediction performance to detailed characterization, first to advance the understanding of protein language models, and second to determine areas of improvement. Then, we applied the model in a transfer learning task by training a function predictor based on the embeddings of annotated protein sequences of one training species and making predictions on the proteins of several test species with varying evolutionary distance. We show that this approach successfully generalizes knowledge about protein function from one eukaryotic species to various other species, outperforming both an alignment-based and a supervised-learning-based baseline. This implies that such a method could be effective for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms.

Original language	English
Number of pages	15
Journal	Evolutionary Bioinformatics
Volume	17
DOIs	https://doi.org/10.1177/11769343211062608
Publication status	Published - 2021

Keywords

annotating evolutionary distant proteins
protein embedding
Protein function prediction
protein language models
transfer learning

Access to Document

10.1177/11769343211062608

11769343211062608: Final published version, 5.03 MB, Licence: CC BY

Cite this

@article{c76cd811d4d94d02af8ccfe850394462,

title = "The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction",

abstract = "Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from many unlabeled sequences. Such embeddings incorporate contextual information of amino acids, thereby modeling the underlying principles of protein sequences insensitive to the context of species. We used an existing pre-trained protein embedding method and subjected its molecular function prediction performance to detailed characterization, first to advance the understanding of protein language models, and second to determine areas of improvement. Then, we applied the model in a transfer learning task by training a function predictor based on the embeddings of annotated protein sequences of one training species and making predictions on the proteins of several test species with varying evolutionary distance. We show that this approach successfully generalizes knowledge about protein function from one eukaryotic species to various other species, outperforming both an alignment-based and a supervised-learning-based baseline. This implies that such a method could be effective for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms.",

keywords = "annotating evolutionary distant proteins, protein embedding, Protein function prediction, protein language models, transfer learning",

author = "{van den Bent}, Irene and Makrodimitris, Stavros and Reinders, Marcel",

year = "2021",

doi = "10.1177/11769343211062608",

language = "English",

volume = "17",

journal = "Evolutionary Bioinformatics",

issn = "1176-9343",

publisher = "SAGE Publishing",

}

TY - JOUR

T1 - The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction

AU - van den Bent, Irene

AU - Makrodimitris, Stavros

AU - Reinders, Marcel

PY - 2021

Y1 - 2021

N2 - Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from many unlabeled sequences. Such embeddings incorporate contextual information of amino acids, thereby modeling the underlying principles of protein sequences insensitive to the context of species. We used an existing pre-trained protein embedding method and subjected its molecular function prediction performance to detailed characterization, first to advance the understanding of protein language models, and second to determine areas of improvement. Then, we applied the model in a transfer learning task by training a function predictor based on the embeddings of annotated protein sequences of one training species and making predictions on the proteins of several test species with varying evolutionary distance. We show that this approach successfully generalizes knowledge about protein function from one eukaryotic species to various other species, outperforming both an alignment-based and a supervised-learning-based baseline. This implies that such a method could be effective for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms.

KW - annotating evolutionary distant proteins

KW - protein embedding

KW - Protein function prediction

KW - protein language models

KW - transfer learning

UR - http://www.scopus.com/inward/record.url?scp=85120520251&partnerID=8YFLogxK

U2 - 10.1177/11769343211062608

DO - 10.1177/11769343211062608

M3 - Article

AN - SCOPUS:85120520251

SN - 1176-9343

VL - 17

JO - Evolutionary Bioinformatics

JF - Evolutionary Bioinformatics

ER -
