Metric learning on expression data for gene function prediction

Stavros Makrodimitris; Marcel J.T. Reinders; Roeland C.H.J. van Ham

doi:10.1093/bioinformatics/btz731

Pattern Recognition and Bioinformatics

Research output: Contribution to journal › Article › Scientific › peer-review

13 Citations (Scopus)

69 Downloads (Pure)

Abstract

Motivation: Co-expression of two genes across different conditions is indicative of their involvement in the same biological process. However, when using RNA-Seq datasets with many experimental conditions from diverse sources, only a subset of the experimental conditions is expected to be relevant for finding genes related to a particular Gene Ontology (GO) term. Therefore, we hypothesize that when the purpose is to find similarly functioning genes, the co-expression of genes should not be determined on all samples but only on those samples informative for the GO term of interest.

Results: To address this, we developed Metric Learning for Co-expression (MLC), a fast algorithm that assigns a GO-term-specific weight to each expression sample. The goal is to obtain a weighted co-expression measure that is more suitable than the unweighted Pearson correlation for applying Guilt-By-Association-based function predictions. More specifically, if two genes are annotated with a given GO term, MLC tries to maximize their weighted co-expression and, in addition, if one of them is not annotated with that term, the weighted co-expression is minimized. Our experiments on publicly available Arabidopsis thaliana RNA-Seq data demonstrate that MLC outperforms standard Pearson correlation in term-centric performance. Moreover, our method is particularly good at more specific terms, which are the most interesting. Finally, by observing the sample weights for a particular GO term, one can identify which experiments are important for learning that term and potentially identify novel conditions that are relevant, as demonstrated by experiments in both A. thaliana and Pseudomonas aeruginosa.
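To make the idea of a sample-weighted co-expression measure concrete, the following is a minimal sketch of one common definition of a weighted Pearson correlation. It is an illustration only: the abstract does not state that MLC uses exactly this form, and the function name `weighted_pearson` and the example weights are assumptions. Learning the weights from GO annotations, which is the core of MLC, is not shown.

```python
import math


def weighted_pearson(x, y, w):
    """Weighted Pearson correlation of two expression profiles.

    x, y: expression values of two genes over the same samples.
    w:    non-negative per-sample weights (hypothetical here; in the
          abstract's setting they would be specific to one GO term).
    """
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    total = sum(w)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    w = [wi / total for wi in w]  # normalise so the weights sum to 1

    # Weighted means of both profiles.
    mx = sum(wi * xi for wi, xi in zip(w, x))
    my = sum(wi * yi for wi, yi in zip(w, y))

    # Weighted covariance and variances.
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    # Note: a profile that is constant on the weighted samples has zero
    # variance, and this sketch does not handle that case.
    return cov / math.sqrt(var_x * var_y)


# Two toy profiles that agree on the first three samples only.
gene_a = [1.0, 2.0, 3.0, 10.0]
gene_b = [2.0, 4.0, 6.0, -5.0]

r_uniform = weighted_pearson(gene_a, gene_b, [1, 1, 1, 1])   # negative
r_weighted = weighted_pearson(gene_a, gene_b, [1, 1, 1, 0])  # close to 1
```

With uniform weights the function reduces to the standard Pearson correlation. Setting the weight of the fourth sample to zero removes its influence, which is the effect the abstract describes for samples that are uninformative for a given GO term.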

Original language	English
Pages (from-to)	1182-1190
Number of pages	9
Journal	Bioinformatics (Oxford, England)
Volume	36
Issue number	4
DOIs	https://doi.org/10.1093/bioinformatics/btz731
Publication status	Published - 2020

Access to Document

10.1093/bioinformatics/btz731

btz731: Final published version, 1.09 MB, Licence: CC BY

Cite this

@article{b985dab49a494a46bf0e19c57990717b,
title = "Metric learning on expression data for gene function prediction",
abstract = "Motivation: Co-expression of two genes across different conditions is indicative of their involvement in the same biological process. However, when using RNA-Seq datasets with many experimental conditions from diverse sources, only a subset of the experimental conditions is expected to be relevant for finding genes related to a particular Gene Ontology (GO) term. Therefore, we hypothesize that when the purpose is to find similarly functioning genes, the co-expression of genes should not be determined on all samples but only on those samples informative for the GO term of interest. Results: To address this, we developed Metric Learning for Co-expression (MLC), a fast algorithm that assigns a GO-term-specific weight to each expression sample. The goal is to obtain a weighted co-expression measure that is more suitable than the unweighted Pearson correlation for applying Guilt-By-Association-based function predictions. More specifically, if two genes are annotated with a given GO term, MLC tries to maximize their weighted co-expression and, in addition, if one of them is not annotated with that term, the weighted co-expression is minimized. Our experiments on publicly available Arabidopsis thaliana RNA-Seq data demonstrate that MLC outperforms standard Pearson correlation in term-centric performance. Moreover, our method is particularly good at more specific terms, which are the most interesting. Finally, by observing the sample weights for a particular GO term, one can identify which experiments are important for learning that term and potentially identify novel conditions that are relevant, as demonstrated by experiments in both A. thaliana and Pseudomonas aeruginosa.",
author = "Makrodimitris, Stavros and Reinders, {Marcel J.T.} and {van Ham}, {Roeland C.H.J.}",
year = "2020",
doi = "10.1093/bioinformatics/btz731",
language = "English",
volume = "36",
pages = "1182--1190",
journal = "Bioinformatics (Oxford, England)",
issn = "1367-4811",
publisher = "Oxford University Press",
number = "4",
}

TY  - JOUR
T1  - Metric learning on expression data for gene function prediction
AU  - Makrodimitris, Stavros
AU  - Reinders, Marcel J.T.
AU  - van Ham, Roeland C.H.J.
PY  - 2020
Y1  - 2020
AB  - Motivation: Co-expression of two genes across different conditions is indicative of their involvement in the same biological process. However, when using RNA-Seq datasets with many experimental conditions from diverse sources, only a subset of the experimental conditions is expected to be relevant for finding genes related to a particular Gene Ontology (GO) term. Therefore, we hypothesize that when the purpose is to find similarly functioning genes, the co-expression of genes should not be determined on all samples but only on those samples informative for the GO term of interest. Results: To address this, we developed Metric Learning for Co-expression (MLC), a fast algorithm that assigns a GO-term-specific weight to each expression sample. The goal is to obtain a weighted co-expression measure that is more suitable than the unweighted Pearson correlation for applying Guilt-By-Association-based function predictions. More specifically, if two genes are annotated with a given GO term, MLC tries to maximize their weighted co-expression and, in addition, if one of them is not annotated with that term, the weighted co-expression is minimized. Our experiments on publicly available Arabidopsis thaliana RNA-Seq data demonstrate that MLC outperforms standard Pearson correlation in term-centric performance. Moreover, our method is particularly good at more specific terms, which are the most interesting. Finally, by observing the sample weights for a particular GO term, one can identify which experiments are important for learning that term and potentially identify novel conditions that are relevant, as demonstrated by experiments in both A. thaliana and Pseudomonas aeruginosa.
UR  - http://www.scopus.com/inward/record.url?scp=85080841256&partnerID=8YFLogxK
U2  - 10.1093/bioinformatics/btz731
DO  - 10.1093/bioinformatics/btz731
M3  - Article
C2  - 31562759
AN  - SCOPUS:85080841256
SN  - 1367-4811
VL  - 36
SP  - 1182
EP  - 1190
JO  - Bioinformatics (Oxford, England)
JF  - Bioinformatics (Oxford, England)
IS  - 4
ER  - 
