TY - JOUR
T1 - Probabilistic partial least squares model
T2 - Identifiability, estimation and application
AU - el Bouhaddani, Said
AU - Uh, Hae Won
AU - Hayward, Caroline
AU - Jongbloed, Geurt
AU - Houwing-Duistermaat, Jeanine
N1 - Accepted Author Manuscript
PY - 2018
Y1 - 2018
AB - With a rapid increase in volume and complexity of data sets, there is a need for methods that can extract useful information, for example the relationship between two data sets measured for the same persons. The Partial Least Squares (PLS) method can be used for this dimension reduction task. Within life sciences, results across studies are compared and combined. Therefore, parameters need to be identifiable, which is not the case for PLS. In addition, PLS is an algorithm, while epidemiological study designs are often outcome-dependent and methods to analyze such data require a probabilistic formulation. Moreover, a probabilistic model provides a statistical framework for inference. To address these issues, we develop Probabilistic PLS (PPLS). We derive maximum likelihood estimators that satisfy the identifiability conditions by using an EM algorithm with a constrained optimization in the M step. We show that the PPLS parameters are identifiable up to sign. A simulation study is conducted to study the performance of PPLS compared to existing methods. The PPLS estimates performed well in various scenarios, even in high dimensions. Most notably, the estimates seem to be robust against departures from normality. To illustrate our method, we applied it to IgG glycan data from two cohorts. Our PPLS model provided insight as well as interpretable results across the two cohorts.
KW - Dimension reduction
KW - EM algorithm
KW - Identifiability
KW - Inference
KW - Probabilistic partial least squares
UR - http://www.scopus.com/inward/record.url?scp=85048803529&partnerID=8YFLogxK
UR - http://resolver.tudelft.nl/uuid:eb1256ff-9878-4c0e-a94c-6da03f4bfed1
U2 - 10.1016/j.jmva.2018.05.009
DO - 10.1016/j.jmva.2018.05.009
M3 - Article
AN - SCOPUS:85048803529
SN - 0047-259X
VL - 167
SP - 331
EP - 346
JO - Journal of Multivariate Analysis
JF - Journal of Multivariate Analysis
ER -