Principal component analysis of binary genomics data

Yipeng Song, Johan A. Westerhuis, Nanne Aben, Magali Michaut, Lodewyk Wessels, Age K. Smilde

Research output: Contribution to journal, Article, Scientific, peer-review

6 Citations (Scopus)


Genome-wide measurements of genetic and epigenetic alterations are generating ever more high-dimensional binary data. Because of the special mathematical characteristics of binary data, it is less obvious how to apply the classical principal component analysis (PCA) model directly to explore low-dimensional structures. Although several PCA alternatives for binary data exist in the psychometric, data analysis and machine learning literature, they are not well known to the bioinformatics community. In this article, we introduce the motivation and rationale of some parametric and nonparametric versions of PCA specifically geared to binary data. Using both realistic simulations of binary data and the mutation, CNA and methylation data of the Genomic Determinants of Sensitivity in Cancer 1000 (GDSC1000), we explored the methods' performance with respect to finding the correct number of components, overfitting, recovering the correct low-dimensional structure, variable importance, etc. The results show that if a low-dimensional structure exists in the data, most of the methods can find it. When a probabilistic generating process is assumed to underlie the data, we recommend the parametric logistic PCA model; when such an assumption is not valid and the data are considered as given, we recommend the nonparametric Gifi model.
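For readers unfamiliar with the parametric model named in the abstract, the following is a minimal sketch of logistic PCA, not the authors' implementation. It assumes each binary entry is Bernoulli-distributed with log-odds given by a column offset plus a low-rank product of scores and loadings, and it fits the model with plain gradient descent, which is a stand-in chosen for brevity. The function name `logistic_pca`, its parameters, and the hand-picked step size `lr` are all hypothetical.

```python
import numpy as np


def logistic_pca(X, n_components=2, n_iter=300, lr=0.01, seed=0):
    """Fit X_ij ~ Bernoulli(sigmoid(mu_j + a_i . b_j)) by gradient descent.

    Returns the column offsets mu (p,), scores A (n, R), loadings B (p, R)
    and the negative log-likelihood recorded at the start of every iteration.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    rng = np.random.default_rng(seed)
    mu = np.zeros(p)
    # Small random start: with A = B = 0 exactly, the gradients for A and B vanish.
    A = 0.01 * rng.standard_normal((n, n_components))
    B = 0.01 * rng.standard_normal((p, n_components))
    losses = []
    for _ in range(n_iter):
        theta = mu + A @ B.T  # log-odds matrix, shape (n, p)
        # Bernoulli negative log-likelihood: sum of log(1 + exp(theta)) - x * theta
        losses.append(float(np.sum(np.logaddexp(0.0, theta) - X * theta)))
        # d(loss)/d(theta) = sigmoid(theta) - X; tanh form avoids overflow
        G = 0.5 * (1.0 + np.tanh(0.5 * theta)) - X
        grad_mu, grad_A, grad_B = G.sum(axis=0), G @ B, G.T @ A
        mu -= lr * grad_mu
        A -= lr * grad_A
        B -= lr * grad_B
    return mu, A, B, losses


# Illustrative use on simulated data (assumed rank-2 structure, made-up sizes).
rng = np.random.default_rng(1)
scores = rng.standard_normal((100, 2))
loadings = rng.standard_normal((20, 2))
prob = 0.5 * (1.0 + np.tanh(0.5 * (scores @ loadings.T - 0.5)))
X = rng.binomial(1, prob)
mu, A, B, losses = logistic_pca(X)
print(losses[0], losses[-1])
```

The sketch has no regularization or convergence check, so the fixed step size may need tuning for larger matrices, and the likelihood can keep improving by inflating the log-odds when the data are nearly separable.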
Original language: English
Pages (from-to): 1-13
Number of pages: 13
Journal: Briefings in Bioinformatics
Publication status: Published - 2017
Externally published: Yes


  • binary data
  • dimension reduction
  • logistic PCA
  • nonlinear PCA
  • optimal scaling
  • PCA

