SCHNEL: Scalable clustering of high dimensional single-cell data

Tamim Abdelaal; Paul de Raadt; Boudewijn P.F. Lelieveldt; Marcel J.T. Reinders; Ahmed Mahfouz

doi:10.1093/bioinformatics/btaa816

Pattern Recognition and Bioinformatics

Research output: Contribution to journal › Article › Scientific › peer-review

3 Citations (Scopus)

Abstract

MOTIVATION: Single cell data measures multiple cellular markers at the single-cell level for thousands to millions of cells. Identification of distinct cell populations is a key step for further biological understanding, usually performed by clustering this data. Dimensionality reduction based clustering tools are either not scalable to large datasets containing millions of cells, or not fully automated requiring an initial manual estimation of the number of clusters. Graph clustering tools provide automated and reliable clustering for single cell data, but suffer heavily from scalability to large datasets.

RESULTS: We developed SCHNEL, a scalable, reliable and automated clustering tool for high-dimensional single-cell data. SCHNEL transforms large high-dimensional data to a hierarchy of datasets containing subsets of data points following the original data manifold. The novel approach of SCHNEL combines this hierarchical representation of the data with graph clustering, making graph clustering scalable to millions of cells. Using seven different cytometry datasets, SCHNEL outperformed three popular clustering tools for cytometry data, and was able to produce meaningful clustering results for datasets of 3.5 and 17.2 million cells within workable time frames. In addition, we show that SCHNEL is a general clustering tool by applying it to single-cell RNA sequencing data, as well as a popular machine learning benchmark dataset MNIST.

AVAILABILITY AND IMPLEMENTATION: Implementation is available on GitHub (https://github.com/biovault/SCHNELpy). All datasets used in this study are publicly available.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Original language	English
Pages (from-to)	i849-i856
Number of pages	8
Journal	Bioinformatics (Oxford, England)
Volume	36
Issue number	Issue Supplement 2
DOIs	https://doi.org/10.1093/bioinformatics/btaa816
Publication status	Published - 2020

Cite this

@article{7d9a9c69dc9046aaa3496b1dafe8139f,
title = "SCHNEL: Scalable clustering of high dimensional single-cell data",

abstract = "MOTIVATION: Single cell data measures multiple cellular markers at the single-cell level for thousands to millions of cells. Identification of distinct cell populations is a key step for further biological understanding, usually performed by clustering this data. Dimensionality reduction based clustering tools are either not scalable to large datasets containing millions of cells, or not fully automated requiring an initial manual estimation of the number of clusters. Graph clustering tools provide automated and reliable clustering for single cell data, but suffer heavily from scalability to large datasets. RESULTS: We developed SCHNEL, a scalable, reliable and automated clustering tool for high-dimensional single-cell data. SCHNEL transforms large high-dimensional data to a hierarchy of datasets containing subsets of data points following the original data manifold. The novel approach of SCHNEL combines this hierarchical representation of the data with graph clustering, making graph clustering scalable to millions of cells. Using seven different cytometry datasets, SCHNEL outperformed three popular clustering tools for cytometry data, and was able to produce meaningful clustering results for datasets of 3.5 and 17.2 million cells within workable time frames. In addition, we show that SCHNEL is a general clustering tool by applying it to single-cell RNA sequencing data, as well as a popular machine learning benchmark dataset MNIST. AVAILABILITY AND IMPLEMENTATION: Implementation is available on GitHub (https://github.com/biovault/SCHNELpy). All datasets used in this study are publicly available. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.",

author = "Abdelaal, Tamim and {de Raadt}, Paul and Lelieveldt, {Boudewijn P.F.} and Reinders, {Marcel J.T.} and Mahfouz, Ahmed",
year = "2020",
doi = "10.1093/bioinformatics/btaa816",
language = "English",
volume = "36",
pages = "i849--i856",
journal = "Bioinformatics (Oxford, England)",
issn = "1367-4811",
publisher = "Oxford University Press",
number = "Issue Supplement 2",
}

TY - JOUR
T1 - SCHNEL
T2 - Scalable clustering of high dimensional single-cell data
AU - Abdelaal, Tamim
AU - de Raadt, Paul
AU - Lelieveldt, Boudewijn P.F.
AU - Reinders, Marcel J.T.
AU - Mahfouz, Ahmed
PY - 2020
Y1 - 2020

AB - MOTIVATION: Single cell data measures multiple cellular markers at the single-cell level for thousands to millions of cells. Identification of distinct cell populations is a key step for further biological understanding, usually performed by clustering this data. Dimensionality reduction based clustering tools are either not scalable to large datasets containing millions of cells, or not fully automated requiring an initial manual estimation of the number of clusters. Graph clustering tools provide automated and reliable clustering for single cell data, but suffer heavily from scalability to large datasets. RESULTS: We developed SCHNEL, a scalable, reliable and automated clustering tool for high-dimensional single-cell data. SCHNEL transforms large high-dimensional data to a hierarchy of datasets containing subsets of data points following the original data manifold. The novel approach of SCHNEL combines this hierarchical representation of the data with graph clustering, making graph clustering scalable to millions of cells. Using seven different cytometry datasets, SCHNEL outperformed three popular clustering tools for cytometry data, and was able to produce meaningful clustering results for datasets of 3.5 and 17.2 million cells within workable time frames. In addition, we show that SCHNEL is a general clustering tool by applying it to single-cell RNA sequencing data, as well as a popular machine learning benchmark dataset MNIST. AVAILABILITY AND IMPLEMENTATION: Implementation is available on GitHub (https://github.com/biovault/SCHNELpy). All datasets used in this study are publicly available. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

UR - http://www.scopus.com/inward/record.url?scp=85099232232&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btaa816
DO - 10.1093/bioinformatics/btaa816
M3 - Article
C2 - 33381821
AN - SCOPUS:85099232232
SN - 1367-4811
VL - 36
SP - i849-i856
JO - Bioinformatics (Oxford, England)
JF - Bioinformatics (Oxford, England)
IS - Issue Supplement 2
ER -
