A comparison of automatic cell identification methods for single-cell RNA sequencing data

Tamim Abdelaal; Lieke C.M. Michielsen; Davy Cats; Dylan Hoogduin; Hailiang Mei; Marcel J.T. Reinders; Ahmed Mahfouz

doi:10.1186/s13059-019-1795-z

Pattern Recognition and Bioinformatics

Research output: Contribution to journal › Article › Scientific › peer-review

296 Citations (Scopus)

178 Downloads (Pure)

Abstract

Background: Single-cell transcriptomics is rapidly advancing our understanding of the cellular composition of complex tissues and organisms. A major limitation in most analysis pipelines is the reliance on manual annotations to determine cell identities, which are time-consuming and irreproducible. The exponential growth in the number of cells and samples has prompted the adaptation and development of supervised classification methods for automatic cell identification.

Results: Here, we benchmarked 22 classification methods that automatically assign cell identities including single-cell-specific and general-purpose classifiers. The performance of the methods is evaluated using 27 publicly available single-cell RNA sequencing datasets of different sizes, technologies, species, and levels of complexity. We use 2 experimental setups to evaluate the performance of each method for within dataset predictions (intra-dataset) and across datasets (inter-dataset) based on accuracy, percentage of unclassified cells, and computation time. We further evaluate the methods' sensitivity to the input features, number of cells per population, and their performance across different annotation levels and datasets. We find that most classifiers perform well on a variety of datasets with decreased accuracy for complex datasets with overlapping classes or deep annotations. The general-purpose support vector machine classifier has overall the best performance across the different experiments.

Conclusions: We present a comprehensive evaluation of automatic cell identification methods for single-cell RNA sequencing data. All the code used for the evaluation is available on GitHub (https://github.com/tabdelaal/scRNAseq-Benchmark). Additionally, we provide a Snakemake workflow to facilitate the benchmarking and to support the extension of new methods and new datasets.
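The three evaluation criteria named in the abstract (accuracy, percentage of unclassified cells, and computation time) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the benchmark repository: a toy nearest-centroid classifier with a distance-based rejection rule stands in for the benchmarked methods, the `Unlabeled` label, the `max_distance` threshold, and the two-gene toy data are hypothetical, and rejected cells are counted as incorrect when computing accuracy.

```python
import math
import time
from collections import defaultdict

UNLABELED = "Unlabeled"  # hypothetical label for cells the classifier rejects


def euclidean(a, b):
    """Euclidean distance between two expression vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_centroids(X, y):
    """Toy classifier (an assumption): mean expression vector per cell type."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


def predict(centroids, X, max_distance):
    """Assign each cell to its nearest centroid, or reject it if too far away."""
    predictions = []
    for row in X:
        label, d = min(
            ((lab, euclidean(row, c)) for lab, c in centroids.items()),
            key=lambda pair: pair[1],
        )
        predictions.append(label if d <= max_distance else UNLABELED)
    return predictions


def evaluate(X_train, y_train, X_test, y_test, max_distance=2.0):
    """Return accuracy, percentage of unclassified cells, and elapsed seconds."""
    start = time.perf_counter()
    centroids = fit_centroids(X_train, y_train)
    y_pred = predict(centroids, X_test, max_distance)
    elapsed = time.perf_counter() - start

    # Assumption: a rejected cell never matches its true label, so it lowers accuracy.
    correct = sum(p == t for p, t in zip(y_pred, y_test))
    rejected = sum(p == UNLABELED for p in y_pred)
    return {
        "accuracy": correct / len(y_test),
        "pct_unclassified": 100.0 * rejected / len(y_test),
        "seconds": elapsed,
    }


# Hypothetical toy data: two genes, two cell types.
X_train = [[0, 0], [0, 1], [5, 5], [5, 6]]
y_train = ["T cell", "T cell", "B cell", "B cell"]
X_test = [[0, 0.5], [5, 5], [20, 20], [1, 0]]
y_test = ["T cell", "B cell", "B cell", "T cell"]

result = evaluate(X_train, y_train, X_test, y_test)
print(result["accuracy"], result["pct_unclassified"])
```

In this sketch the third test cell lies far from both centroids and is rejected, so it contributes to the unclassified percentage and counts against accuracy. An intra-dataset evaluation would hold out part of a single dataset for testing, while an inter-dataset evaluation would train on one dataset and test on another; the same `evaluate` call covers both, depending on which cells are passed in.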

Original language	English
Article number	194
Pages (from-to)	1-19
Number of pages	19
Journal	Genome Biology
Volume	20
Issue number	1
DOIs	https://doi.org/10.1186/s13059-019-1795-z
Publication status	Published - 2019

Keywords

Benchmark
Cell identity
Classification
scRNA-seq

Access to Document

10.1186/s13059-019-1795-z

s13059-019-1795-z: Final published version, 2.51 MB, Licence: CC BY

Cite this

@article{d6a117dd0b33495b89131093ca49e023,
title = "A comparison of automatic cell identification methods for single-cell RNA sequencing data",
abstract = "Background: Single-cell transcriptomics is rapidly advancing our understanding of the cellular composition of complex tissues and organisms. A major limitation in most analysis pipelines is the reliance on manual annotations to determine cell identities, which are time-consuming and irreproducible. The exponential growth in the number of cells and samples has prompted the adaptation and development of supervised classification methods for automatic cell identification. Results: Here, we benchmarked 22 classification methods that automatically assign cell identities including single-cell-specific and general-purpose classifiers. The performance of the methods is evaluated using 27 publicly available single-cell RNA sequencing datasets of different sizes, technologies, species, and levels of complexity. We use 2 experimental setups to evaluate the performance of each method for within dataset predictions (intra-dataset) and across datasets (inter-dataset) based on accuracy, percentage of unclassified cells, and computation time. We further evaluate the methods' sensitivity to the input features, number of cells per population, and their performance across different annotation levels and datasets. We find that most classifiers perform well on a variety of datasets with decreased accuracy for complex datasets with overlapping classes or deep annotations. The general-purpose support vector machine classifier has overall the best performance across the different experiments. Conclusions: We present a comprehensive evaluation of automatic cell identification methods for single-cell RNA sequencing data. All the code used for the evaluation is available on GitHub (https://github.com/tabdelaal/scRNAseq-Benchmark). Additionally, we provide a Snakemake workflow to facilitate the benchmarking and to support the extension of new methods and new datasets.",
keywords = "Benchmark, Cell identity, Classification, scRNA-seq",
author = "Tamim Abdelaal and Michielsen, {Lieke C.M.} and Davy Cats and Dylan Hoogduin and Hailiang Mei and Reinders, {Marcel J.T.} and Ahmed Mahfouz",
year = "2019",
doi = "10.1186/s13059-019-1795-z",
language = "English",
volume = "20",
pages = "1--19",
journal = "Genome Biology",
issn = "1474-760X",
publisher = "BioMed Central",
number = "1",
}

TY - JOUR
T1 - A comparison of automatic cell identification methods for single-cell RNA sequencing data
AU - Abdelaal, Tamim
AU - Michielsen, Lieke C.M.
AU - Cats, Davy
AU - Hoogduin, Dylan
AU - Mei, Hailiang
AU - Reinders, Marcel J.T.
AU - Mahfouz, Ahmed
PY - 2019
Y1 - 2019
N2 - Background: Single-cell transcriptomics is rapidly advancing our understanding of the cellular composition of complex tissues and organisms. A major limitation in most analysis pipelines is the reliance on manual annotations to determine cell identities, which are time-consuming and irreproducible. The exponential growth in the number of cells and samples has prompted the adaptation and development of supervised classification methods for automatic cell identification. Results: Here, we benchmarked 22 classification methods that automatically assign cell identities including single-cell-specific and general-purpose classifiers. The performance of the methods is evaluated using 27 publicly available single-cell RNA sequencing datasets of different sizes, technologies, species, and levels of complexity. We use 2 experimental setups to evaluate the performance of each method for within dataset predictions (intra-dataset) and across datasets (inter-dataset) based on accuracy, percentage of unclassified cells, and computation time. We further evaluate the methods' sensitivity to the input features, number of cells per population, and their performance across different annotation levels and datasets. We find that most classifiers perform well on a variety of datasets with decreased accuracy for complex datasets with overlapping classes or deep annotations. The general-purpose support vector machine classifier has overall the best performance across the different experiments. Conclusions: We present a comprehensive evaluation of automatic cell identification methods for single-cell RNA sequencing data. All the code used for the evaluation is available on GitHub (https://github.com/tabdelaal/scRNAseq-Benchmark). Additionally, we provide a Snakemake workflow to facilitate the benchmarking and to support the extension of new methods and new datasets.
KW - Benchmark
KW - Cell identity
KW - Classification
KW - scRNA-seq
UR - http://www.scopus.com/inward/record.url?scp=85071972555&partnerID=8YFLogxK
U2 - 10.1186/s13059-019-1795-z
DO - 10.1186/s13059-019-1795-z
M3 - Article
C2 - 31500660
SN - 1474-760X
VL - 20
SP - 1
EP - 19
JO - Genome Biology
JF - Genome Biology
IS - 1
M1 - 194
ER -
