WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads

M Patterson; T Marschall; N Pisanti; Leo van Iersel; L Stougie; GW Klau; A Schoenhuth

doi:10.1007/978-3-319-05269-4_19

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

26 Citations (Scopus)

Abstract

The human genome is diploid, that is, each of its chromosomes comes in two copies. This requires phasing the single nucleotide polymorphisms (SNPs), that is, assigning them to the two copies, beyond just detecting them. The resulting haplotypes, lists of SNPs belonging to each copy, are crucial for downstream analyses in population genetics. Currently, statistical approaches, which avoid making use of direct read information, constitute the state of the art. Haplotype assembly, which addresses phasing directly from sequencing reads, suffers from the fact that sequencing reads of the current generation are too short to serve the purposes of genome-wide phasing.
Future sequencing technologies, however, bear the promise of generating reads whose lengths and error rates allow bridging all SNP positions in the genome with a sufficient number of SNPs per read. Existing haplotype assembly approaches, however, profit, in terms of computational complexity, precisely from the limited length of current-generation reads, because their runtime is usually exponential in the number of SNPs per sequencing read. This implies that such approaches will not be able to exploit the benefits of sufficiently long future-generation reads.
Here, we suggest WhatsHap, a novel dynamic programming approach to haplotype assembly. It is the first approach that yields provably optimal solutions to the weighted minimum error correction (wMEC) problem in runtime linear in the number of SNPs per sequencing read, making it suitable for future-generation reads. WhatsHap is a fixed-parameter tractable (FPT) approach with coverage as the parameter. We demonstrate that WhatsHap can handle datasets of coverage up to 20x, processing chromosomes on standard workstations in only 1-2 hours. Our simulation study shows that the quality of haplotypes assembled by WhatsHap significantly improves with increasing read length, both in terms of genome coverage and in terms of switch errors. The switch error rates we achieve in our simulations are superior to those obtained by state-of-the-art statistical phasers.
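
To make the wMEC objective named in the abstract concrete, the following is a minimal Python sketch. It is not the WhatsHap dynamic program: it enumerates every bipartition of the reads, so it is exponential in the number of reads and only usable on toy inputs. The read encoding (a dict from SNP index to an `(allele, weight)` pair), the function names `wmec_cost` and `brute_force_wmec`, and the example data are all assumptions made for illustration. The sketch also assumes the common formulation in which each side of the bipartition may pick its consensus allele at each SNP independently.

```python
from itertools import product

def wmec_cost(reads, bipartition, n_snps):
    """Weighted cost of making each side of a read bipartition conflict-free.

    reads: list of dicts mapping SNP index -> (allele, weight), allele in {0, 1}
           (hypothetical encoding chosen for this sketch).
    bipartition: tuple with one 0/1 entry per read, naming the haplotype side.
    """
    total = 0
    for snp in range(n_snps):
        for side in (0, 1):
            # cost_to[a] = total weight that must be flipped so that every
            # read on this side shows allele a at this SNP.
            cost_to = [0, 0]
            for read, read_side in zip(reads, bipartition):
                if read_side == side and snp in read:
                    allele, weight = read[snp]
                    cost_to[1 - allele] += weight
            total += min(cost_to)
    return total

def brute_force_wmec(reads, n_snps):
    """Try all 2^n bipartitions and return (best_cost, best_bipartition)."""
    best = None
    for bipartition in product((0, 1), repeat=len(reads)):
        cost = wmec_cost(reads, bipartition, n_snps)
        if best is None or cost < best[0]:
            best = (cost, bipartition)
    return best

# Toy input: four reads over three SNPs. The third read disagrees with the
# first at SNP 1, but only with weight 1 (a low-confidence base call).
example_reads = [
    {0: (0, 5), 1: (0, 5)},
    {0: (1, 5), 1: (1, 5)},
    {0: (0, 5), 1: (1, 1)},
    {1: (1, 4), 2: (0, 3)},
]

if __name__ == "__main__":
    print(brute_force_wmec(example_reads, 3))
```

On this toy input the cheapest solution groups the first and third reads on one side and the second and fourth on the other, at the cost of flipping the single weight-1 entry. The abstract's point is that WhatsHap reaches the same optimum without enumerating bipartitions of all reads, with coverage rather than read length driving the exponential part of the runtime.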

Original language	English
Title of host publication	Proceedings 18th Annual International Conference on Research in Computational Molecular Biology
Editors	R Sharan
Place of Publication	Cham
Publisher	Springer
Pages	237-249
Number of pages	13
DOIs	https://doi.org/10.1007/978-3-319-05269-4_19
Publication status	Published - 2014
Externally published	Yes
Event	RECOMB2014, Pittsburgh, PA, USA - Cham, Switzerland
Duration	2 Apr 2014 → 5 Apr 2014

Publication series

Name	Lecture Notes in Computer Science
Volume	8394
ISSN (Print)	0302-9743

Conference

Conference	RECOMB2014, Pittsburgh, PA, USA
Period	2/04/14 → 5/04/14

Access to Document

10.1007/978-3-319-05269-4_19

Cite this

Patterson, M., Marschall, T., Pisanti, N., van Iersel, L., Stougie, L., Klau, GW., & Schoenhuth, A. (2014). WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads. In R. Sharan (Ed.), Proceedings 18th Annual International Conference on Research in Computational Molecular Biology (pp. 237-249). (Lecture Notes in Computer Science; Vol. 8394). Springer. https://doi.org/10.1007/978-3-319-05269-4_19

@inproceedings{ff753a39b4f04333ae6ee3be9b09eaca,

title = "WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads",

abstract = "The human genome is diploid, that is, each of its chromosomes comes in two copies. This requires phasing the single nucleotide polymorphisms (SNPs), that is, assigning them to the two copies, beyond just detecting them. The resulting haplotypes, lists of SNPs belonging to each copy, are crucial for downstream analyses in population genetics. Currently, statistical approaches, which avoid making use of direct read information, constitute the state of the art. Haplotype assembly, which addresses phasing directly from sequencing reads, suffers from the fact that sequencing reads of the current generation are too short to serve the purposes of genome-wide phasing. Future sequencing technologies, however, bear the promise of generating reads whose lengths and error rates allow bridging all SNP positions in the genome with a sufficient number of SNPs per read. Existing haplotype assembly approaches, however, profit, in terms of computational complexity, precisely from the limited length of current-generation reads, because their runtime is usually exponential in the number of SNPs per sequencing read. This implies that such approaches will not be able to exploit the benefits of sufficiently long future-generation reads. Here, we suggest WhatsHap, a novel dynamic programming approach to haplotype assembly. It is the first approach that yields provably optimal solutions to the weighted minimum error correction (wMEC) problem in runtime linear in the number of SNPs per sequencing read, making it suitable for future-generation reads. WhatsHap is a fixed-parameter tractable (FPT) approach with coverage as the parameter. We demonstrate that WhatsHap can handle datasets of coverage up to 20x, processing chromosomes on standard workstations in only 1-2 hours. Our simulation study shows that the quality of haplotypes assembled by WhatsHap significantly improves with increasing read length, both in terms of genome coverage and in terms of switch errors. The switch error rates we achieve in our simulations are superior to those obtained by state-of-the-art statistical phasers.",

author = "M Patterson and T Marschall and N Pisanti and {van Iersel}, Leo and L Stougie and GW Klau and A Schoenhuth",

year = "2014",

doi = "10.1007/978-3-319-05269-4_19",

language = "English",

series = "Lecture Notes in Computer Science",

volume = "8394",

publisher = "Springer",

address = "Cham",

pages = "237--249",

editor = "R Sharan",

booktitle = "Proceedings 18th Annual International Conference on Research in Computational Molecular Biology",

note = "RECOMB2014, Pittsburgh, PA, USA ; Conference date: 02-04-2014 Through 05-04-2014",

}

Patterson, M, Marschall, T, Pisanti, N, van Iersel, L, Stougie, L, Klau, GW & Schoenhuth, A 2014, WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads. in R Sharan (ed.), Proceedings 18th Annual International Conference on Research in Computational Molecular Biology. Lecture Notes in Computer Science, vol. 8394, Springer, Cham, pp. 237-249, RECOMB2014, Pittsburgh, PA, USA, 2/04/14. https://doi.org/10.1007/978-3-319-05269-4_19

WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads. / Patterson, M; Marschall, T; Pisanti, N et al.
Proceedings 18th Annual International Conference on Research in Computational Molecular Biology. ed. / R Sharan. Cham: Springer, 2014. p. 237-249 (Lecture Notes in Computer Science; Vol. 8394).

TY - GEN

T1 - WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads

T2 - RECOMB2014, Pittsburgh, PA, USA

AU - Patterson, M

AU - Marschall, T

AU - Pisanti, N

AU - van Iersel, Leo

AU - Stougie, L

AU - Klau, GW

AU - Schoenhuth, A

PY - 2014

Y1 - 2014

AB - The human genome is diploid, that is, each of its chromosomes comes in two copies. This requires phasing the single nucleotide polymorphisms (SNPs), that is, assigning them to the two copies, beyond just detecting them. The resulting haplotypes, lists of SNPs belonging to each copy, are crucial for downstream analyses in population genetics. Currently, statistical approaches, which avoid making use of direct read information, constitute the state of the art. Haplotype assembly, which addresses phasing directly from sequencing reads, suffers from the fact that sequencing reads of the current generation are too short to serve the purposes of genome-wide phasing. Future sequencing technologies, however, bear the promise of generating reads whose lengths and error rates allow bridging all SNP positions in the genome with a sufficient number of SNPs per read. Existing haplotype assembly approaches, however, profit, in terms of computational complexity, precisely from the limited length of current-generation reads, because their runtime is usually exponential in the number of SNPs per sequencing read. This implies that such approaches will not be able to exploit the benefits of sufficiently long future-generation reads. Here, we suggest WhatsHap, a novel dynamic programming approach to haplotype assembly. It is the first approach that yields provably optimal solutions to the weighted minimum error correction (wMEC) problem in runtime linear in the number of SNPs per sequencing read, making it suitable for future-generation reads. WhatsHap is a fixed-parameter tractable (FPT) approach with coverage as the parameter. We demonstrate that WhatsHap can handle datasets of coverage up to 20x, processing chromosomes on standard workstations in only 1-2 hours. Our simulation study shows that the quality of haplotypes assembled by WhatsHap significantly improves with increasing read length, both in terms of genome coverage and in terms of switch errors. The switch error rates we achieve in our simulations are superior to those obtained by state-of-the-art statistical phasers.

U2 - 10.1007/978-3-319-05269-4_19

DO - 10.1007/978-3-319-05269-4_19

M3 - Conference contribution

T3 - Lecture Notes in Computer Science

SP - 237

EP - 249

BT - Proceedings 18th Annual International Conference on Research in Computational Molecular Biology

A2 - Sharan, R

PB - Springer

CY - Cham

Y2 - 2 April 2014 through 5 April 2014

ER -
