High-Performance Cluster-Scalable Computational Methods for Genomics Applications

T. Ahmad

Research output: Thesis › Dissertation (TU Delft)

Abstract

The ever-increasing pace of advancements in sequencing technologies has made rapid DNA/genome sequencing much more accessible. In particular, next (second) and third generation sequencing technologies offer high-throughput, massively parallel and cost-effective sequencing solutions. Individual sample sequencing data volumes as well as the number of assembled genomes are also growing quickly. These advances in high-throughput sequencing technologies, together with the demand for fast computational processing and downstream analysis of sequencing data in clinical settings, are widening the gap between the time spent on sample collection and sequencing and the time spent on computational analysis.

To improve the scalability and performance of genome variant calling analysis workflows on modern computing systems, this dissertation explores four research directions. First, to exploit hardware features of modern processors, such as multiple cores and vector units, in the GATK best practices variant calling pipelines, we introduce ArrowSAM, a columnar in-memory data format that places and processes genomics data in memory, thus removing the need for repeated file storage accesses in intermediate variant calling pipeline applications. Our second contribution focuses on integrating the Apache Arrow based columnar in-memory data format into the PySpark API, which enables vectorized operations in the Python language through user-defined functions on Spark dataframes. For our third research contribution, we tested and benchmarked both the scalability and performance of Arrow Flight for client-server as well as cluster-scaled communication. For our final research contribution reported in this dissertation, we implemented an orthogonal approach that is even more scalable than Apache Spark and Arrow Flight based solutions and offers the flexibility to use many different variant callers.
Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • Delft University of Technology
Supervisors/Advisors
  • Al-Ars, Z., Supervisor
  • Hofstee, H.P., Supervisor
Award date: 4 Jul 2022
DOIs
Publication status: Published - 2022

Funding

Punjab Educational Endowment Fund (PEEF)

Keywords

  • Genomics
  • Variant Calling
  • Apache Arrow
  • Apache Spark
  • MPI
