ArrowSAM: In-Memory Genomics Data Processing Using Apache Arrow

Tanveer Ahmad, Nauman Ahmed, Johan Peltenburg, Zaid Al-Ars

Research output: Chapter in Book/Conference proceedings/Edited volume, Conference contribution, Scientific, peer-review

5 Citations (Scopus)
226 Downloads (Pure)

Abstract

The rapidly growing size of genomics databases, driven by advances in sequencing technologies, demands fast and cost-effective processing. However, processing this data creates many challenges, particularly in selecting appropriate algorithms and computing platforms. Computing systems need data close to the processor for fast processing. Traditionally, due to the cost, volatility and other physical constraints of DRAM, it was not feasible to keep large working data sets in memory. However, emerging storage class memories allow big data to be stored and processed closer to the processor. In this work, we show how the commonly used genomics data format, Sequence Alignment/Map (SAM), can be presented in the Apache Arrow in-memory data representation to benefit from in-memory processing and to ensure better scalability through shared memory objects, by avoiding large (de)serialization overheads in cross-language interoperability. To demonstrate the benefits of such a system, we propose ArrowSAM, an in-memory SAM format that uses the Apache Arrow framework, and integrate it into genome pre-processing pipelines including BWA-MEM, Picard and Sambamba. Results show 15x and 2.4x speedups as compared to Picard and Sambamba, respectively. The code and scripts for running all workflows are freely available at https://github.com/abs-tudelft/ArrowSAM.
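As a rough illustration of what a columnar view of SAM means, the sketch below splits SAM alignment lines into one column per mandatory field and then sorts all columns by coordinate with a single index permutation. It is a minimal plain-Python sketch, not the ArrowSAM implementation and not the Apache Arrow API: ordinary lists stand in for Arrow arrays, the helper names `sam_to_columns` and `sort_by_coordinate` are hypothetical, and the two records are made up for the example.

```python
# Illustrative sketch only: plain-Python lists stand in for Arrow arrays.
# The 11 mandatory SAM fields, in the order defined by the SAM format.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def sam_to_columns(lines):
    """Turn SAM alignment lines into one list per mandatory field (hypothetical helper)."""
    columns = {name: [] for name in SAM_FIELDS}
    for line in lines:
        if not line.strip() or line.startswith("@"):  # skip blank and header lines
            continue
        values = line.rstrip("\n").split("\t")
        # zip stops after the 11 mandatory fields, so optional tags are ignored here.
        for name, value in zip(SAM_FIELDS, values):
            columns[name].append(int(value) if name in INT_FIELDS else value)
    return columns


def sort_by_coordinate(columns):
    """Reorder every column by (RNAME, POS) using one shared index permutation."""
    order = sorted(range(len(columns["POS"])),
                   key=lambda i: (columns["RNAME"][i], columns["POS"][i]))
    return {name: [col[i] for i in order] for name, col in columns.items()}


# Made-up records for illustration (not taken from the paper).
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tTTGA\tIIII",
]
cols = sort_by_coordinate(sam_to_columns(example))
print(cols["QNAME"], cols["POS"])
```

In an Arrow-based design, each of these columns would instead be a typed, contiguous buffer that several tools can read from shared memory without copying or re-parsing the records.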

Original language: English
Title of host publication: 2020 3rd International Conference on Computer Applications & Information Security (ICCAIS)
Subtitle of host publication: Proceedings
Publisher: IEEE
Pages: 1-6
Number of pages: 6
ISBN (Electronic): 978-1-7281-4213-5
ISBN (Print): 978-1-7281-4214-2
Publication status: Published - 2020
Event: 3rd International Conference on Computer Applications and Information Security, ICCAIS 2020 - Riyadh, Saudi Arabia
Duration: 19 Mar 2020 → 21 Mar 2020

Conference

Conference: 3rd International Conference on Computer Applications and Information Security, ICCAIS 2020
Country/Territory: Saudi Arabia
City: Riyadh
Period: 19/03/20 → 21/03/20

Bibliographical note

Accepted author manuscript

Keywords

  • Apache Arrow
  • Big Data
  • Genomics
  • In-Memory
  • Parallel Processing
  • Whole Genome/Exome Sequencing
