Streaming Distributed DNA Sequence Alignment Using Apache Spark

Hamid Mushtaq, Nauman Ahmed, Zaid Al-Ars

Research output: Chapter in Book/Conference proceedings/Edited volume; Conference contribution; Scientific; peer-review

6 Citations (Scopus)


The large amount of data generated by Next-Generation Sequencing (NGS) technology, usually on the order of hundreds of gigabytes per experiment, has to be analyzed quickly to generate meaningful variant results. The first step in analyzing such data is to map the sequenced reads to their corresponding positions in the human genome. One of the most popular tools for such sequence alignment is the Burrows-Wheeler Aligner (BWA mem). One limitation of the BWA program, however, is that it cannot be run on a cluster. In this paper, we propose StreamBWA, a new framework that allows the BWA mem program to run on a cluster in a distributed fashion while the input data is still being streamed into the cluster. It can process the input data directly from a compressed file, which resides either on the local file system or at a URL. Moreover, StreamBWA can start combining the output files of the distributed BWA mem tasks while these tasks are still being executed on the cluster. Empirical evaluation shows that this streaming distributed approach is approximately 2x faster than the non-streaming approach. Furthermore, our streaming distributed approach is 5x faster than other state-of-the-art solutions such as SparkBWA. The source code of StreamBWA is publicly available at
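As a rough illustration of the streaming idea in the abstract, the sketch below reads a gzip-compressed FASTQ stream incrementally and yields fixed-size chunks of reads as soon as each chunk is complete, so that a downstream scheduler could dispatch a chunk while the rest of the input is still arriving. This is not StreamBWA's code: the function name `stream_fastq_chunks`, the `reads_per_chunk` parameter, and the in-memory demo input are all assumptions made for illustration, and the actual alignment and cluster dispatch steps are omitted.

```python
import gzip
import io
from typing import BinaryIO, Iterator, List


def stream_fastq_chunks(source: BinaryIO, reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield chunks of FASTQ records (4 lines each) as soon as each chunk fills.

    Hypothetical helper: `source` is any binary file-like object holding
    gzip-compressed FASTQ data (a local file, or a stream opened from a URL).
    A trailing record with fewer than 4 lines is silently dropped.
    """
    with gzip.open(source, mode="rt") as text:
        chunk: List[str] = []
        record: List[str] = []
        for line in text:  # decompresses lazily, line by line
            record.append(line.rstrip("\n"))
            if len(record) == 4:  # one complete FASTQ record
                chunk.append("\n".join(record))
                record = []
                if len(chunk) == reads_per_chunk:
                    yield chunk  # hand off before reading further input
                    chunk = []
        if chunk:  # final, possibly smaller, chunk
            yield chunk


def make_demo_input(n_reads: int) -> BinaryIO:
    """Build a small gzip-compressed FASTQ stream in memory (demo data only)."""
    raw = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(n_reads))
    return io.BytesIO(gzip.compress(raw.encode("ascii")))


chunks = list(stream_fastq_chunks(make_demo_input(5), reads_per_chunk=2))
print([len(c) for c in chunks])
```

In a real deployment each yielded chunk would be handed to an alignment task instead of being collected into a list; collecting is done here only to keep the example self-contained.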
Original language: English
Title of host publication: 2017 IEEE 17th International Conference on BioInformatics and BioEngineering (BIBE)
Place of Publication: Piscataway
Number of pages: 6
ISBN (Electronic): 978-1-5386-1324-5
ISBN (Print): 978-1-5386-1325-2
Publication status: Published - 2017
Event: BIBE 2017: 17th IEEE International Conference on BioInformatics and BioEngineering - Washington DC, United States
Duration: 23 Oct 2017 to 25 Oct 2017


Conference: BIBE 2017
Abbreviated title: BIBE 2017
Country: United States
City: Washington DC


  • DNA
  • Micromechanical devices
  • Pipelines
  • Tools
  • Sparks
  • Big Data


