Refine and recycle: A method to increase decompression parallelism

Jian Fang; Jianyu Chen; Jinho Lee; Zaid Al-Ars; H. Peter Hofstee

doi:10.1109/ASAP.2019.00017

Computer Engineering

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

6 Citations (Scopus)

Abstract

Rapid increases in storage bandwidth, combined with a desire for operating on large datasets interactively, drive the need for improvements in high-bandwidth decompression. Existing designs either process only one token per cycle or process multiple tokens per cycle with low area efficiency and/or low clock frequency. We propose two techniques to achieve high single-decoder throughput at improved efficiency by keeping only a single copy of the history data across multiple BRAMs and operating on each BRAM independently. A first stage efficiently refines the tokens into commands that operate on a single BRAM and steers the commands to the appropriate one. In the second stage, a relaxed execution model is used where each BRAM command executes immediately and those with invalid data are recycled to avoid stalls caused by the read-after-write dependency. We apply these techniques to Snappy decompression and implement a Snappy decompression accelerator on a CAPI2-attached FPGA platform equipped with a Xilinx VU3P FPGA. Experimental results show that our proposed method achieves up to 7.2 GB/s output throughput per decompressor, with each decompressor using 14.2% of the logic and 7% of the BRAM resources of the device. Therefore, a single decompressor can easily keep pace with an NVMe device (PCIe Gen3 x4) on a small FPGA, while a larger device, integrated on a host bridge adapter and instantiating multiple decompressors, can keep pace with the full OpenCAPI 3.0 bandwidth of 25 GB/s.
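The two stages described in the abstract can be illustrated with a small software toy. This is only a sketch under stated assumptions, not the hardware design from the paper: the bank count (NUM_BANKS), the byte-granular interleaving of history across banks, the tuple-based token and command formats, and the function names refine and execute are all hypothetical choices made for illustration. Tokens are assumed to be well formed (every copy offset points at earlier output), and write-port conflicts between banks are ignored.

```python
from collections import deque

NUM_BANKS = 4  # assumed number of history banks (stand-ins for BRAMs)


def refine(tokens):
    """Stage 1 (toy): split tokens into single-byte commands, one bank each."""
    queues = [deque() for _ in range(NUM_BANKS)]
    pos = 0  # next output position in the history buffer
    for tok in tokens:
        if tok[0] == "literal":
            for byte in tok[1]:
                # Literal data is already known; steer by destination bank.
                queues[pos % NUM_BANKS].append(("write", pos, byte))
                pos += 1
        else:  # ("copy", offset, length)
            _, offset, length = tok
            for _ in range(length):
                src = pos - offset
                # A copy must read history; steer by the bank holding the source.
                queues[src % NUM_BANKS].append(("copy", pos, src))
                pos += 1
    return queues, pos


def execute(queues, size):
    """Stage 2 (toy): each bank issues one command per round, never stalling."""
    history = [0] * size
    valid = [False] * size  # one valid flag per history byte
    recycled = 0
    while any(queues):
        for q in queues:
            if not q:
                continue
            cmd = q.popleft()
            if cmd[0] == "write":
                _, dst, byte = cmd
                history[dst], valid[dst] = byte, True
            else:
                _, dst, src = cmd
                if valid[src]:
                    history[dst], valid[dst] = history[src], True
                else:
                    # Source not written yet (read-after-write hazard):
                    # recycle the command instead of blocking the bank.
                    q.append(cmd)
                    recycled += 1
    return bytes(history), recycled


tokens = [("literal", b"abcd"), ("copy", 4, 6), ("literal", b"!")]
queues, size = refine(tokens)
output, recycled = execute(queues, size)
print(output, recycled)
```

The loop terminates because the pending command with the smallest destination always has a valid source, so at least one command retires each time its queue cycles around. The recycled counter gives a rough feel for how often the relaxed model defers a command rather than stalling.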

Original language	English
Title of host publication	2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP)
Subtitle of host publication	Proceedings
Publisher	IEEE
Pages	272-280
Number of pages	9
ISBN (Electronic)	978-1-7281-1601-3
ISBN (Print)	978-1-7281-1602-0
DOIs	https://doi.org/10.1109/ASAP.2019.00017
Publication status	Published - 2019
Event	30th IEEE International Conference on Application-Specific Systems, Architectures and Processors, ASAP 2019 - New York, United States Duration: 15 Jul 2019 → 17 Jul 2019

Publication series

Name	2019 IEEE 30TH INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS (ASAP 2019)
ISSN (Print)	2160-0511

Conference

Conference	30th IEEE International Conference on Application-Specific Systems, Architectures and Processors, ASAP 2019
Country/Territory	United States
City	New York
Period	15/07/19 → 17/07/19

Keywords

Acceleration
Decompression
FPGA
Snappy

Access to Document

10.1109/ASAP.2019.00017

Cite this

Fang, J., Chen, J., Lee, J., Al-Ars, Z., & Hofstee, H. P. (2019). Refine and recycle: A method to increase decompression parallelism. In 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP): Proceedings (pp. 272-280). Article 8825015 (2019 IEEE 30TH INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS (ASAP 2019)). IEEE. https://doi.org/10.1109/ASAP.2019.00017

Fang, Jian; Chen, Jianyu; Lee, Jinho et al. / Refine and recycle: A method to increase decompression parallelism. 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP): Proceedings. IEEE, 2019. pp. 272-280 (2019 IEEE 30TH INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS (ASAP 2019)).

@inproceedings{994ce1fc01304885bbdce6874103dbe1,

title = "Refine and recycle: A method to increase decompression parallelism",

abstract = "Rapid increases in storage bandwidth, combined with a desire for operating on large datasets interactively, drive the need for improvements in high-bandwidth decompression. Existing designs either process only one token per cycle or process multiple tokens per cycle with low area efficiency and/or low clock frequency. We propose two techniques to achieve high single-decoder throughput at improved efficiency by keeping only a single copy of the history data across multiple BRAMs and operating on each BRAM independently. A first stage efficiently refines the tokens into commands that operate on a single BRAM and steers the commands to the appropriate one. In the second stage, a relaxed execution model is used where each BRAM command executes immediately and those with invalid data are recycled to avoid stalls caused by the read-after-write dependency. We apply these techniques to Snappy decompression and implement a Snappy decompression accelerator on a CAPI2-attached FPGA platform equipped with a Xilinx VU3P FPGA. Experimental results show that our proposed method achieves up to 7.2 GB/s output throughput per decompressor, with each decompressor using 14.2% of the logic and 7% of the BRAM resources of the device. Therefore, a single decompressor can easily keep pace with an NVMe device (PCIe Gen3 x4) on a small FPGA, while a larger device, integrated on a host bridge adapter and instantiating multiple decompressors, can keep pace with the full OpenCAPI 3.0 bandwidth of 25 GB/s.",

keywords = "Acceleration, Decompression, FPGA, Snappy",

author = "Jian Fang and Jianyu Chen and Jinho Lee and Zaid Al-Ars and Hofstee, {H. Peter}",

year = "2019",

doi = "10.1109/ASAP.2019.00017",

language = "English",

isbn = "978-1-7281-1602-0",

series = "2019 IEEE 30TH INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS (ASAP 2019)",

publisher = "IEEE",

pages = "272--280",

booktitle = "2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP)",

address = "United States",

note = "30th IEEE International Conference on Application-Specific Systems, Architectures and Processors, ASAP 2019 ; Conference date: 15-07-2019 Through 17-07-2019",

}

Fang, J, Chen, J, Lee, J, Al-Ars, Z & Hofstee, HP 2019, Refine and recycle: A method to increase decompression parallelism. in 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP): Proceedings, 8825015, 2019 IEEE 30TH INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS (ASAP 2019), IEEE, pp. 272-280, 30th IEEE International Conference on Application-Specific Systems, Architectures and Processors, ASAP 2019, New York, United States, 15/07/19. https://doi.org/10.1109/ASAP.2019.00017

Refine and recycle: A method to increase decompression parallelism. / Fang, Jian; Chen, Jianyu; Lee, Jinho et al.
2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP): Proceedings. IEEE, 2019. p. 272-280 8825015 (2019 IEEE 30TH INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS (ASAP 2019)).

TY - GEN

T1 - Refine and recycle

T2 - 30th IEEE International Conference on Application-Specific Systems, Architectures and Processors, ASAP 2019

AU - Fang, Jian

AU - Chen, Jianyu

AU - Lee, Jinho

AU - Al-Ars, Zaid

AU - Hofstee, H. Peter

PY - 2019

Y1 - 2019

AB - Rapid increases in storage bandwidth, combined with a desire for operating on large datasets interactively, drive the need for improvements in high-bandwidth decompression. Existing designs either process only one token per cycle or process multiple tokens per cycle with low area efficiency and/or low clock frequency. We propose two techniques to achieve high single-decoder throughput at improved efficiency by keeping only a single copy of the history data across multiple BRAMs and operating on each BRAM independently. A first stage efficiently refines the tokens into commands that operate on a single BRAM and steers the commands to the appropriate one. In the second stage, a relaxed execution model is used where each BRAM command executes immediately and those with invalid data are recycled to avoid stalls caused by the read-after-write dependency. We apply these techniques to Snappy decompression and implement a Snappy decompression accelerator on a CAPI2-attached FPGA platform equipped with a Xilinx VU3P FPGA. Experimental results show that our proposed method achieves up to 7.2 GB/s output throughput per decompressor, with each decompressor using 14.2% of the logic and 7% of the BRAM resources of the device. Therefore, a single decompressor can easily keep pace with an NVMe device (PCIe Gen3 x4) on a small FPGA, while a larger device, integrated on a host bridge adapter and instantiating multiple decompressors, can keep pace with the full OpenCAPI 3.0 bandwidth of 25 GB/s.

KW - Acceleration

KW - Decompression

KW - FPGA

KW - Snappy

UR - http://www.scopus.com/inward/record.url?scp=85072601520&partnerID=8YFLogxK

U2 - 10.1109/ASAP.2019.00017

DO - 10.1109/ASAP.2019.00017

M3 - Conference contribution

AN - SCOPUS:85072601520

SN - 978-1-7281-1602-0

T3 - 2019 IEEE 30TH INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS (ASAP 2019)

SP - 272

EP - 280

BT - 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

PB - IEEE

Y2 - 15 July 2019 through 17 July 2019

ER -

Fang J, Chen J, Lee J, Al-Ars Z, Hofstee HP. Refine and recycle: A method to increase decompression parallelism. In 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP): Proceedings. IEEE. 2019. p. 272-280. 8825015. (2019 IEEE 30TH INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS (ASAP 2019)). doi: 10.1109/ASAP.2019.00017
