pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables

Joao Dinis Ferreira; Gabriel Falcao; Juan Gomez-Luna; Mohammed Alser; Lois Orosa; Mohammad Sadrosadati; Jeremie S. Kim; Geraldo F. Oliveira; Taha Shahroodi; et al.

doi:10.1109/MICRO56248.2022.00067

Computer Engineering

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

13 Citations (Scopus)

35 Downloads (Pure)

Abstract

Data movement between the main memory and the processor is a key contributor to execution time and energy consumption in memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM). One category of PiM is Processing-using-Memory (PuM), in which computation takes place inside the memory array by exploiting intrinsic analog properties of the memory device. PuM yields high performance and energy efficiency, but existing PuM techniques support a limited range of operations. As a result, current PuM architectures cannot efficiently perform some complex operations (e.g., multiplication, division, exponentiation) without large increases in chip area and design complexity. To overcome these limitations of existing PuM architectures, we introduce pLUTo (processing-using-memory with lookup table (LUT) operations), a DRAM-based PuM architecture that leverages the high storage density of DRAM to enable the massively parallel storing and querying of LUTs. The key idea of pLUTo is to replace complex operations with low-cost, bulk memory reads (i.e., LUT queries) instead of relying on complex extra logic. We evaluate pLUTo across 11 real-world workloads that showcase the limitations of prior PuM approaches and show that our solution outperforms optimized CPU and GPU baselines by an average of 713× and 1.2×, respectively, while simultaneously reducing energy consumption by an average of 1855× and 39.5×. Across these workloads, pLUTo outperforms state-of-the-art PiM architectures by an average of 18.3×. We also show that different versions of pLUTo provide different levels of flexibility and performance at different additional DRAM area overheads (between 10.2% and 23.1%). pLUTo's source code and all scripts required to reproduce the results of this paper are openly and fully available at https://github.com/CMU-SAFARI/pLUTo.
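As a rough illustration of the key idea (replacing a complex operation with a bulk LUT query), the Python sketch below precomputes 4-bit by 4-bit multiplication into a 256-entry table and then answers a whole vector of queries by sweeping once over the table. The function names (`build_lut`, `pack`, `lut_query_sweep`), the 4-bit operand width, and the sweep model itself are illustrative assumptions, not pLUTo's actual interface or circuit design. In the sweep, the inner loop stands in for comparisons that in-memory hardware would perform across a DRAM row in parallel.

```python
from typing import Callable, List

BITS = 4  # hypothetical operand width; the LUT holds 2**(2*BITS) entries


def build_lut(op: Callable[[int, int], int], bits: int = BITS) -> List[int]:
    """Precompute op(a, b) for every pair of `bits`-wide operands.

    Entry (a << bits) | b holds op(a, b), so the table has 2**(2*bits) entries.
    """
    size = 1 << bits
    return [op(a, b) for a in range(size) for b in range(size)]


def pack(a: List[int], b: List[int], bits: int = BITS) -> List[int]:
    """Concatenate each operand pair into a single LUT index."""
    mask = (1 << bits) - 1
    return [((x & mask) << bits) | (y & mask) for x, y in zip(a, b)]


def lut_query_sweep(lut: List[int], indices: List[int]) -> List[int]:
    """Hypothetical model of a bulk LUT query as one sweep over the table.

    For each LUT entry i, every element whose index equals i receives lut[i]
    in the same step. The inner loop is sequential here, but it models a
    comparison that hardware would apply to all elements of a row at once.
    """
    out = [0] * len(indices)
    for i, value in enumerate(lut):        # one step per LUT entry
        for j, idx in enumerate(indices):  # "parallel" across the row
            if idx == i:
                out[j] = value
    return out


# Usage: multiply four operand pairs without a multiplier in the query path.
lut = build_lut(lambda x, y: x * y)
a, b = [3, 7, 15, 0], [5, 9, 15, 12]
products = lut_query_sweep(lut, pack(a, b))
print(products)  # [15, 63, 225, 0]
```

Under this simplified model, the number of sweep steps equals the number of LUT entries (256 here) no matter how many elements are queried, which suggests why a wide DRAM row can amortize the cost; the corresponding limitation is that the table grows exponentially with operand width.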

Original language	English
Title of host publication	Proceedings - 2022 55th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2022
Publisher	IEEE
Pages	900-919
Number of pages	20
ISBN (Electronic)	978-1-6654-6272-3
DOIs	https://doi.org/10.1109/MICRO56248.2022.00067
Publication status	Published - 2022
Event	55th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2022, Chicago, United States (1 Oct 2022 → 5 Oct 2022)

Publication series

Name	Proceedings of the Annual International Symposium on Microarchitecture, MICRO
Volume	2022-October
ISSN (Print)	1072-4451

Conference

Conference	55th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2022
Country/Territory	United States
City	Chicago
Period	1/10/22 → 5/10/22

Bibliographical note

Green Open Access added to the TU Delft Institutional Repository as part of the 'You share, we take care!' Taverne project (https://www.openaccess.nl/en/you-share-we-take-care). Otherwise, as indicated in the copyright section, the publisher is the copyright holder of this work, and the author uses Dutch legislation to make this work public.

Access to Document

10.1109/MICRO56248.2022.00067

pLUTo_Enabling_Massively_Parallel_Computation_in_DRAM_via_Lookup_Tables (final published version, 1.32 MB)

Cite this

Ferreira, J. D., Falcao, G., Gomez-Luna, J., Alser, M., Orosa, L., Sadrosadati, M., Kim, J. S., Oliveira, G. F., Shahroodi, T., & More Authors (2022). pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables. In Proceedings - 2022 55th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2022 (pp. 900-919). (Proceedings of the Annual International Symposium on Microarchitecture, MICRO; Vol. 2022-October). IEEE. https://doi.org/10.1109/MICRO56248.2022.00067

@inproceedings{f177de1e892c4aa481c780882cff03bb,

title = "pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables",

author = "Ferreira, {Joao Dinis} and Gabriel Falcao and Juan Gomez-Luna and Mohammed Alser and Lois Orosa and Mohammad Sadrosadati and Kim, {Jeremie S.} and Oliveira, {Geraldo F.} and Taha Shahroodi and {More Authors}",

note = "Green Open Access added to TU Delft Institutional Repository {\textquoteleft}You share, we take care!{\textquoteright} – Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.; 55th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2022 ; Conference date: 01-10-2022 Through 05-10-2022",

year = "2022",

doi = "10.1109/MICRO56248.2022.00067",

language = "English",

series = "Proceedings of the Annual International Symposium on Microarchitecture, MICRO",

publisher = "IEEE",

pages = "900--919",

booktitle = "Proceedings - 2022 55th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2022",

address = "United States",

}

TY - GEN

T1 - pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables

T2 - 55th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2022

AU - Ferreira, Joao Dinis

AU - Falcao, Gabriel

AU - Gomez-Luna, Juan

AU - Alser, Mohammed

AU - Orosa, Lois

AU - Sadrosadati, Mohammad

AU - Kim, Jeremie S.

AU - Oliveira, Geraldo F.

AU - Shahroodi, Taha

AU - More Authors

N1 - Green Open Access added to TU Delft Institutional Repository ‘You share, we take care!’ – Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.

PY - 2022

Y1 - 2022

UR - http://www.scopus.com/inward/record.url?scp=85138727413&partnerID=8YFLogxK

U2 - 10.1109/MICRO56248.2022.00067

DO - 10.1109/MICRO56248.2022.00067

M3 - Conference contribution

AN - SCOPUS:85138727413

T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO

SP - 900

EP - 919

BT - Proceedings - 2022 55th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2022

PB - IEEE

Y2 - 1 October 2022 through 5 October 2022

ER -
