Performance engineering for a tall & skinny matrix multiplication kernels on GPUs

Dominik Ernst*, Georg Hager, Jonas Thies, Gerhard Wellein

*Corresponding author for this work

Research output: Chapter in book/conference proceedings/edited volume > Conference contribution > Scientific > Peer-reviewed

4 Citations (Scopus)

Abstract

General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall & skinny matrices, which are much taller than wide. Nvidia’s current CUBLAS implementation delivers only a fraction of the potential performance (as given by the roofline model) in this case. We describe the challenges and key properties of an implementation that can achieve perfect performance. We further evaluate different approaches of parallelization and thread distribution, and devise a flexible, configurable mapping scheme. A code generation approach enables a simultaneously flexible and specialized implementation with autotuning. This results in perfect performance for a large range of matrix sizes in the domain of interest, and at least 2/3 of maximum performance for the rest on an Nvidia Volta GPGPU.
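The abstract's reference to the roofline model admits a short worked illustration. The sketch below is not taken from the paper; the matrix shapes, the bandwidth figure, and the helper name roofline_gflops are illustrative assumptions. It estimates the memory-bound performance ceiling for a tall & skinny product C = A^T * B, with A of size K x M and B of size K x N and K much larger than M and N: since A and B are streamed once while C is tiny, the double-precision arithmetic intensity is roughly 2*M*N / (8*(M+N)) flop/byte, which puts the ceiling far below the GPU's floating-point peak.

/* Hedged sketch: roofline ceiling for a tall & skinny GEMM C = A^T * B.
 * A is K x M, B is K x N, with K >> M, N. All sizes and the bandwidth
 * value are illustrative assumptions, not figures from the paper. */
#include <stdio.h>

double roofline_gflops(long K, int M, int N, double bw_gbs)
{
    /* Dominant data traffic: streaming A and B once (double precision). */
    double bytes = (double)K * (M + N) * sizeof(double);
    /* Floating-point work: one multiply and one add per (k, m, n) triple. */
    double flops = 2.0 * (double)K * M * N;
    double intensity = flops / bytes;   /* flop per byte                  */
    return intensity * bw_gbs;          /* memory-bound performance limit */
}

int main(void)
{
    /* Example: M = N = 4, K = 100 million, ~900 GB/s (Volta-class HBM2). */
    printf("roofline ceiling: %.1f GFlop/s\n",
           roofline_gflops(100000000L, 4, 4, 900.0));
    return 0;
}

For M = N = 4 this gives about 0.5 flop/byte, i.e. roughly 450 GFlop/s at 900 GB/s, which illustrates why, for such shapes, "perfect performance" means saturating memory bandwidth rather than approaching the double-precision peak of the device.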

Original language: English
Title of host publication: Parallel Processing and Applied Mathematics - 13th International Conference, PPAM 2019, Revised Selected Papers
Editors: Roman Wyrzykowski, Konrad Karczewski, Ewa Deelman, Jack Dongarra
Publisher: Springer Nature
Pages: 505-515
Number of pages: 11
ISBN (Print): 9783030432287
DOIs
Publication status: Published - 2020
Externally published: Yes
Event: 13th International Conference on Parallel Processing and Applied Mathematics, PPAM 2019 - Bialystok, Poland
Duration: 8 Sept 2019 - 11 Sept 2019

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12043 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 13th International Conference on Parallel Processing and Applied Mathematics, PPAM 2019
Country/Territory: Poland
City: Bialystok
Period: 8/09/19 - 11/09/19

Keywords

  • GPU
  • Matrix multiplication
  • Tall & skinny
