Auto‐tuning of level 1 and level 2 BLAS for GPUs

Research output: Contribution to journal › Journal article › Research › peer-review

Abstract

The use of high‐performance libraries for dense linear algebra operations is of great importance in many numerical scientific applications. The most common operations form the backbone of the Basic Linear Algebra Subroutines (BLAS) library. In this paper, we consider the performance and auto‐tuning of level 1 and level 2 BLAS routines on graphics processing units. As examples, we develop single‐precision Compute Unified Device Architecture kernels for three of the most popular operations: the Euclidean norm (SNRM2), the matrix–vector multiplication (SGEMV), and the triangular solve (STRSV). The target hardware is the most recent Nvidia (Santa Clara, CA, USA) Tesla 20‐series (Fermi architecture), which is designed from the ground up for high‐performance computing. We show that achieving high performance for level 1 and level 2 BLAS operations is essentially a matter of fully utilizing the fine‐grained parallelism of the many‐core graphics processing unit. We also show that auto‐tuning can be successfully applied to the kernels for these operations so that they perform well for all input sizes.
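For illustration, the sketch below shows a minimal single‐precision Euclidean norm (SNRM2) kernel in CUDA of the general reduction style the abstract alludes to, with the thread‐block and grid sizes exposed as the kind of parameters an auto‐tuner would sweep. It is not the paper's actual kernel; the function name snrm2_partial and all parameter values are illustrative assumptions.

// Minimal SNRM2 sketch: each block reduces a grid-strided slice of x to one
// partial sum of squares; the final reduction is done on the host.
// NOT the paper's kernel; names and values are illustrative assumptions.
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

template <int BLOCK_SIZE>
__global__ void snrm2_partial(const float* x, int n, float* partial)
{
    __shared__ float sdata[BLOCK_SIZE];
    int tid = threadIdx.x;
    float sum = 0.0f;

    // Grid-stride loop: every thread accumulates many elements of x.
    for (int i = blockIdx.x * BLOCK_SIZE + tid; i < n; i += gridDim.x * BLOCK_SIZE)
        sum += x[i] * x[i];

    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory (BLOCK_SIZE assumed a power of two).
    for (int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];
}

int main()
{
    const int n     = 1 << 20;
    const int block = 256;   // candidate value an auto-tuner would sweep
    const int grid  = 128;   // likewise a tunable launch parameter

    float* h_x = new float[n];
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float *d_x, *d_partial;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_partial, grid * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    snrm2_partial<block><<<grid, block>>>(d_x, n, d_partial);

    // Final reduction of the per-block partial sums on the host.
    float* h_partial = new float[grid];
    cudaMemcpy(h_partial, d_partial, grid * sizeof(float), cudaMemcpyDeviceToHost);
    float sum = 0.0f;
    for (int i = 0; i < grid; ++i) sum += h_partial[i];
    printf("||x||_2 = %f (expected %f)\n", sqrtf(sum), sqrtf((float)n));

    cudaFree(d_x);
    cudaFree(d_partial);
    delete[] h_x;
    delete[] h_partial;
    return 0;
}

In the setting the abstract describes, an auto‐tuner would time such a kernel over a range of block and grid sizes (and other implementation variants) for each input size and select the fastest configuration.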
Original language: English
Journal: Concurrency and Computation: Practice & Experience
Volume: 25
Issue number: 8
Pages (from-to): 1183-1198
ISSN: 1532-0626
DOIs: 10.1002/cpe.2916
Publication status: Published - 2013

Keywords

  • GPU
  • BLAS
  • Dense linear algebra
  • Parallel algorithms

Cite this

@article{3455ab90df1c41f784ecba357156d5a9,
title = "Auto‐tuning of level 1 and level 2 BLAS for GPUs",
abstract = "The use of high‐performance libraries for dense linear algebra operations is of great importance in many numerical scientific applications. The most common operations form the backbone of the Basic Linear Algebra Subroutines (BLAS) library. In this paper, we consider the performance and auto‐tuning of level 1 and level 2 BLAS routines on graphical processing units. As examples, we develop single‐precision Compute Unified Device Architecture kernels for three of the most popular operations, the Euclidian norm (SNRM2), the matrix–vector multiplication (SGEMV), and the triangular solution (STRSV). The target hardware is the most recent Nvidia (Santa Clara, CA, USA) Tesla 20‐series (Fermi architecture), which is designed from the ground up for high‐performance computing. We show that it is essentially a matter of fully utilizing the fine‐grained parallelism of the many‐core graphical processing unit to achieve high performance for level 1 and level 2 BLAS operations. We show that auto‐tuning can be successfully employed to kernels for these operations so that they perform well for all input sizes.",
keywords = "GPU, BLAS, Dense linear algebra, Parallel algorithms",
author = "S{\o}rensen, {Hans Henrik Brandenborg}",
year = "2013",
doi = "10.1002/cpe.2916",
language = "English",
volume = "25",
pages = "1183--1198",
journal = "Concurrency and Computation: Practice & Experience",
issn = "1532-0626",
publisher = "John Wiley & Sons Ltd",
number = "8",

}
