Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs

Publication: Research - peer-review; Article in proceedings – Annual report year: 2012

Standard

Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs. / Sørensen, Hans Henrik Brandenborg.

Parallel Processing and Applied Mathematics: 9th International Conference, PPAM 2011. ed. / Roman Wyrzykowski; Jack Dongarra; Konrad Karczewski; Jerzy Wasniewski. Springer, 2012. p. 619-629 (Lecture Notes in Computer Science, Vol. 7203).

Harvard

Sørensen, HHB 2012, 'Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs'. in R Wyrzykowski, J Dongarra, K Karczewski & J Wasniewski (eds), Parallel Processing and Applied Mathematics: 9th International Conference, PPAM 2011. Springer, pp. 619-629. Lecture Notes in Computer Science, vol. 7203, 10.1007/978-3-642-31464-3_63

APA

Sørensen, H. H. B. (2012). Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs. In R. Wyrzykowski, J. Dongarra, K. Karczewski, & J. Wasniewski (Eds.), Parallel Processing and Applied Mathematics: 9th International Conference, PPAM 2011. (pp. 619-629). Springer. (Lecture Notes in Computer Science, Vol. 7203). 10.1007/978-3-642-31464-3_63

CBE

Sørensen HHB. 2012. Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs. Wyrzykowski R, Dongarra J, Karczewski K, Wasniewski J, editors. In Parallel Processing and Applied Mathematics: 9th International Conference, PPAM 2011. Springer. pp. 619-629. (Lecture Notes in Computer Science, Vol. 7203). Available from: 10.1007/978-3-642-31464-3_63

MLA

Sørensen, Hans Henrik Brandenborg. "Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs." Parallel Processing and Applied Mathematics: 9th International Conference, PPAM 2011. Ed. Roman Wyrzykowski, Jack Dongarra, Konrad Karczewski, and Jerzy Wasniewski. Springer, 2012. 619-629. (Lecture Notes in Computer Science, Volume 7203). Available: 10.1007/978-3-642-31464-3_63

Vancouver

Sørensen HHB. Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs. In Wyrzykowski R, Dongarra J, Karczewski K, Wasniewski J, editors, Parallel Processing and Applied Mathematics: 9th International Conference, PPAM 2011. Springer. 2012. p. 619-629. (Lecture Notes in Computer Science, Vol. 7203). Available from: 10.1007/978-3-642-31464-3_63

Author

Sørensen, Hans Henrik Brandenborg / Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs.

Parallel Processing and Applied Mathematics: 9th International Conference, PPAM 2011. ed. / Roman Wyrzykowski; Jack Dongarra; Konrad Karczewski; Jerzy Wasniewski. Springer, 2012. p. 619-629 (Lecture Notes in Computer Science, Vol. 7203).

Bibtex

@inproceedings{a2ae08a60eae4bc9b2f98abdaa6d7973,
title = "Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs",
publisher = "Springer",
author = "Sørensen, {Hans Henrik Brandenborg}",
year = "2012",
doi = "10.1007/978-3-642-31464-3_63",
editor = "Roman Wyrzykowski and Jack Dongarra and Konrad Karczewski and Jerzy Wasniewski",
series = "Lecture Notes in Computer Science",
volume = "7203",
pages = "619--629",
booktitle = "Parallel Processing and Applied Mathematics: 9th International Conference, PPAM 2011",
}

RIS

TY - GEN

T1 - Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs

A1 - Sørensen,Hans Henrik Brandenborg

AU - Sørensen,Hans Henrik Brandenborg

PB - Springer

PY - 2012

Y1 - 2012

N2 - In this paper, we consider the automatic performance tuning of dense vector and matrix-vector operations on GPUs. Such operations form the backbone of level 1 and level 2 routines in the Basic Linear Algebra Subprograms (BLAS) library and are therefore of great importance in many scientific applications. As examples, we develop single-precision CUDA kernels for the Euclidean norm (SNRM2) and the matrix-vector multiplication (SGEMV). The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture). We show that auto-tuning can be successfully applied to achieve high performance for dense vector and matrix-vector operations by appropriately utilizing the fine-grained parallelism of the GPU. Our tuned kernels display 25-100% better performance than the current CUBLAS 3.2 library.

KW - GPU

KW - BLAS

KW - Dense linear algebra

KW - Parallel algorithms

U2 - 10.1007/978-3-642-31464-3_63

DO - 10.1007/978-3-642-31464-3_63

BT - Parallel Processing and Applied Mathematics

T2 - Parallel Processing and Applied Mathematics

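A2 - Wyrzykowski,Roman

A2 - Dongarra,Jack

A2 - Karczewski,Konrad

A2 - Wasniewski,Jerzy

ED - Wyrzykowski,Roman

ED - Dongarra,Jack

ED - Karczewski,Konrad

ED - Wasniewski,Jerzy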

T3 - Lecture Notes in Computer Science

LA - en_GB

SP - 619

EP - 629

ER -