## Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs

Publication: Research - peer-review › Article in proceedings – Annual report year: 2012

### Standard

**Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs.** / Sørensen, Hans Henrik Brandenborg.

Publication: Research - peer-review › Article in proceedings – Annual report year: 2012

### Harvard

Sørensen, HHB 2012, 'Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs'. in *Parallel Processing and Applied Mathematics: 9th International Conference, PPAM 2011.* Springer, pp. 619-629. (Lecture Notes in Computer Science, vol. 7203). DOI: 10.1007/978-3-642-31464-3_63

### APA

Sørensen, H. H. B. (2012). Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs. In *Parallel Processing and Applied Mathematics: 9th International Conference, PPAM 2011* (pp. 619-629). Springer. (Lecture Notes in Computer Science, Vol. 7203). DOI: 10.1007/978-3-642-31464-3_63

### MLA

Sørensen, Hans Henrik Brandenborg. "Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs." *Parallel Processing and Applied Mathematics: 9th International Conference, PPAM 2011.* Springer, 2012. 619-629. (Lecture Notes in Computer Science, Volume 7203). DOI: 10.1007/978-3-642-31464-3_63


### RIS

TY - GEN

T1 - Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs

AU - Sørensen, Hans Henrik Brandenborg

PY - 2012

Y1 - 2012

N2 - In this paper, we consider the automatic performance tuning of dense vector and matrix-vector operations on GPUs. Such operations form the backbone of level 1 and level 2 routines in the Basic Linear Algebra Subprograms (BLAS) library and are therefore of great importance in many scientific applications. As examples, we develop single-precision CUDA kernels for the Euclidean norm (SNRM2) and the matrix-vector multiplication (SGEMV). The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture). We show that auto-tuning can be successfully applied to achieve high performance for dense vector and matrix-vector operations by appropriately utilizing the fine-grained parallelism of the GPU. Our tuned kernels display 25-100% better performance than the current CUBLAS 3.2 library.

AB - In this paper, we consider the automatic performance tuning of dense vector and matrix-vector operations on GPUs. Such operations form the backbone of level 1 and level 2 routines in the Basic Linear Algebra Subprograms (BLAS) library and are therefore of great importance in many scientific applications. As examples, we develop single-precision CUDA kernels for the Euclidean norm (SNRM2) and the matrix-vector multiplication (SGEMV). The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture). We show that auto-tuning can be successfully applied to achieve high performance for dense vector and matrix-vector operations by appropriately utilizing the fine-grained parallelism of the GPU. Our tuned kernels display 25-100% better performance than the current CUBLAS 3.2 library.

KW - GPU

KW - BLAS

KW - Dense linear algebra

KW - Parallel algorithms

U2 - 10.1007/978-3-642-31464-3_63

DO - 10.1007/978-3-642-31464-3_63

M3 - Article in proceedings

SP - 619

EP - 629

BT - Parallel Processing and Applied Mathematics

PB - Springer

ER -
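
For readers unfamiliar with the two BLAS routines named in the abstract: SNRM2 computes the Euclidean norm of a single-precision vector, and SGEMV computes a dense matrix-vector product of the form y := alpha*A*x + beta*y. A minimal pure-Python sketch of their reference semantics follows; this only illustrates what the routines compute, not the tuned CUDA kernels developed in the paper.

```python
import math

def snrm2(x):
    """BLAS SNRM2 semantics: Euclidean norm sqrt(sum(x_i**2)) of vector x."""
    return math.sqrt(sum(v * v for v in x))

def sgemv(alpha, a, x, beta, y):
    """BLAS SGEMV semantics (no transpose): y := alpha*A@x + beta*y,
    with the matrix A given as a list of rows."""
    return [alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
            for row, yi in zip(a, y)]

# Small usage example.
a = [[1.0, 2.0],
     [3.0, 4.0]]
x = [1.0, 1.0]
y = [10.0, 10.0]
print(snrm2(x))                   # sqrt(2) ~ 1.4142
print(sgemv(2.0, a, x, 0.5, y))   # [11.0, 19.0]
```

On a GPU, the paper's auto-tuner searches over kernel launch parameters (e.g. thread-block shape and per-thread workload) for the fastest variant of these operations; the sketch above is only the mathematical specification each variant must satisfy.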