Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs
Publication: Research - peer-review › Article in proceedings – Annual report year: 2012
In this paper, we consider the automatic performance tuning of dense vector and matrix-vector operations on GPUs. Such operations form the backbone of level 1 and level 2 routines in the Basic Linear Algebra Subprograms (BLAS) library and are therefore of great importance in many scientific applications. As examples, we develop single-precision CUDA kernels for the Euclidean norm (SNRM2) and the matrix-vector multiplication (SGEMV). The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture). We show that auto-tuning can be successfully applied to achieve high performance for dense vector and matrix-vector operations by appropriately utilizing the fine-grained parallelism of the GPU. Our tuned kernels perform 25-100% better than the current CUBLAS 3.2 library.
| Original language | English |
|---|---|
| Title | Parallel Processing and Applied Mathematics : 9th International Conference, PPAM 2011 |
| Editors | Roman Wyrzykowski, Jack Dongarra, Konrad Karczewski, Jerzy Wasniewski |
| Publisher | Springer |
| Publication date | 2012 |
| Pages | 619-629 |
| DOIs | |
| State | Published |
Conference
| Conference | Parallel Processing and Applied Mathematics. 9th International Conference, PPAM 2011 |
|---|---|
| Country | Poland |
| City | Torun |
| Period | 11 Sep 2011 → 14 Sep 2011 |
| Internet address | http://ppam.pl/ |
Series
| Name | Lecture Notes in Computer Science |
|---|---|
| Volume | 7203 |
| ISSN (Print) | 0302-9743 |
Keywords
- GPU, BLAS, Dense linear algebra, Parallel algorithms
ID: 10197598