Abstract
In this paper, we develop a high-performance GPU kernel for one of the most popular dense linear algebra operations, matrix-vector multiplication. The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture), which is designed from the ground up for scientific computing. We show that achieving high performance for dense matrix-vector multiplication is essentially a matter of fully utilizing the fine-grained parallelism of the many-core GPU. We also show that auto-tuning can be successfully applied to the GPU kernel so that it performs well for all matrix shapes and sizes.
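To make the fine-grained parallelization concrete, below is a minimal CUDA sketch of the baseline strategy such a kernel builds on: one thread per output element, with the thread-block size exposed as a tunable parameter. This is an illustration, not the paper's actual kernel; the names `sgemv_naive` and `sgemv`, the `BLOCK_SIZE` value of 128, and the row-major layout are all assumptions made for the example.

```cuda
#include <cuda_runtime.h>

// Tunable launch parameter: threads per block. An auto-tuner would
// search over values like this; 128 is an arbitrary placeholder.
#define BLOCK_SIZE 128

// Minimal single-precision GEMV sketch: y = A * x, with A stored
// row-major as an m-by-n matrix. Each thread computes one element
// of y, i.e. the dot product of one row of A with x.
__global__ void sgemv_naive(int m, int n,
                            const float* __restrict__ A,
                            const float* __restrict__ x,
                            float* __restrict__ y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m) {
        float acc = 0.0f;
        for (int col = 0; col < n; ++col)
            acc += A[(size_t)row * n + col] * x[col];
        y[row] = acc;
    }
}

// Host-side launch wrapper: the grid size is derived from the
// tunable block size; all pointers are device pointers.
void sgemv(int m, int n, const float* dA, const float* dx, float* dy)
{
    int grid = (m + BLOCK_SIZE - 1) / BLOCK_SIZE;
    sgemv_naive<<<grid, BLOCK_SIZE>>>(m, n, dA, dx, dy);
}
```

A one-thread-per-row mapping like this keeps the GPU busy for tall matrices but underutilizes it for short, wide ones; tuning parameters such as the block size and the number of threads cooperating on each row is what allows a kernel to perform well across all matrix shapes, which is the point made in the abstract.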
Original language | English |
---|---|
Title of host publication | Euro-Par 2011 |
Publisher | Springer |
Publication date | 2012 |
Pages | 377-386 |
ISBN (Print) | 978-3-642-29736-6 |
ISBN (Electronic) | 978-3-642-29737-3 |
Publication status | Published - 2012 |
Event | Euro-Par 2011, Bordeaux II University, Bordeaux, France. Duration: 29 Aug 2011 → 2 Sept 2011. http://europar2011.bordeaux.inria.fr/ |
Conference
Conference | Euro-Par 2011 |
---|---|
Location | Bordeaux II University |
Country/Territory | France |
City | Bordeaux |
Period | 29/08/2011 → 02/09/2011 |
Internet address | http://europar2011.bordeaux.inria.fr/ |
Series | Lecture Notes in Computer Science |
---|---|
Number | 7155 |
ISSN | 0302-9743 |
Keywords
- GPU
- Matrix-vector multiplication
- Dense linear algebra