### Abstract

In this paper, we consider the automatic performance tuning of dense vector and matrix-vector operations on GPUs. Such operations form the backbone of level 1 and level 2 routines in the Basic Linear Algebra Subprograms (BLAS) library and are therefore of great importance in many scientific applications. As examples, we develop single-precision CUDA kernels for the Euclidean norm (SNRM2) and the matrix-vector multiplication (SGEMV). The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture). We show that auto-tuning can be successfully applied to achieve high performance for dense vector and matrix-vector operations by appropriately utilizing the fine-grained parallelism of the GPU. Our tuned kernels display 25-100% better performance than the current CUBLAS 3.2 library.

Original language | English |
---|---|
Title of host publication | Parallel Processing and Applied Mathematics: 9th International Conference, PPAM 2011 |
Editors | Roman Wyrzykowski, Jack Dongarra, Konrad Karczewski, Jerzy Wasniewski |
Publisher | Springer |
Publication date | 2012 |
Pages | 619-629 |
DOIs | https://doi.org/10.1007/978-3-642-31464-3_63 |
Publication status | Published - 2012 |
Event | Parallel Processing and Applied Mathematics. 9th International Conference, PPAM 2011, Torun, Poland. Duration: 11 Sep 2011 → 14 Sep 2011. http://ppam.pl/ |

### Conference

Conference | Parallel Processing and Applied Mathematics. 9th International Conference, PPAM 2011 |
---|---|
Country | Poland |
City | Torun |
Period | 11/09/2011 → 14/09/2011 |
Internet address | http://ppam.pl/ |

Series | Lecture Notes in Computer Science |
---|---|
Volume | 7203 |
ISSN | 0302-9743 |

### Keywords

- GPU
- BLAS
- Dense linear algebra
- Parallel algorithms

## Cite this

Sørensen, H. H. B. (2012). Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs. In R. Wyrzykowski, J. Dongarra, K. Karczewski, & J. Wasniewski (Eds.), *Parallel Processing and Applied Mathematics: 9th International Conference, PPAM 2011* (pp. 619-629). Springer. Lecture Notes in Computer Science, Vol. 7203. https://doi.org/10.1007/978-3-642-31464-3_63