Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs

Publication: Research – peer-reviewed article in proceedings – Annual report year: 2012


In this paper, we consider the automatic performance tuning of dense vector and matrix-vector operations on GPUs. Such operations form the backbone of level 1 and level 2 routines in the Basic Linear Algebra Subroutines (BLAS) library and are therefore of great importance in many scientific applications. As examples, we develop single-precision CUDA kernels for the Euclidean norm (SNRM2) and the matrix-vector multiplication (SGEMV). The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture). We show that auto-tuning can be successfully applied to achieve high performance for dense vector and matrix-vector operations by appropriately utilizing the fine-grained parallelism of the GPU. Our tuned kernels display 25–100% better performance than the current CUBLAS 3.2 library.
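The abstract's SNRM2 example rests on a standard GPU pattern: each thread block computes a partial sum of squares, and the block size becomes a tunable parameter. The following is a minimal, hypothetical CUDA sketch of that pattern, not the authors' actual kernel; `BLOCK_SIZE` is the assumed auto-tuning parameter.

```cuda
#include <cuda_runtime.h>

// Hypothetical illustration of an SNRM2-style partial reduction.
// BLOCK_SIZE (a power of two) is the parameter an auto-tuner would sweep.
template <int BLOCK_SIZE>
__global__ void snrm2_partial(const float *x, int n, float *blockSums)
{
    __shared__ float s[BLOCK_SIZE];
    float local = 0.0f;

    // Grid-stride loop: each thread accumulates squares of its elements.
    for (int i = blockIdx.x * BLOCK_SIZE + threadIdx.x; i < n;
         i += gridDim.x * BLOCK_SIZE)
        local += x[i] * x[i];

    s[threadIdx.x] = local;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = s[0];
}
// Host side: sum the blockSums array and take sqrtf of the total.
// An auto-tuner would benchmark BLOCK_SIZE in, e.g., {64, 128, 256, 512}
// and select the fastest configuration for the target Fermi GPU.
```

The block size trades occupancy against shared-memory reduction depth, which is why the best value is hardware-dependent and worth searching automatically.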
Original language: English
Title: Parallel Processing and Applied Mathematics: 9th International Conference, PPAM 2011
Editors: Roman Wyrzykowski, Jack Dongarra, Konrad Karczewski, Jerzy Wasniewski
Publisher: Springer
Publication date: 2012
Pages: 619-629
DOIs
State: Published

Conference

Conference: Parallel Processing and Applied Mathematics, 9th International Conference, PPAM 2011
Country: Poland
City: Torun
Period: 11/09/11 – 14/09/11
Internet address: http://ppam.pl/
Series name: Lecture Notes in Computer Science
Volume: 7203
ISSN (Print): 0302-9743
Citations (Web of Science®): No match on DOI

Keywords

  • GPU, BLAS, Dense linear algebra, Parallel algorithms

ID: 10197598