Abstract
The main objective of the present study has been to investigate parallel numerical algorithms that run efficiently and scalably on modern many-core heterogeneous hardware. To obtain good efficiency and scalability on modern multi- and many-core architectures, algorithms and data structures must be designed to utilize the underlying parallel architecture. The architectural changes in hardware design within the last decade, from single-core to multi- and many-core architectures, require software developers to identify and properly implement methods that both exploit concurrency and maintain numerical efficiency.
Graphics Processing Units (GPUs) have proven to be very effective units for computing the solution of scientific problems described by partial differential equations (PDEs). GPUs have today become standard devices in portable computers, desktops, and supercomputers, which makes parallel software design both relevant and challenging for scientific software developers at all levels. We have developed a generic C++ library for fast prototyping of large-scale PDE solvers based on flexible-order finite difference approximations on structured regular grids. The library is designed with a high-abstraction interface to improve developer productivity. The library is based on modern template-based design concepts as described in Glimberg, Engsig-Karup, Nielsen & Dammann (2013). The library utilizes heterogeneous CPU/GPU environments in order to maximize computational throughput by favoring data locality and low-storage algorithms, which are becoming more and more important as the number of concurrent cores per processor increases.
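As an illustration of what "flexible-order" finite differences mean in practice, the sketch below computes stencil weights for an arbitrary set of grid offsets and derivative order by requiring exactness for monomials. It is not taken from the library; the function name `fd_weights` and the dense exact-arithmetic solve are assumptions chosen for clarity, and the weights are for unit grid spacing (divide by h to the power of the derivative order for spacing h).

```python
from fractions import Fraction
from math import factorial


def fd_weights(offsets, deriv):
    """Finite difference weights on unit spacing (hypothetical helper).

    Finds w such that sum_j w[j] * f(offsets[j]) approximates the
    `deriv`-th derivative of f at 0, exact for polynomials of degree
    < len(offsets). Offsets must be distinct.
    """
    n = len(offsets)
    # Row p enforces exactness for x**p: sum_j w_j * s_j**p = (d/dx)^deriv x**p at 0.
    A = [[Fraction(s) ** p for s in offsets] for p in range(n)]
    b = [Fraction(factorial(deriv)) if p == deriv else Fraction(0)
         for p in range(n)]
    # Gauss-Jordan elimination in exact rational arithmetic.
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col] / A[col][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
                b[r] -= factor * b[col]
    return [b[i] / A[i][i] for i in range(n)]
```

Widening the list of offsets raises the formal order of accuracy, and one-sided offset lists give stencils usable next to domain boundaries.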
We demonstrate in a proof-of-concept the advantages of the library by assembling a generic nonlinear free surface water wave solver based on unified potential flow theory, for fast simulation of large-scale phenomena, such as long-distance wave propagation over varying depths or within large coastal regions. Such simulations are valuable within maritime engineering because of the adjustable properties that follow from the flexible-order implementation. We extend the novel work on an efficient and robust iterative parallel solution strategy proposed by Engsig-Karup, Madsen & Glimberg (2011), for the bottleneck problem of solving a σ-transformed Laplace problem in three dimensions at every time integration step. A geometric multigrid preconditioned defect correction scheme is used to attain high-order accurate solutions with fast convergence and scalable work effort. To minimize data storage and enhance performance, the numerical method is based on matrix-free finite difference approximations, implemented to run efficiently on many-core GPUs. Also, single-precision calculations are found to be attractive for reducing transfers and enhancing performance for both pure single- and mixed-precision calculations without compromising robustness. A structured multi-block approach is presented that decomposes the problem into several subdomains, supporting flexible block structures to match the physical domain. For data communication across processor nodes, messages are sent using MPI to repeatedly update boundary information between adjacent coupled subdomains. The impact on convergence and performance scalability of the proposed hybrid CUDA-MPI strategy is presented. A survey of the convergence and performance properties of the preconditioned defect correction method is carried out with special focus on large-scale multi-GPU simulations. Results indicate that only a limited number of multigrid restrictions is required, and that this number is strongly coupled to the wave resolution. These results are encouraging for heterogeneous multi-GPU systems, as they reduce the communication overhead significantly and prevent both global coarse grid corrections and inefficient processor utilization at the coarsest levels.
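The preconditioned defect correction idea can be sketched on a much smaller model problem. The example below solves -u'' = f on (0, 1) with zero Dirichlet boundaries, applying a fourth-order operator matrix-free and correcting with a second-order operator. Two simplifications are assumptions of this sketch and not the thesis method: the problem is one-dimensional rather than the three-dimensional Laplace problem, and a direct tridiagonal (Thomas) solve stands in for the geometric multigrid preconditioner. All function names are hypothetical.

```python
import math


def apply_high_order(u, h):
    """Matrix-free application of -d2/dx2 with zero Dirichlet boundaries."""
    n = len(u)

    def val(i):
        # Indices -1 and n are the boundary points, where u = 0.
        return u[i] if 0 <= i < n else 0.0

    out = [0.0] * n
    for i in range(n):
        if 1 <= i <= n - 2:
            # Fourth-order stencil (-1, 16, -30, 16, -1) / (12 h^2).
            d2 = (-val(i - 2) + 16 * val(i - 1) - 30 * u[i]
                  + 16 * val(i + 1) - val(i + 2)) / (12 * h * h)
        else:
            # Second-order stencil at the points next to the boundary.
            d2 = (val(i - 1) - 2 * u[i] + val(i + 1)) / (h * h)
        out[i] = -d2
    return out


def solve_low_order(r, h):
    """Thomas algorithm for tridiag(-1, 2, -1) / h^2 (stand-in preconditioner)."""
    n = len(r)
    a, b = -1.0 / (h * h), 2.0 / (h * h)  # off-diagonal, diagonal
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = a / b, r[0] / b
    for i in range(1, n):
        denom = b - a * c[i - 1]
        c[i] = a / denom
        d[i] = (r[i] - a * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def defect_correction(f, h, tol=1e-10, max_iter=50):
    """Iterate u <- u + M^{-1} (f - A u); returns (u, iterations used)."""
    u = [0.0] * len(f)
    f_max = max(abs(v) for v in f)
    for k in range(max_iter):
        Au = apply_high_order(u, h)
        r = [fi - ai for fi, ai in zip(f, Au)]  # high-order defect
        if max(abs(v) for v in r) <= tol * f_max:
            return u, k
        e = solve_low_order(r, h)               # low-order correction
        u = [ui + ei for ui, ei in zip(u, e)]
    return u, max_iter
```

Only the low-order operator is ever inverted, while the high-order operator is applied as a stencil and never stored, which is the property that keeps memory use low. For a right-hand side such as f(x) = π² sin(πx), the converged result should be noticeably closer to sin(πx) than the pure second-order solution on the same grid.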
We find that spatial domain decomposition scales well for large problem sizes, but for problems of limited size, the maximum attainable speedup is reached for a low number of processors, because adding more processors leads to an unfavorable communication-to-computation ratio. To circumvent this, we have considered a recently proposed parallel-in-time algorithm referred to as Parareal, in an attempt to introduce algorithmic concurrency in the time discretization. Parareal may be perceived as a two-level multigrid method in time, where the numerical solution is first sequentially advanced via coarse integration and then updated simultaneously on multiple GPUs in a predictor-corrector fashion. A parameter study is performed to establish proper choices for maximizing speedup and parallel efficiency. The Parareal algorithm is found to be sensitive to a number of numerical and physical parameters, making practical speedup a matter of parameter tuning. Results are presented to confirm that it is possible to attain reasonable speedups, independently of the spatial problem size.
To improve application range, curvilinear grid transformations are introduced to allow representation of complex boundary geometries. The curvilinear transformations increase the complexity of the implementation of the model equations. A number of free surface water wave cases have been demonstrated with boundary-fitted geometries, where the combination of a flexible geometry representation and a fast numerical solver can be a valuable engineering tool for large-scale simulation of real maritime scenarios.
The present study touches on some of the many possibilities that modern heterogeneous computing can bring if careful and parallel-aware design decisions are made. Though several free surface examples are outlined, we have yet to demonstrate results from a real large-scale engineering case.
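The predictor-corrector structure of Parareal can be sketched for a scalar test equation y' = λy. The example is a serial illustration under stated assumptions: forward Euler is used for both the coarse and the fine propagator, the fine solves are computed in an ordinary loop where a real implementation would run them concurrently, and the function names are hypothetical.

```python
def euler(y, lam, dt, steps):
    """Advance y' = lam * y by `steps` forward Euler steps of size dt."""
    for _ in range(steps):
        y = y + dt * lam * y
    return y


def parareal(y0, lam, T, n_slices, fine_steps, iterations):
    """Return the solution at the n_slices + 1 time-slice boundaries."""
    dT = T / n_slices

    def coarse(y):
        return euler(y, lam, dT, 1)  # one cheap step per slice

    def fine(y):
        return euler(y, lam, dT / fine_steps, fine_steps)  # accurate, costly

    # Predictor: sequential coarse sweep over all time slices.
    U = [y0]
    for n in range(n_slices):
        U.append(coarse(U[n]))

    for _ in range(iterations):
        # Fine solves depend only on the previous iterate, so each slice
        # is independent (this is the part that could run in parallel).
        F = [fine(U[n]) for n in range(n_slices)]
        G_old = [coarse(U[n]) for n in range(n_slices)]
        # Corrector: sequential coarse sweep with the fine-coarse defect.
        new = [y0]
        for n in range(n_slices):
            new.append(coarse(new[n]) + F[n] - G_old[n])
        U = new
    return U


def serial_fine(y0, lam, T, n_slices, fine_steps):
    """Reference: the fine propagator applied sequentially."""
    dT = T / n_slices
    U = [y0]
    for n in range(n_slices):
        U.append(euler(U[n], lam, dT / fine_steps, fine_steps))
    return U
```

After as many iterations as there are time slices, the result reproduces the sequential fine solution up to rounding, so any speedup depends on reaching acceptable accuracy in far fewer iterations. That dependence is one reason the choice of coarse propagator and slice count matters.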
Original language  English 

Place of Publication  Kgs. Lyngby 

Publisher  Technical University of Denmark 
Number of pages  153 
Publication status  Published  2013 
Series  DTU Compute PHD-2013

Number  317 
ISSN  0909-3192
Fingerprint Dive into the research topics of 'Designing Scientific Software for Heterogeneous Computing: With application to large-scale water wave simulations'. Together they form a unique fingerprint.
Projects
 1 Finished

Scientific GPU Computing for PDE Solvers
Glimberg, S. L., Engsig-Karup, A. P., Dammann, B., Walther, J. H., Cai, X. & Olson, L.
01/05/2010 → 12/12/2013
Project: PhD