
Multi-GPU implementation of a time-explicit finite volume solver using CUDA and a CUDA-Aware version of OpenMPI with application to shallow water flows

Bibliographic Details
Published in: Computer Physics Communications 2022-02, Vol. 271, p. 108190, Article 108190
Main Authors: Delmas, Vincent; Soulaïmani, Azzedine
Format: Article
Language: English
Description
Summary: This paper shows the development of a multi-GPU version of a time-explicit finite volume solver for the Shallow-Water Equations (SWE) on a multi-GPU architecture. MPI is combined with CUDA-Fortran in order to use as many GPUs as needed and the METIS library is leveraged to perform a domain decomposition on the 2D unstructured triangular meshes of interest. A CUDA-Aware version of OpenMPI is adopted to speed up the messages between the MPI processes. A study of both speed-up and efficiency is conducted; first, for a classic dam-break flow in a canal, and then for two real domains with complex bathymetries. In both cases, meshes with up to 12 million cells are used. Using 24 to 28 GPUs on these meshes leads to an efficiency of 80% and more. Finally, the multi-GPU version is compared to the pure MPI multi-CPU version, and it is concluded that in this particular case, about 100 CPU cores would be needed to achieve the same performance as one GPU. The developed methodology is applicable for general time-explicit Riemann solvers for conservation laws.

Highlights:
• Multi-GPU version of a finite volume solver for the Shallow-Water Equations using CUDA and a CUDA-Aware version of OpenMPI.
• Domain decomposition of 2D unstructured meshes using METIS with a specific renumbering for efficient memory exchange.
• Achievement of a 21x speed-up when using 32 GPUs compared to utilizing a single GPU.
• Comparison of the Multi-GPU and Multi-CPU versions of our in-house code shows that 8 GPUs perform as well as 1024 CPU cores.
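
The speed-up and efficiency figures quoted above can be related with a short calculation. The sketch below assumes the usual strong-scaling definition (efficiency = speed-up divided by number of devices); the record does not state which definition the paper uses, and the function names are hypothetical. Only the numbers quoted in this record are used.

```python
def parallel_efficiency(speedup: float, n_devices: int) -> float:
    """Strong-scaling efficiency: measured speed-up over the ideal (linear) speed-up.

    Assumption: this is the standard textbook definition, not necessarily
    the exact one used in the paper.
    """
    if n_devices <= 0:
        raise ValueError("n_devices must be positive")
    return speedup / n_devices


def cpu_cores_per_gpu(cpu_cores: int, gpus: int) -> float:
    """Number of CPU cores matching the performance of one GPU."""
    if gpus <= 0:
        raise ValueError("gpus must be positive")
    return cpu_cores / gpus


# Figures quoted in the record: 21x speed-up on 32 GPUs,
# and 8 GPUs performing as well as 1024 CPU cores.
eff_32 = parallel_efficiency(21.0, 32)
ratio = cpu_cores_per_gpu(1024, 8)

print(f"Efficiency at 32 GPUs: {eff_32:.1%}")
print(f"CPU cores per GPU: {ratio:.0f}")
```

Under this definition, the 21x speed-up on 32 GPUs corresponds to an efficiency of roughly 66%, lower than the 80% reported for 24 to 28 GPUs, which is the expected trend as more devices share a fixed-size mesh. The 1024-core comparison works out to 128 cores per GPU, the same order of magnitude as the "about 100 CPU cores" stated in the summary.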
ISSN: 0010-4655, 1879-2944
DOI: 10.1016/j.cpc.2021.108190