Loading…

High-order accurate simulation of incompressible turbulent flows on many parallel GPUs of a hybrid-node supercomputer

Turbulent incompressible flows play an important role in a broad range of natural and industrial processes. High-order direct numerical simulations are often used for resolving the spatio-temporal scales of such flows. Such high-fidelity simulations require an extensive computational layout which of...

Full description

Saved in:
Bibliographic Details
Published in:Computer physics communications 2019-11, Vol.244, p.132-142
Main Authors: Zolfaghari, Hadi, Becsek, Barna, Nestola, Maria G.C., Sawyer, William B., Krause, Rolf, Obrist, Dominik
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Turbulent incompressible flows play an important role in a broad range of natural and industrial processes. High-order direct numerical simulations are often used for resolving the spatio-temporal scales of such flows. Such high-fidelity simulations require an extensive computational layout which often results in prohibitive computational costs. Recent advances in modern computing platforms, such as GPU-powered hybrid-node supercomputers, appear to become an enabler for high-fidelity CFD at large scales. In this work, we propose methods for accelerating a distributed-memory high-order incompressible Navier–Stokes solver by using NVIDIA Pascal GPUs of a Cray XC40/50 supercomputer. Arithmetically intensive or chronically invoked routines were ported to the GPUs using CUDA C. Host-side driver routines were developed to invoke CUDA C “external” kernels from the FORTRAN legacy code. Numerical methods, for some of the most intensive operations, namely multigrid preconditioners, were modified to be suited to the SIMD standard for graphics processors. Customized unit testing was performed to ensure double-precision accuracy of GPU computations. The optimization layer maintained the memory structure of the legacy code. Post-profiling confirms that backbone distributed memory communications increase the number of dynamic CPU–GPU memory copies, which offsets a part of the computational performance. Strong scalability of the entire flow solver and of the stand-alone pressure solver has been examined on up to 512 P100 GPUs. Strong scaling efficiency decreased for higher numbers of GPUs, probably due to a less favorable communication-to-computation ratio. Weak scalability of the entire solver was tested on up to 4096 P100 GPUs for two problems of different sizes. The solver maintained nearly ideal weak scalability for the larger problem, illustrating the potential of GPUs in dealing with highly resolved flows. The GPU-enabled solver is finally deployed for the scale-resolving simulation of flow transition in the wake of a solid sphere at Re=3700, by utilizing 192 GPUs. The time-averaged pressure coefficient along the sphere surface was in good agreement with previously reported data acquired from CPU-based direct numerical simulations and experiments.
ISSN:0010-4655
1879-2944
DOI:10.1016/j.cpc.2019.06.012