PaScaL_TDMA 2.1: A register-resident multi-GPU tridiagonal matrix solver with optimized communication for large-scale CFD simulations
Description
We present PaScaL_TDMA 2.1, a GPU-oriented release of the PaScaL_TDMA library [3] for efficiently solving large batches of distributed tridiagonal systems on modern multi-GPU platforms. Building on the original CPU-based PaScaL_TDMA formulation and the shared-memory buffering strategy introduced in PaScaL_TDMA 2.0 [2], version 2.1 reformulates the core kernels and communication path to better match the GPU execution model. CUDA threads are mapped to contiguous tridiagonal lines to achieve coalesced global-memory access, and the elimination kernels are optimized into a fully register-resident implementation to reduce memory traffic and synchronization. To lower inter-GPU overhead, the reduced-system assembly is performed via a single consolidated MPI_Alltoall exchange, and the kernel interface is restructured to eliminate descriptor transfers at launch. Benchmarks show that PaScaL_TDMA 2.1 reduces the wall time from 0.127 s on a dual-socket Intel Skylake CPU node of the NURION system to 9.2 ms on an NVIDIA A100 GPU and 6.1 ms on an H100 GPU, corresponding to speedups of 14.0× and 20.7×, respectively. Strong- and weak-scaling studies quantify the performance gains from each optimization stage and demonstrate sustained scalability on multi-GPU systems. Finally, PaScaL_TDMA 2.1 is integrated into an immersed-boundary LES solver and validated through large-scale CFD simulations, including an industrial-scale cleanroom configuration with up to 128 A100 GPUs and O(10^10) degrees of freedom.