PaScaL_TDMA 2.1: A register-resident multi-GPU tridiagonal matrix solver with optimized communication for large-scale CFD simulations

Published: 20 March 2026 | Version 3 | DOI: 10.17632/49z6fh94z3.3
Contributors:

Description

We present PaScaL_TDMA 2.1, a GPU-oriented release of the PaScaL_TDMA library [3] for efficiently solving large batches of distributed tridiagonal systems on modern multi-GPU platforms. Building on the original CPU-based PaScaL_TDMA formulation and the shared-memory buffering strategy introduced in PaScaL_TDMA 2.0 [2], version 2.1 reformulates the core kernels and communication path to better match the GPU execution model. CUDA threads are mapped to contiguous tridiagonal lines to achieve coalesced global-memory access, and the elimination kernels are optimized to a fully register-resident implementation to reduce memory traffic and synchronization. To lower inter-GPU overhead, the reduced-system assembly is performed via a single consolidated MPI_Alltoall exchange, and the kernel interface is restructured to eliminate descriptor transfers at launch. Benchmarks on the NURION system show that PaScaL_TDMA 2.1 reduces wall time from 0.127 s on dual-socket Intel Skylake CPUs to 9.2 ms on an NVIDIA A100 and 6.1 ms on an H100, corresponding to speedups of 14.0× and 20.7×, respectively. Strong- and weak-scaling studies quantify the performance gains from each optimization stage and demonstrate sustained scalability on multi-GPU systems. Finally, PaScaL_TDMA 2.1 is integrated into an immersed-boundary LES solver and validated through large-scale CFD simulations, including an industrial-scale cleanroom configuration with up to 128 A100 GPUs and O(10^10) degrees of freedom.

Files

Categories

High Performance Computing, Computational Fluid Dynamics, Computational Physics

Licence