PaScaL_TDMA 2.1: A register-resident multi-GPU tridiagonal matrix solver with optimized communication for large-scale CFD simulations
Description
We present PaScaL_TDMA 2.1, a GPU-oriented release of the PaScaL_TDMA library [3] for efficiently solving large batches of distributed tridiagonal systems on modern multi-GPU platforms. Building on the original CPU-based PaScaL_TDMA formulation and the shared-memory buffering strategy introduced in PaScaL_TDMA 2.0 [2], version 2.1 reformulates the core kernels and communication path to better match the GPU execution model. CUDA threads are mapped to contiguous tridiagonal lines to achieve coalesced global-memory access, and the elimination kernels are optimized into a fully register-resident implementation to reduce memory traffic and synchronization. To lower inter-GPU overhead, the reduced-system assembly is performed via a single consolidated MPI_Alltoall exchange, and the kernel interface is restructured to eliminate descriptor transfers at launch. Benchmarks show that PaScaL_TDMA 2.1 reduces the wall time from 0.127 s on a dual-socket Intel Skylake CPU node of the NURION system to 9.2 ms on an NVIDIA A100 GPU and 6.1 ms on an H100 GPU, corresponding to speedups of 14.0× and 20.7×, respectively. Strong- and weak-scaling studies quantify the performance gains from each optimization stage and demonstrate sustained scalability on multi-GPU systems. Finally, PaScaL_TDMA 2.1 is integrated into an immersed-boundary LES solver and validated through large-scale CFD simulations, including an industrial-scale cleanroom configuration with up to 128 A100 GPUs and O(10^10) degrees of freedom.