Execution time of double-precision and high-precision GEMM implementations on Intel Core i5-7500 and NVIDIA Turing RTX 2080

Published: 19 December 2022 | Version 1 | DOI: 10.17632/5dgdc42x7p.1
Contributor:
Konstantin Isupov

Description

This dataset contains execution times for matrix-matrix multiplication kernels with general matrices (GEMM, BLAS Level 3) implemented using existing double-precision linear algebra software as well as multiple-precision libraries for CPU and GPU. The operation is C = α * op(A) * op(B) + β * C, where α and β are scalars, A, B, C are matrices, op(A) is an M-by-K matrix, op(B) is a K-by-N matrix, C is an M-by-N matrix, and op(X) is either op(X) = X or op(X) = X^T. Each raw file provided contains the results of three test runs in milliseconds. The complete source code for the tests can be found at https://github.com/kisupov/mpres-blas.

Common experiment settings:
• Dense, random, 1000-by-1000 general matrices A, B and C;
• Random scalars α and β;
• Measurements are in milliseconds;
• Arithmetic precision from 106 to 424 bits.

Test cases considered:
• Non-transposed: op(A) = A, op(B) = B;
• Transposed A: op(A) = A^T, op(B) = B;
• Transposed B: op(A) = A, op(B) = B^T;
• Transposed both A and B: op(A) = A^T, op(B) = B^T.

Experimental environment:
• Intel Core i5-7500 processor;
• 32 GB of DDR4 system memory;
• NVIDIA Turing RTX 2080 GPU (2944 CUDA cores, Compute Capability 7.5, 8 GB of GDDR6 memory);
• Ubuntu 20.04.5 LTS;
• NVIDIA Driver V455.32.00;
• CUDA Toolkit V11.1.
The following GEMM implementations are evaluated:
• OpenBLAS (OpenMP, 53 bits) – double-precision implementation for CPU using OpenBLAS (https://github.com/xianyi/OpenBLAS);
• Custom double on CPU (OpenMP, 53 bits) – custom double-precision parallel (OpenMP) implementation;
• MPFR (OpenMP) – multiple-precision parallel implementation for CPU using the GNU MPFR Library (https://www.mpfr.org/);
• cuBLAS (53 bits) – double-precision implementation for CUDA using the NVIDIA Basic Linear Algebra Subroutines library (https://docs.nvidia.com/cuda/cublas/index.html);
• Custom double on GPU (53 bits) – custom double-precision CUDA implementation;
• MPRES-BLAS – multiple-precision CUDA implementation using the MPRES-BLAS library (https://github.com/kisupov/mpres-blas);
• CAMPARY – multiple-precision CUDA implementation using the CAMPARY library (https://homepages.laas.fr/mmjoldes/campary/).

Files

Categories

High Performance Computing, Graphics Processor, Multiple Precision Arithmetic, Computer Arithmetic, Basic Linear Algebra

License