Recoil 18: RTX 5090 is slower than RTX 2080 in cublasSgemm?

micjah

Member
Hi all,
I bought this new laptop for machine-learning applications that I wrote myself.
I need to accelerate single-precision (FP32) matrix multiplication, for which I have been using cublasSgemm for many years.
My hope was that the new laptop GPU, an NVIDIA GeForce RTX 5090 mobile, would be much faster than the 6-year-old NVIDIA GeForce RTX 2080 mobile in my current Octane 18" laptop. But the benchmarks show the opposite: my old GPU is equally fast, or even faster!
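For context, the core of what the benchmark measures is essentially one timed cublasSgemm call. Here is a simplified sketch of such a timing (my own minimal version for illustration, not the exact code from the repo below):

Code:
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 16384;                  // square matrices, like the largest benchmark size
    const float alpha = 1.0f, beta = 0.0f;

    // allocate and zero the device matrices (values do not matter for timing)
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * n * n);
    cudaMalloc(&dB, sizeof(float) * n * n);
    cudaMalloc(&dC, sizeof(float) * n * n);
    cudaMemset(dA, 0, sizeof(float) * n * n);
    cudaMemset(dB, 0, sizeof(float) * n * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // warm-up call so one-time setup costs are not included in the timing
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();

    // time a single FP32 GEMM with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("float32: size %d time: %f s\n", n, ms / 1000.0f);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}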

Code:
mich@recoil:~/Downloads$ git clone https://github.com/hma02/cublasgemm-benchmark
mich@recoil:~/Downloads$ cd cublasgemm-benchmark/
mich@recoil:~/Downloads/cublasgemm-benchmark$ nano run.sh   <-- uncomment line 4
mich@recoil:~/Downloads/cublasgemm-benchmark$ ./run.sh
nvcc gemm.cu -lcublas --std=c++11 -arch=sm_60  -o gemm
INFO: Running test for all 1 GPU deivce(s) on host recoil

==================
INFO: testing GPU0
==================
timestamp, index, name, pcie.link.gen.current, pcie.link.gen.max, pstate, clocks.current.graphics [MHz], clocks.max.graphics [MHz]
2025/05/25 10:00:42.125, 0, NVIDIA GeForce RTX 5090 Laptop GPU, 1, 5, P8, 22 MHz, 3090 MHz
2025/05/25 10:00:47.229, 0, NVIDIA GeForce RTX 5090 Laptop GPU, 5, 5, P0, 2152 MHz, 3090 MHz
2025/05/25 10:00:52.231, 0, NVIDIA GeForce RTX 5090 Laptop GPU, 5, 5, P0, 2152 MHz, 3090 MHz
2025/05/25 10:00:57.233, 0, NVIDIA GeForce RTX 5090 Laptop GPU, 5, 5, P2, 1957 MHz, 3090 MHz

cublasSgemm test result:

running with min_m_k_n: 2 max_m_k_n: 16384 repeats: 2
allocating device variables
float32: size 2 average: 0.0114416 s
float32: size 4 average: 2.1184e-05 s
float32: size 8 average: 7.792e-06 s
float32: size 16 average: 6.56e-06 s
float32: size 32 average: 0.00238846 s
float32: size 64 average: 1.6352e-05 s
float32: size 128 average: 1.4144e-05 s
float32: size 256 average: 1.8224e-05 s
float32: size 512 average: 4.6032e-05 s
float32: size 1024 average: 0.000232144 s
float32: size 2048 average: 0.00174157 s
float32: size 4096 average: 0.0146068 s
float32: size 8192 average: 0.138247 s
float32: size 16384 average: 1.09813 s

Here is the same benchmark on my 6-year-old laptop:

Code:
mich@i9:~/Downloads$ git clone https://github.com/hma02/cublasgemm-benchmark
mich@i9:~/Downloads$ cd cublasgemm-benchmark/
mich@i9:~/Downloads/cublasgemm-benchmark$ nano run.sh  <-- uncomment line 4
mich@i9:~/Downloads/cublasgemm-benchmark$ ./run.sh
nvcc gemm.cu -lcublas --std=c++11 -arch=sm_60  -o gemm
INFO: Running test for all 1 GPU deivce(s) on host i9

==================
INFO: testing GPU0
==================
timestamp, index, name, pcie.link.gen.current, pcie.link.gen.max, pstate, clocks.current.graphics [MHz], clocks.max.graphics [MHz]
2025/05/25 10:01:29.915, 0, NVIDIA GeForce RTX 2080, 1, 3, P8, 300 MHz, 2100 MHz
2025/05/25 10:01:34.919, 0, NVIDIA GeForce RTX 2080, 3, 3, P2, 1380 MHz, 2100 MHz
2025/05/25 10:01:39.920, 0, NVIDIA GeForce RTX 2080, 3, 3, P2, 1380 MHz, 2100 MHz

cublasSgemm test result:

running with min_m_k_n: 2 max_m_k_n: 16384 repeats: 2
allocating device variables
float32: size 2 average: 2.7088e-05 s
float32: size 4 average: 6.72e-06 s
float32: size 8 average: 4.992e-06 s
float32: size 16 average: 9.568e-06 s
float32: size 32 average: 1.0832e-05 s
float32: size 64 average: 7.072e-06 s
float32: size 128 average: 9.808e-06 s
float32: size 256 average: 1.3472e-05 s
float32: size 512 average: 5.0576e-05 s
float32: size 1024 average: 0.000238624 s
float32: size 2048 average: 0.00165302 s
float32: size 4096 average: 0.0131607 s
float32: size 8192 average: 0.12615 s
float32: size 16384 average: 1.02281 s

My own benchmark code and neural-net training show the same result: both GPUs (RTX 5090 and RTX 2080) are equally fast.
How can this be? Based on the specs, the RTX 5090 mobile has a peak of 31 TFLOPS, the RTX 2080 mobile only 9 TFLOPS.
Does anybody here have an explanation, or see the same results?
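If I roughly convert the size 16384 results to throughput (assuming 2 * N^3 FLOPs per SGEMM), I get 2 * 16384^3 / 1.098 s ≈ 8.0 TFLOPS on the RTX 5090 and 2 * 16384^3 / 1.023 s ≈ 8.6 TFLOPS on the RTX 2080, so both land around the 2080's spec and far below the 5090's.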

Best Regards,
Michael
 

jyoustra

New member
> arch=sm_60

Why so old? Try -arch=sm_120a.
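For example, change the nvcc line to something like this (assuming your CUDA toolkit is new enough to know the Blackwell architecture):

Code:
nvcc gemm.cu -lcublas --std=c++11 -arch=sm_120a -o gemm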
Also, FP32 is very de-emphasized on the newer chips, so perhaps some of those paths are less optimized.
 

micjah

Member
Hi jyoustra,
thanks for the tip about the architecture.
I re-ran the benchmark on the new laptop with -arch=sm_120a and now get 0.5 s for the largest size (it was 1.0 s before).

What gave extra speed was inserting this line into gemm.cu right after cublasCreate:
auto cublasStat = cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
With this, the time goes down to:
float32: size 16384 average: 0.18691 s

(This change also speeds up the runtime on my old laptop, to 0.5 s.)
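For anyone who wants to try the same change, this is roughly where the call sits in gemm.cu (a sketch of my patch, not the original repo code; note that CUBLAS_TENSOR_OP_MATH is marked deprecated in newer CUDA versions, so this may need adjusting):

Code:
cublasHandle_t handle;
cublasCreate(&handle);

// allow cuBLAS to pick tensor-core paths for FP32 GEMM;
// as far as I understand, the inputs are then internally rounded to a
// lower-precision format, so the results are no longer bit-exact FP32
auto cublasStat = cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
if (cublasStat != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "cublasSetMathMode failed\n");
}

// ... the existing cublasSgemm calls stay unchanged ...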
 