The memory copy speed seems to exceed the hardware limit #1862
Closed
Yangyang-Tan
started this conversation in
General
Replies: 1 comment
-
I can't reproduce this here. Try running under Nsight Systems; it displays the memory throughput when you highlight the copy operation.
-
I'm trying to perform a memory copy on an RTX 4090 GPU. The measured bandwidth is about 2 TB/s, which clearly exceeds the 4090's theoretical peak of 1008 GB/s.
To reproduce
The Minimal Working Example (MWE):
The 2 in
T_tot1 = 2 * 1 / 1e9 * nx * ny * sizeof(Float32) / t_it1
comes from the memory read and write.
Output
T_tot1 and T_tot2 give T_tot1=2452 and T_tot2=2281 (GB/s), while T_tot3 gives T_tot3=895 (GB/s).
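As a sanity check on the numbers above, here is a minimal sketch of the bandwidth formula in plain Julia (no GPU needed). The helper names `effective_bandwidth` and `implied_time` are hypothetical and not part of the MWE; the sketch assumes `t_it1` is the time per copy in seconds, which the formula's division by 1e9 suggests but the thread does not state.

```julia
# Effective bandwidth in GB/s for one read plus one write of an nx-by-ny array.
# `t` is the time per copy in seconds (assumed unit; hypothetical helper).
effective_bandwidth(nx, ny, T, t) = 2 * nx * ny * sizeof(T) / 1e9 / t

# Inverse: the time per copy that a reported bandwidth (GB/s) implies.
implied_time(nx, ny, T, bw) = 2 * nx * ny * sizeof(T) / 1e9 / bw

nx = ny = 2^11
bytes_moved = 2 * nx * ny * sizeof(Float32)    # bytes read plus bytes written

t1     = implied_time(nx, ny, Float32, 2452)   # from the reported T_tot1
t_peak = implied_time(nx, ny, Float32, 1008)   # from the quoted theoretical peak

println("bytes moved per copy:      ", bytes_moved)
println("implied time at 2452 GB/s: ", round(t1 * 1e6; digits=1), " µs")
println("time at 1008 GB/s:         ", round(t_peak * 1e6; digits=1), " µs")
```

Under these assumptions the reported figure corresponds to roughly 14 µs per copy, against roughly 33 µs at the quoted peak. Intervals this short are worth cross-checking against a profiler, as the reply suggests.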
Version info
Details on Julia:
Details on CUDA:
Additional context
I also tried the RTX 2080 Ti and the TITAN V and did not see the bandwidth exceed the theoretical peak on either. The behaviour seems to occur only on the RTX 4090 with a 2^11 * 2^11 Float32 matrix (or a 2^10 * 2^10 Float64 matrix).
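For reference, the two matrix sizes mentioned above translate into the following memory footprints. This is a small plain-Julia sketch; `array_mib` is a hypothetical helper, and whether a working set of this size interacts with on-chip caching on a given GPU is an open question that this thread does not settle.

```julia
# Size in MiB of an nx-by-ny array with element type T (hypothetical helper).
array_mib(nx, ny, T) = nx * ny * sizeof(T) / 2^20

for (n, T) in ((2^11, Float32), (2^10, Float64))
    src = array_mib(n, n, T)
    println(T, " ", n, "x", n, ": ", src, " MiB per array, ",
            2src, " MiB for source plus destination")
end
```

The Float32 case comes to 16 MiB per array and the Float64 case to 8 MiB per array, so the two configurations do not move the same number of bytes per copy.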