[FEA]: Faster initialization time for cuda.core abstractions #658
Comments
We need to look at the nvmath-python perf issue more closely. Can you please share how you reached the conclusion that "It is known that the ..."?
As noted in the OP, event creation is much slower than CuPy:

```python
In [7]: %timeit cp.cuda.Event()
722 ns ± 2.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```

vs

```python
In [5]: %timeit e = dev.create_event()
4.59 μs ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

FWIW, CuPy's timing is probably simpler to explain in a hand-waving way: it spends ~200 ns creating a cdef class object plus ~500 ns on the actual CUDA call:

```python
In [6]: %timeit driver.cuEventCreate(2)
434 ns ± 1.42 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```

Looking into what's happening when accessing `dev.context`:

```python
In [4]: %timeit dev.context
1.04 μs ± 3.58 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```

Among the ~1 μs, here's a rough breakdown. About ~600 ns goes to the actual work:

```python
In [17]: %timeit driver.cuCtxGetCurrent()
124 ns ± 1.13 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [22]: out = driver.cuCtxGetCurrent()

In [23]: %timeit handle_return(out)
197 ns ± 0.174 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [15]: %timeit int(ctx)
61 ns ± 0.681 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [16]: %timeit Context._from_ctx(ctx, 0)
236 ns ± 0.606 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```

and the remaining ~400 ns is spent on the `precondition` decorator, which we copied from cuQuantum Python to nvmath-python to cuda.core. So about 600 ns are used by helper functions (`handle_return` & `precondition`).
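For anyone who wants to reproduce this breakdown outside IPython, here is a minimal `timeit`-based sketch. The import path for `handle_return` is my assumption (it is an internal helper and may move between releases); the driver call and `Device` usage are the same as above.

```python
import timeit

from cuda.bindings import driver            # raw driver binding, as used above
from cuda.core.experimental import Device
# Internal helper; the exact module path is an assumption and may differ by version.
from cuda.core.experimental._utils.cuda_utils import handle_return

dev = Device()
dev.set_current()
out = driver.cuCtxGetCurrent()              # reuse one result for the handle_return timing

cases = {
    "driver.cuCtxGetCurrent()": lambda: driver.cuCtxGetCurrent(),
    "handle_return(out)":       lambda: handle_return(out),
    "dev.context (full path)":  lambda: dev.context,
}

for label, fn in cases.items():
    n = 1_000_000
    t = timeit.timeit(fn, number=n)
    print(f"{label:>26s}: {t / n * 1e9:6.0f} ns per call")
```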
I have also looked into the remaining ~3 μs of event creation. Similar to the above finding, time is consumed here and there in Python. It is easy to get a sense of this through line_profiler (though for hotspot analysis like this the exact timings it reports cannot be trusted):

So about 24% of the time is spent on creating a destructor (...). It seems to me we are paying the technical debt of ignoring small problem sizes. However, the above analysis still does not answer "why do we see a ~50 μs overhead in nvmath-python?" AFAIK there's only one event creation per ...
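For context, a minimal sketch of how a line-by-line breakdown like the one referenced above can be collected with `line_profiler` (this is illustrative, not the exact script behind the numbers quoted here):

```python
from line_profiler import LineProfiler

from cuda.core.experimental import Device

dev = Device()
dev.set_current()

lp = LineProfiler()
# Attach the profiler to the event-creation path under discussion and run it once;
# the printed report shows per-line hit counts and (rough) per-line time shares,
# e.g. how much is spent building the destructor/finalizer.
lp.add_function(Device.create_event)
lp.runcall(dev.create_event)
lp.print_stats()
```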
FWIW, the stream creation performance is a lot closer:

```python
In [6]: %timeit s = dev.create_stream()
7.47 μs ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [7]: %timeit cp.cuda.Stream()
6.33 μs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [8]: %timeit s = cp.cuda.Stream()
3.69 μs ± 11.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```
Our investigation started when Satya noticed our performance benchmarks (timed matmuls of various sizes) were slower for the upcoming release. I then stepped through the git history since v0.3, looking for where the benchmark numbers dropped. The drops were all correlated with the replacement of CuPy abstractions with cuda.core abstractions.

Finally, I used a statistical profiler to measure where the hot spots were while running these benchmarks, both for the previous release and for the current version. Since I was using a statistical profiler, I increased the number of benchmarking trials from 10 to 100 so that small overheads would be captured more often. I also disabled the test with autotune, since I didn't care to benchmark the autotuning process. Using a host-side Python profiler is fine for this case since we are trying to measure the Python overhead. The benchmarks for the current and previous releases were run in the exact same environment, so all the device-side computation is the same. I used the latest releases of cuda-python as of May 23, 2025.

When comparing the call trees in the profiling results, many are similar. When they aren't similar, we follow the call stack down and find a cuda.core abstraction. I cannot publish the profiling results for an unreleased version of nvmath-python, so... check internally.
One obvious difference between torch/cupy and cuda.core is that these high-level constructs are implemented in Cython (torch/cupy) rather than Python (cuda.core). Is migrating these classes to Cython on the roadmap? Do we have any evidence suggesting what kind of performance benefit we would get?
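As a rough way to put a floor under the per-object cost this question alludes to, here is a minimal sketch (the class is a hypothetical stand-in, not cuda.core's actual `Event`); its result can be compared against the ~200 ns cdef-class figure quoted earlier in the thread:

```python
import timeit

class PyEventStandIn:
    """Hypothetical plain-Python stand-in for a wrapper object like Event.

    Measures only Python object construction; no CUDA call is made.
    """
    def __init__(self):
        self.handle = None
        self.timing_enabled = True

n = 1_000_000
t = timeit.timeit(PyEventStandIn, number=n)
print(f"pure-Python construction: {t / n * 1e9:.0f} ns per object")
```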
@emcastillo is looking into this
Check #677 to see the actual bottlenecks :)
Is this a duplicate?
Area
cuda.core
Is your feature request related to a problem? Please describe.
As mentioned in a previous issue, equivalent operations using CuPy can be significantly faster. In this issue, I am requesting that the initialization of `cuda.core` abstractions have less overhead. Specifically, when compared to their CuPy counterparts, the initialization of the Device, Stream, and Event abstractions is slower. This has caused noticeable performance regressions in `nvmath-python` when transitioning from `cupy.cuda` to `cuda.core` for our benchmarks on small/medium-size arrays, or on faster devices where Python overhead is more significant.

Specifically, we currently use event recording frequently in order to autoselect algorithms/plans and to wait for computation to complete before returning to the user (for host APIs), and we frequently use the Device constructor to check the current device.
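To make the pattern concrete, here is a rough sketch of the per-call shape described above. It is not nvmath-python's actual code, and the `record`/`sync` calls are my assumption about the cuda.core event API; the `Device()`, `create_stream()`, and `create_event()` calls are the ones whose init cost is discussed in this issue.

```python
from cuda.core.experimental import Device

def host_api_call(launch_work):
    # Runs on every host API call, so per-object construction overhead is paid
    # repeatedly and dominates for small/medium problem sizes.
    dev = Device()                 # used just to check/bind the current device
    stream = dev.create_stream()   # ~7.5 us today vs ~3.7 us for cp.cuda.Stream()
    event = dev.create_event()     # ~4.6 us today vs ~0.7 us for cp.cuda.Event()

    launch_work(stream)            # enqueue the device-side computation
    stream.record(event)           # assumed API: record an event on the stream
    event.sync()                   # assumed API: block until the work completes
```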
Describe the solution you'd like
These init functions should be just as fast as, or faster than, CuPy's abstractions.
Describe alternatives you've considered
- Calling `Device()` less often by passing around one `Device` and being careful about context switching.
- ... `Event`s. I don't think this is feasible.

Additional context
It doesn't seem like the best long-term solution for `nvmath-python` to try to work around these issues.