
[FEA]: Faster initialization time for cuda.core abstractions #658

Open

carterbox opened this issue May 23, 2025 · 7 comments
Labels
cuda.core Everything related to the cuda.core module enhancement Any code-related improvements P0 High priority - Must do!

Comments

@carterbox (Contributor) commented May 23, 2025

Is this a duplicate?

Area

cuda.core

Is your feature request related to a problem? Please describe.

As mentioned in a previous issue, equivalent operations using CuPy can be significantly faster. In this issue, I am requesting that the initialization of cuda.core abstractions have less overhead. Specifically, the initialization of the Device, Stream, and Event abstractions is slower than that of their CuPy counterparts.

>>> timeit.timeit('cp.cuda.Device()', setup='import cupy as cp')
0.06881106700166129
>>> timeit.timeit('device = ccx.Device()', setup='import cuda.core.experimental as ccx')
0.5686513699911302
>>> timeit.timeit('cp.cuda.Stream()', setup='import cupy as cp')
1.0035127629962517
>>> timeit.timeit('device.create_stream()', setup='import cuda.core.experimental as ccx; device = ccx.Device(); device.set_current()')
5.299269804003416
>>> timeit.timeit('cp.cuda.Event()', setup='import cupy as cp')
0.393913417996373
>>> timeit.timeit('device.create_event()', setup='import cuda.core.experimental as ccx; device = ccx.Device(); device.set_current()')
3.1525100879953243

This has caused noticeable performance regressions in nvmath-python when transitioning from cupy.cuda to cuda.core, visible in our benchmarks for small/medium array sizes or on faster devices, where Python overhead is more significant.

Specifically, we currently use event recording frequently to autoselect algorithms/plans and to wait for computation to complete before returning to the user (for host APIs), and we frequently use the Device constructor to check the current device.
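Roughly, the pattern looks like this (a minimal sketch using CuPy's API for illustration; timed_run, fn, and stream are placeholders, not our actual code):

import cupy as cp

def timed_run(fn, stream):
    dev = cp.cuda.Device()    # constructor also used just to query the current device
    start = cp.cuda.Event()
    end = cp.cuda.Event()
    start.record(stream)
    fn()                      # launch the actual work on `stream`
    end.record(stream)
    end.synchronize()         # wait for completion before returning (host API)
    return cp.cuda.get_elapsed_time(start, end)  # milliseconds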

Describe the solution you'd like

These init functions should be at least as fast as their CuPy counterparts.

Describe alternatives you've considered

  • Refactoring our internal implementation to call Device() less often by passing around a single Device and being careful about context switching (sketched below).
  • Using less event recording, e.g. trying to reuse the same two Events; I don't think this is feasible.
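
A minimal sketch of the first alternative, using only the calls shown above (cached_device and new_event are hypothetical helpers, and this ignores multi-device/context-switching concerns):

import functools

import cuda.core.experimental as ccx

@functools.lru_cache(maxsize=None)
def cached_device() -> ccx.Device:
    # Construct and activate the Device once, then hand the same object around
    # instead of calling ccx.Device() in every hot code path.
    device = ccx.Device()
    device.set_current()
    return device

def new_event():
    # Events are still created on demand; reusing a fixed pair of Events would
    # require auditing every code path that records them.
    return cached_device().create_event()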

Additional context

It doesn't seem like the best long-term solution for nvmath-python to try to work around these issues.

github-actions bot added the triage (Needs the team's attention) label on May 23, 2025
@leofang (Member) commented May 24, 2025

> This has caused noticeable performance regressions in nvmath-python when transitioning from cupy.cuda to cuda.core, visible in our benchmarks for small/medium array sizes or on faster devices, where Python overhead is more significant.

We need to look at the nvmath-python perf issue more closely. Can you please share how you reached the conclusion that cuda.core is the cause? Also, which version of cuda.bindings did you install? As mentioned in the meeting, 12.9.0 contains quite a few perf improvements.

It is known that the Device constructor can be made faster (#460), which could in turn make other things faster too; however, the timings we're discussing here are all at the O(100) ns level on my machine, whereas the regression I heard about during the meeting is ~50 us IIRC. It is unclear to me whether we're looking at the right bottleneck.

leofang added the awaiting-response (Further information is requested) and cuda.core (Everything related to the cuda.core module) labels on May 24, 2025
@leofang (Member) commented May 26, 2025

As noted in the OP, event creation is much slower than in CuPy:

In [7]: %timeit cp.cuda.Event()
722 ns ± 2.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

vs

In [5]: %timeit e = dev.create_event()
4.59 μs ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

FWIW, CuPy's timing is probably simpler to explain in a hand-waving way: roughly ~200 ns spent creating a cdef class object plus ~500 ns on the actual CUDA call:

In [6]: %timeit driver.cuEventCreate(2)
434 ns ± 1.42 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Looking into what's happening on the cuda.core side reveals some interesting surprises. For example, because we track the CUDA context in Python, we need to retrieve it before creating an event, and the retrieval alone already takes ~1 us:

In [4]: %timeit dev.context
1.04 μs ± 3.58 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Of that ~1 us, here's a rough breakdown: about ~600 ns goes to the actual work

In [17]: %timeit driver.cuCtxGetCurrent()
124 ns ± 1.13 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [22]: out = driver.cuCtxGetCurrent()

In [23]: %timeit handle_return(out)
197 ns ± 0.174 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [15]: %timeit int(ctx)
61 ns ± 0.681 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [16]: %timeit Context._from_ctx(ctx, 0)
236 ns ± 0.606 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

and the remaining ~400 ns is spent on this precondition (timing a decorator is a bit hard; the easiest way is to check the event creation time with/without it):

@precondition(_check_context_initialized)

which we copied from cuQuantum Python to nvmath-python to cuda.core. So about 600 ns are used by helper functions (handle_return & precondition).
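
For context, such a decorator is roughly the following shape (an illustrative sketch, not the actual cuda.core implementation); the extra wrapper call plus the check body is what gets paid on every invocation:

import functools

def precondition(check):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            check(self)  # e.g. verify the context has been initialized
            return func(self, *args, **kwargs)
        return wrapper
    return decorator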

I have also looked into the remaining ~3 us of creating an event. Similar to the above finding, time is easily consumed here and there in Python. It is simple to get a sense of this through line_profiler (but for hotspot analysis like this, the exact timings it reports cannot be trusted):

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   100                                               @classmethod
   101                                               @profile
   102                                               def _init(cls, device_id: int, ctx_handle: Context, options: Optional[EventOptions] = None):
   103        21          6.7      0.3      2.4          self = super().__new__(cls)
   104        21         67.8      3.2     24.0          self._mnff = Event._MembersNeededForFinalize(self, None)
   105                                           
   106        21         42.9      2.0     15.2          options = check_or_create_options(EventOptions, options, "Event options")
   107        21          2.8      0.1      1.0          flags = 0x0
   108        21          3.2      0.2      1.1          self._timing_disabled = False
   109        21          3.0      0.1      1.1          self._busy_waited = False
   110        21          4.0      0.2      1.4          if not options.enable_timing:
   111        21          5.8      0.3      2.0              flags |= driver.CUevent_flags.CU_EVENT_DISABLE_TIMING
   112        21          2.9      0.1      1.0              self._timing_disabled = True
   113        21          3.9      0.2      1.4          if options.busy_waited_sync:
   114                                                       flags |= driver.CUevent_flags.CU_EVENT_BLOCKING_SYNC
   115                                                       self._busy_waited = True
   116        21          3.2      0.2      1.1          if options.support_ipc:
   117                                                       raise NotImplementedError("WIP: https://github.com/NVIDIA/cuda-python/issues/103")
   118        21        121.5      5.8     42.9          self._mnff.handle = handle_return(driver.cuEventCreate(flags))
   119        21          3.1      0.1      1.1          self._device_id = device_id
   120        21          2.7      0.1      0.9          self._ctx_handle = ctx_handle
   121        21          9.4      0.4      3.3          return self

So about 24% of the time is spent on creating a destructor (_mnff), 15% on check_or_create_options (which again came from cuQuantum Python via nvmath-python to here), and 43% on creating the event itself.
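
The destructor cost can be sensed in isolation with a micro-benchmark along these lines (a sketch assuming the finalizer is registered via weakref.finalize or something similar; the actual _MembersNeededForFinalize internals may differ):

import weakref

class Plain:
    pass

class WithFinalizer:
    def __init__(self):
        # Registering a finalizer at construction time is convenient for resource
        # cleanup, but it adds a measurable per-object cost.
        self._finalizer = weakref.finalize(self, lambda: None)

# In [1]: %timeit Plain()
# In [2]: %timeit WithFinalizer()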

It seems to me we are paying the technical debt of ignoring small problem sizes. However, the above analysis still does not answer "why do we see a ~50 us overhead in nvmath-python?" AFAIK there's only one event creation per execute() (what's benchmarked in nvmath), and it should not be this costly. It'd be better to understand that first.

@leofang (Member) commented May 26, 2025

FWIW, the stream creation performance is a lot closer:

In [6]: %timeit s = dev.create_stream()
7.47 μs ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [7]: %timeit cp.cuda.Stream()
6.33 μs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [8]: %timeit s = cp.cuda.Stream()
3.69 μs ± 11.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

@carterbox (Contributor, Author) commented:
Our investigation started when Satya noticed our performance benchmarks (timed matmuls of various sizes) were slower for the upcoming release.

I then stepped through the git history since v0.3, looking for where the benchmark performance decreased. The performance drops were all correlated with the replacement of CuPy abstractions with cuda.core abstractions.

Finally, I used a statistical profiler to measure where the hot spots were during these benchmark runs, both for the previous release and for the current version. Since I was using a statistical profiler, I increased the number of benchmarking trials from 10 to 100 so that small overheads would be captured more often. I also disabled the test with autotune since I didn't care to benchmark the autotuning process. Using a host-side Python profiler is fine for this case since we are trying to measure the Python overhead. The benchmarks for the current and previous releases were run in the exact same environment, so all the device-side computation is the same.
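
As an illustration of the approach (not necessarily the exact profiler or flags we used; run_benchmarks.py and --trials are placeholders), a sampling profiler such as py-spy can be run over the benchmark script in both environments:

# new release
py-spy record --output new.svg -- python run_benchmarks.py --trials 100
# previous release (same environment, same inputs)
py-spy record --output old.svg -- python run_benchmarks.py --trials 100

Then compare the two flame graphs / call trees.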

I used the latest releases of cuda-python as of May 23, 2025.

When comparing the call trees in the profiling results, many are similar. Where they aren't, following the call stack down leads to a cuda.core abstraction.

I cannot publish the profiling results for an unreleased version of nvmath-python, so... check internally.

@carterbox (Contributor, Author) commented:
One obvious difference between torch/cupy and cuda.core is that these high-level constructs are implemented in Cython (torch/cupy) instead of Python (cuda.core). Is migrating these classes to Cython on the roadmap? Do we have any evidence suggesting what kind of performance benefit we would get?
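
For a sense of what that would look like, here is a minimal Cython-style sketch (names are hypothetical, not cuda.core's actual layout); a cdef class has C-level attribute storage and a cheaper construction path than a pure-Python class:

# illustrative only
from libc.stdint cimport uintptr_t

cdef class FastEvent:
    cdef uintptr_t _handle      # raw CUevent handle stored as an integer
    cdef readonly int device_id

    def __cinit__(self, int device_id, uintptr_t handle):
        self._handle = handle
        self.device_id = device_id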

leofang added the P0 (High priority - Must do!) label on Jun 1, 2025
leofang added this to the cuda.core parking lot milestone on Jun 1, 2025
@leofang (Member) commented Jun 4, 2025

@emcastillo is looking into this

leofang added the enhancement (Any code-related improvements) label and removed the awaiting-response (Further information is requested) and triage (Needs the team's attention) labels on Jun 4, 2025
@emcastillo commented:
Check #677 to see the actual bottlenecks :)
