
[FEA]: Faster initialization time for cuda.core abstractions #658

Open

carterbox opened this issue May 23, 2025 · 7 comments
Labels
cuda.core Everything related to the cuda.core module enhancement Any code-related improvements P0 High priority - Must do!

Comments

@carterbox (Contributor) commented May 23, 2025

Is this a duplicate?

Area

cuda.core

Is your feature request related to a problem? Please describe.

As mentioned in a previous issue, equivalent operations using CuPy can be significantly faster. In this issue, I am requesting that the initialization of cuda.core abstractions have less overhead. Specifically, the initialization of the Device, Stream, and Event abstractions is slower than that of their CuPy counterparts.

>>> timeit.timeit('cp.cuda.Device()', setup='import cupy as cp')
0.06881106700166129
>>> timeit.timeit('device = ccx.Device()', setup='import cuda.core.experimental as ccx')
0.5686513699911302
>>> timeit.timeit('cp.cuda.Stream()', setup='import cupy as cp')
1.0035127629962517
>>> timeit.timeit('device.create_stream()', setup='import cuda.core.experimental as ccx; device = ccx.Device(); device.set_current()')
5.299269804003416
>>> timeit.timeit('cp.cuda.Event()', setup='import cupy as cp')
0.393913417996373
>>> timeit.timeit('device.create_event()', setup='import cuda.core.experimental as ccx; device = ccx.Device(); device.set_current()')
3.1525100879953243

This has caused noticeable performance regressions in nvmath-python when transitioning from cupy.cuda to cuda.core, visible in our benchmarks for small/medium array sizes or on faster devices, where Python overhead is more significant.

Specifically, we currently use event recording frequently to autoselect algorithms/plans and to wait for computation to complete before returning to the user (for host APIs), and we frequently use the Device constructor to check the current device.
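Roughly, the pattern looks like this (a minimal sketch using CuPy's API for illustration; timed_run, fn, and stream are placeholders, not our actual code):

import cupy as cp

def timed_run(fn, stream):
    dev = cp.cuda.Device()    # constructor also used just to query the current device
    start = cp.cuda.Event()
    end = cp.cuda.Event()
    start.record(stream)
    fn()                      # launch the actual work on `stream`
    end.record(stream)
    end.synchronize()         # wait for completion before returning (host API)
    return cp.cuda.get_elapsed_time(start, end)  # milliseconds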

Describe the solution you'd like

These init functions should be at least as fast as their CuPy counterparts.

Describe alternatives you've considered

  • Refactoring our internal implementation to call Device() less often by passing around a single Device and being careful about context switching (sketched below).
  • Using less event recording, e.g. trying to reuse the same two Events; I don't think this is feasible.
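
A minimal sketch of the first alternative, using only the calls shown above (cached_device and new_event are hypothetical helpers, and this ignores multi-device/context-switching concerns):

import functools

import cuda.core.experimental as ccx

@functools.lru_cache(maxsize=None)
def cached_device() -> ccx.Device:
    # Construct and activate the Device once, then hand the same object around
    # instead of calling ccx.Device() in every hot code path.
    device = ccx.Device()
    device.set_current()
    return device

def new_event():
    # Events are still created on demand; reusing a fixed pair of Events would
    # require auditing every code path that records them.
    return cached_device().create_event()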

Additional context

It doesn't seem like the best long-term solution for nvmath-python to try to work around these issues.

github-actions bot added the triage (Needs the team's attention) label on May 23, 2025
@leofang (Member) commented May 24, 2025

> This has caused noticeable performance regressions in nvmath-python when transitioning from cupy.cuda to cuda.core, visible in our benchmarks for small/medium array sizes or on faster devices, where Python overhead is more significant.

We need to look at the nvmath-python perf issue more closely. Can you please share how you reached the conclusion that cuda.core is the cause? Also, which version of cuda.bindings did you install? As mentioned in the meeting, 12.9.0 contains quite a few perf improvements.

It is known that the Device constructor can be made faster (#460), which could in turn make other things faster too; however, the timings we're discussing here are all at the O(100) ns level on my machine, whereas the regression I heard about during the meeting is ~50 us IIRC. It is unclear to me whether we're looking at the right bottleneck.

leofang added the awaiting-response (Further information is requested) and cuda.core (Everything related to the cuda.core module) labels on May 24, 2025
@leofang (Member) commented May 26, 2025

As noted in the OP, event creation is much slower than in CuPy:

In [7]: %timeit cp.cuda.Event()
722 ns ± 2.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

vs

In [5]: %timeit e = dev.create_event()
4.59 μs ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

FWIW, CuPy's timing is probably simpler to explain in a hand-waving way: roughly ~200 ns spent creating a cdef class object plus ~500 ns on the actual CUDA call:

In [6]: %timeit driver.cuEventCreate(2)
434 ns ± 1.42 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Looking into what's happening on the cuda.core side reveals some interesting surprises. For example, because we track the CUDA context in Python, we need to retrieve it before creating an event, and the retrieval alone already takes ~1 us:

In [4]: %timeit dev.context
1.04 μs ± 3.58 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Of that ~1 us, here's a rough breakdown: about ~600 ns goes to the actual work

In [17]: %timeit driver.cuCtxGetCurrent()
124 ns ± 1.13 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [22]: out = driver.cuCtxGetCurrent()

In [23]: %timeit handle_return(out)
197 ns ± 0.174 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [15]: %timeit int(ctx)
61 ns ± 0.681 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [16]: %timeit Context._from_ctx(ctx, 0)
236 ns ± 0.606 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

and the remaining ~400 ns is spent on this precondition (timing a decorator is a bit hard; the easiest way is to check the event creation time with/without it):

@precondition(_check_context_initialized)

which we copied from cuQuantum Python to nvmath-python to cuda.core. So about 600 ns are used by helper functions (handle_return & precondition).
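
For context, such a decorator is roughly the following shape (an illustrative sketch, not the actual cuda.core implementation); the extra wrapper call plus the check body is what gets paid on every invocation:

import functools

def precondition(check):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            check(self)  # e.g. verify the context has been initialized
            return func(self, *args, **kwargs)
        return wrapper
    return decorator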

I have also looked into the remaining ~3 us of creating an event. Similar to the above finding, time is easily consumed here and there in Python. It is simple to get a sense of this through line_profiler (but for hotspot analysis like this, the exact timings it reports cannot be trusted):

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   100                                               @classmethod
   101                                               @profile
   102                                               def _init(cls, device_id: int, ctx_handle: Context, options: Optional[EventOptions] = None):
   103        21          6.7      0.3      2.4          self = super().__new__(cls)
   104        21         67.8      3.2     24.0          self._mnff = Event._MembersNeededForFinalize(self, None)
   105                                           
   106        21         42.9      2.0     15.2          options = check_or_create_options(EventOptions, options, "Event options")
   107        21          2.8      0.1      1.0          flags = 0x0
   108        21          3.2      0.2      1.1          self._timing_disabled = False
   109        21          3.0      0.1      1.1          self._busy_waited = False
   110        21          4.0      0.2      1.4          if not options.enable_timing:
   111        21          5.8      0.3      2.0              flags |= driver.CUevent_flags.CU_EVENT_DISABLE_TIMING
   112        21          2.9      0.1      1.0              self._timing_disabled = True
   113        21          3.9      0.2      1.4          if options.busy_waited_sync:
   114                                                       flags |= driver.CUevent_flags.CU_EVENT_BLOCKING_SYNC
   115                                                       self._busy_waited = True
   116        21          3.2      0.2      1.1          if options.support_ipc:
   117                                                       raise NotImplementedError("WIP: https://github.com/NVIDIA/cuda-python/issues/103")
   118        21        121.5      5.8     42.9          self._mnff.handle = handle_return(driver.cuEventCreate(flags))
   119        21          3.1      0.1      1.1          self._device_id = device_id
   120        21          2.7      0.1      0.9          self._ctx_handle = ctx_handle
   121        21          9.4      0.4      3.3          return self

So about 24% of the time is spent on creating a destructor (_mnff), 15% on check_or_create_options (which again came from cuQuantum Python via nvmath-python to here), and 43% on creating the event itself.
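
The destructor cost can be sensed in isolation with a micro-benchmark along these lines (a sketch assuming the finalizer is registered via weakref.finalize or something similar; the actual _MembersNeededForFinalize internals may differ):

import weakref

class Plain:
    pass

class WithFinalizer:
    def __init__(self):
        # Registering a finalizer at construction time is convenient for resource
        # cleanup, but it adds a measurable per-object cost.
        self._finalizer = weakref.finalize(self, lambda: None)

# In [1]: %timeit Plain()
# In [2]: %timeit WithFinalizer()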

It seems to me we are paying the technical debt of ignoring small problem sizes. However, the above analysis still does not answer "why do we see a ~50 us overhead in nvmath-python?" AFAIK there's only one event creation per execute() (what's benchmarked in nvmath), and it should not be this costly. It'd be better to understand that first.

@leofang (Member) commented May 26, 2025

FWIW, the stream creation performance is a lot closer:

In [6]: %timeit s = dev.create_stream()
7.47 μs ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [7]: %timeit cp.cuda.Stream()
6.33 μs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [8]: %timeit s = cp.cuda.Stream()
3.69 μs ± 11.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

@carterbox (Contributor, Author) commented:
Our investigation started when Satya noticed our performance benchmarks (timed matmuls of various sizes) were slower for the upcoming release.

I then stepped through the git history since v0.3, looking for where the benchmark performance decreased. The performance drops were all correlated with the replacement of CuPy abstractions with cuda.core abstractions.

Finally, I used a statistical profiler to measure where the hot spots were during these benchmark runs, both for the previous release and for the current version. Since I was using a statistical profiler, I increased the number of benchmarking trials from 10 to 100 so that small overheads would be captured more often. I also disabled the test with autotune since I didn't care to benchmark the autotuning process. Using a host-side Python profiler is fine for this case since we are trying to measure the Python overhead. The benchmarks for the current and previous releases were run in the exact same environment, so all the device-side computation is the same.
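
As an illustration of the approach (not necessarily the exact profiler or flags we used; run_benchmarks.py and --trials are placeholders), a sampling profiler such as py-spy can be run over the benchmark script in both environments:

# new release
py-spy record --output new.svg -- python run_benchmarks.py --trials 100
# previous release (same environment, same inputs)
py-spy record --output old.svg -- python run_benchmarks.py --trials 100

Then compare the two flame graphs / call trees.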

I used the latest releases of cuda-python as of May 23, 2025.

When comparing the call trees in the profiling results, many are similar. Where they aren't, following the call stack down leads to a cuda.core abstraction.

I cannot publish the profiling results for an unreleased version of nvmath-python, so... check internally.

@carterbox (Contributor, Author) commented:
One obvious difference between torch/cupy and cuda.core is that these high-level constructs are implemented in Cython (torch/cupy) instead of Python (cuda.core). Is migrating these classes to Cython on the roadmap? Do we have any evidence suggesting what kind of performance benefit we would get?
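
For a sense of what that would look like, here is a minimal Cython-style sketch (names are hypothetical, not cuda.core's actual layout); a cdef class has C-level attribute storage and a cheaper construction path than a pure-Python class:

# illustrative only
from libc.stdint cimport uintptr_t

cdef class FastEvent:
    cdef uintptr_t _handle      # raw CUevent handle stored as an integer
    cdef readonly int device_id

    def __cinit__(self, int device_id, uintptr_t handle):
        self._handle = handle
        self.device_id = device_id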

leofang added the P0 (High priority - Must do!) label on Jun 1, 2025
leofang added this to the cuda.core parking lot milestone on Jun 1, 2025
@leofang (Member) commented Jun 4, 2025

@emcastillo is looking into this

leofang added the enhancement (Any code-related improvements) label and removed the awaiting-response (Further information is requested) and triage (Needs the team's attention) labels on Jun 4, 2025
@emcastillo commented:
Check #677 to see the actual bottlenecks :)
