Enhancing mapreduce and avoiding kernel setup overhead #2702
-
I'm not sure I understand this question, can you elaborate?
Without knowing more of what you intend, I would recommend against relying on generated functions, especially when things can be expressed in normal ways using types and typevars.

Regarding the work here: Exciting to see a fundamental kernel getting performance improvements! We've recently been moving towards (re)implementing kernels using KernelAbstractions.jl, so also take a look at the reduce kernels in https://github.com/JuliaGPU/AcceleratedKernels.jl, and e.g. the work in JuliaGPU/KernelAbstractions.jl#559. It'd be exciting if these performance improvements could be made accessible to other back-ends, which from a cursory look should be possible for the version you've implemented (with the KA.jl PR linked above providing …
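To make the portability point concrete, here is a minimal, hedged sketch of a back-end-agnostic kernel written with KernelAbstractions.jl (illustrative only, not the actual AcceleratedKernels.jl reduce; the kernel and its name are made up):

```julia
using KernelAbstractions

# Trivial element-wise kernel: the same source compiles for CPU, CUDA, ROCm, oneAPI, ...
@kernel function scale!(y, @Const(x), α)
    i = @index(Global)
    @inbounds y[i] = α * x[i]
end

x = rand(Float32, 1024); y = similar(x)
backend = get_backend(x)                     # CPU() here; CUDABackend() for a CuArray
scale!(backend, 256)(y, x, 2.0f0; ndrange = length(x))
KernelAbstractions.synchronize(backend)
```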
-
Hello everyone,
I've recently been exploring CUDA.jl and developed a custom mapreduce implementation, available at https://github.com/epilliat/Luma (first version, still experimental). I may also revisit the accumulate and sort functions before moving on to some graph algorithms.
Current Features & Performance

- A `mapreduce` function with performance similar to cuBLAS (tested with the `+` and `*` operators, effectively benchmarking dot products on vectors)

Implementation Details
The main idea, compared to the current mapreduce function, is communication between blocks using flags and `threadfence()` for the last step, instead of synchronizing all blocks. I use a random `target_flag` so that the kernel can be relaunched on the same global memory with a very small probability of error.
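A rough sketch of this "last block finishes the reduction" pattern (sum of `Float32`; illustrative names, not the actual Luma code, and with the random `target_flag` replaced here by a plain counter reset): every block publishes a partial result, and the block that increments the counter last combines the partials, so no second launch or grid-wide synchronization is needed.

```julia
using CUDA

function reduce_singlepass!(out, partials, counter, x)
    tid  = threadIdx().x
    nblk = gridDim().x
    shared = CuStaticSharedArray(Float32, 256)

    # 1) Grid-stride accumulation into shared memory
    acc = 0.0f0
    i = (blockIdx().x - 1) * blockDim().x + tid
    while i <= length(x)
        @inbounds acc += x[i]
        i += blockDim().x * nblk
    end
    @inbounds shared[tid] = acc
    sync_threads()

    # 2) Tree reduction within the block (assumes power-of-two block size <= 256)
    s = blockDim().x ÷ 2
    while s > 0
        if tid <= s
            @inbounds shared[tid] += shared[tid + s]
        end
        sync_threads()
        s ÷= 2
    end

    # 3) Publish the partial result, then signal completion
    if tid == 1
        @inbounds partials[blockIdx().x] = shared[1]
        CUDA.threadfence()                                # make the partial visible to all blocks
        old = CUDA.atomic_add!(pointer(counter), Int32(1))
        if old == Int32(nblk - 1)                         # this block finished last
            total = 0.0f0
            for b in 1:nblk
                @inbounds total += partials[b]
            end
            @inbounds out[1] = total
            @inbounds counter[1] = Int32(0)               # reset so the kernel can be relaunched
        end
    end
    return
end

# x        = CUDA.rand(Float32, 1 << 20)
# partials = CUDA.zeros(Float32, 128); counter = CUDA.zeros(Int32, 1); out = CUDA.zeros(Float32, 1)
# @cuda threads=256 blocks=128 reduce_singlepass!(out, partials, counter, x)
```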
I've taken care to measure overhead fairly, since kernel setup costs (thread/block calculation, global-memory allocation) become non-negligible for these kinds of kernels. To minimize this overhead, I encapsulated the mapreduce functionality in a `MapReduce` mutable struct that preserves the kernel setup (launch configuration and pre-allocated global memory) across calls.
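A minimal sketch of this kind of caching (illustrative only, not the actual `MapReduce` struct from Luma; it reuses the `reduce_singlepass!` sketch above and invents the name `CachedReduce`):

```julia
using CUDA

mutable struct CachedReduce{K}
    kernel   :: K                   # object returned by `@cuda launch=false`
    threads  :: Int
    blocks   :: Int
    partials :: CuVector{Float32}   # pre-allocated global-memory scratch
    counter  :: CuVector{Int32}
    out      :: CuVector{Float32}
end

function CachedReduce(x::CuVector{Float32})
    partials = CUDA.zeros(Float32, 128)
    counter  = CUDA.zeros(Int32, 1)
    out      = CUDA.zeros(Float32, 1)
    # `@cuda launch=false` compiles (or fetches from cache) a kernel specialized on the
    # argument types; the launch configuration is queried once via the occupancy API.
    kernel  = @cuda launch=false reduce_singlepass!(out, partials, counter, x)
    config  = launch_configuration(kernel.fun)
    threads = min(256, config.threads)
    blocks  = min(length(partials), config.blocks)
    CachedReduce(kernel, threads, blocks, partials, counter, out)
end

# Calling the cached object skips inference, compilation, and allocation on the hot path:
(r::CachedReduce)(x) = (r.kernel(r.out, r.partials, r.counter, x;
                                 threads = r.threads, blocks = r.blocks); r.out)
```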
Future Direction & Questions

I'm considering refactoring to use generated functions that compute kernels at compile time based on the data types. However, I'm uncertain whether `@cuda launch=false kernel(args...)` evaluates just the types or also the arguments themselves. Concerning global memory, the user could simply pass it in as an argument to reduce the computational cost. If it also evaluates the arguments, we could potentially use `Val{K}` with `2^K <= length(V) < 2^(K+1)` to determine the order of magnitude of the data size at compile time. Has anyone explored this approach?

Feedback and suggestions welcome!
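For concreteness, a small hedged sketch of the `Val{K}` idea (the kernel and its name are made up): since CUDA.jl specializes kernels on argument types, wrapping `K` in `Val(K)` makes the size magnitude part of the type, and therefore a compile-time constant inside the kernel.

```julia
using CUDA

function magnitude_kernel!(y, x, ::Val{K}) where {K}
    # K (with 2^K <= length(x) < 2^(K+1)) is a compile-time constant here, so loop
    # bounds and unrolling decisions can be fixed when the kernel is compiled.
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds y[i] = x[i] * Float32(K)
    end
    return
end

# x = CUDA.rand(Float32, 1000)
# K = 63 - leading_zeros(length(x))   # floor(log2(length(x)))
# @cuda threads=256 blocks=cld(length(x), 256) magnitude_kernel!(similar(x), x, Val(K))
```

The trade-off is that each distinct `K` triggers a separate kernel compilation, so this exchanges extra compile time for per-launch specialization.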