Enhancing mapreduce and avoiding kernel setup overhead #2702
-
I'm not sure I understand this question, can you elaborate?
Without knowing more of what you intend, I would recommend against relying on generated functions, especially when things can be expressed in normal ways using types and typevars.

Regarding the work here: Exciting to see a fundamental kernel getting performance improvements! We've recently been moving towards (re)implementing kernels using KernelAbstractions.jl, so also take a look at the reduce kernels in https://github.com/JuliaGPU/AcceleratedKernels.jl, and e.g. the work in JuliaGPU/KernelAbstractions.jl#559. It'd be exciting if these performance improvements could be made accessible to other back-ends, which from a cursory look should be possible for the version you've implemented (with the KA.jl PR linked above providing …
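To make the portability point concrete, here is a minimal, hedged sketch of a back-end-agnostic kernel written with KernelAbstractions.jl (illustrative only, not the actual AcceleratedKernels.jl reduce; the kernel and its name are made up):

```julia
using KernelAbstractions

# Trivial element-wise kernel: the same source compiles for CPU, CUDA, ROCm, oneAPI, ...
@kernel function scale!(y, @Const(x), α)
    i = @index(Global)
    @inbounds y[i] = α * x[i]
end

x = rand(Float32, 1024); y = similar(x)
backend = get_backend(x)                     # CPU() here; CUDABackend() for a CuArray
scale!(backend, 256)(y, x, 2.0f0; ndrange = length(x))
KernelAbstractions.synchronize(backend)
```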
-
Hello everyone,
I've recently been exploring CUDA.jl and developed a custom mapreduce implementation, available at https://github.com/epilliat/Luma (first version, still experimental). I may also revisit the accumulate and sort functions before moving on to some graph algorithms.
Current Features & Performance

- A `mapreduce` function with performance similar to cuBLAS (tested with the `+` and `*` operators, effectively benchmarking dot products on vectors)

Implementation Details
The main idea, compared to the current mapreduce function, is communication between blocks using flags and `threadfence()` for the last step, instead of synchronizing all blocks. I use a random `target_flag` so that the kernel can be relaunched on the same global memory with a very small probability of error.
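A rough sketch of this "last block finishes the reduction" pattern (sum of `Float32`; illustrative names, not the actual Luma code, and with the random `target_flag` replaced here by a plain counter reset): every block publishes a partial result, and the block that increments the counter last combines the partials, so no second launch or grid-wide synchronization is needed.

```julia
using CUDA

function reduce_singlepass!(out, partials, counter, x)
    tid  = threadIdx().x
    nblk = gridDim().x
    shared = CuStaticSharedArray(Float32, 256)

    # 1) Grid-stride accumulation into shared memory
    acc = 0.0f0
    i = (blockIdx().x - 1) * blockDim().x + tid
    while i <= length(x)
        @inbounds acc += x[i]
        i += blockDim().x * nblk
    end
    @inbounds shared[tid] = acc
    sync_threads()

    # 2) Tree reduction within the block (assumes power-of-two block size <= 256)
    s = blockDim().x ÷ 2
    while s > 0
        if tid <= s
            @inbounds shared[tid] += shared[tid + s]
        end
        sync_threads()
        s ÷= 2
    end

    # 3) Publish the partial result, then signal completion
    if tid == 1
        @inbounds partials[blockIdx().x] = shared[1]
        CUDA.threadfence()                                # make the partial visible to all blocks
        old = CUDA.atomic_add!(pointer(counter), Int32(1))
        if old == Int32(nblk - 1)                         # this block finished last
            total = 0.0f0
            for b in 1:nblk
                @inbounds total += partials[b]
            end
            @inbounds out[1] = total
            @inbounds counter[1] = Int32(0)               # reset so the kernel can be relaunched
        end
    end
    return
end

# x        = CUDA.rand(Float32, 1 << 20)
# partials = CUDA.zeros(Float32, 128); counter = CUDA.zeros(Int32, 1); out = CUDA.zeros(Float32, 1)
# @cuda threads=256 blocks=128 reduce_singlepass!(out, partials, counter, x)
```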
I've taken care to measure overhead fairly, since kernel setup costs (thread/block calculation, global-memory allocation) become non-negligible for these kinds of kernels. To minimize this overhead, I encapsulated the mapreduce functionality in a `MapReduce` mutable struct that preserves the kernel setup (launch configuration and pre-allocated global memory) across calls.
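A minimal sketch of this kind of caching (illustrative only, not the actual `MapReduce` struct from Luma; it reuses the `reduce_singlepass!` sketch above and invents the name `CachedReduce`):

```julia
using CUDA

mutable struct CachedReduce{K}
    kernel   :: K                   # object returned by `@cuda launch=false`
    threads  :: Int
    blocks   :: Int
    partials :: CuVector{Float32}   # pre-allocated global-memory scratch
    counter  :: CuVector{Int32}
    out      :: CuVector{Float32}
end

function CachedReduce(x::CuVector{Float32})
    partials = CUDA.zeros(Float32, 128)
    counter  = CUDA.zeros(Int32, 1)
    out      = CUDA.zeros(Float32, 1)
    # `@cuda launch=false` compiles (or fetches from cache) a kernel specialized on the
    # argument types; the launch configuration is queried once via the occupancy API.
    kernel  = @cuda launch=false reduce_singlepass!(out, partials, counter, x)
    config  = launch_configuration(kernel.fun)
    threads = min(256, config.threads)
    blocks  = min(length(partials), config.blocks)
    CachedReduce(kernel, threads, blocks, partials, counter, out)
end

# Calling the cached object skips inference, compilation, and allocation on the hot path:
(r::CachedReduce)(x) = (r.kernel(r.out, r.partials, r.counter, x;
                                 threads = r.threads, blocks = r.blocks); r.out)
```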
Future Direction & Questions

I'm considering refactoring to use generated functions that compute kernels at compile time based on the data types. However, I'm uncertain whether `@cuda launch=false kernel(args...)` evaluates just the types or also the arguments themselves. Concerning global memory, the user could simply pass it in as an argument to reduce the computational cost. If it also evaluates the arguments, we could potentially use `Val{K}` with `2^K <= length(V) < 2^(K+1)` to determine the order of magnitude of the data size at compile time. Has anyone explored this approach?

Feedback and suggestions welcome!
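For concreteness, a small hedged sketch of the `Val{K}` idea (the kernel and its name are made up): since CUDA.jl specializes kernels on argument types, wrapping `K` in `Val(K)` makes the size magnitude part of the type, and therefore a compile-time constant inside the kernel.

```julia
using CUDA

function magnitude_kernel!(y, x, ::Val{K}) where {K}
    # K (with 2^K <= length(x) < 2^(K+1)) is a compile-time constant here, so loop
    # bounds and unrolling decisions can be fixed when the kernel is compiled.
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds y[i] = x[i] * Float32(K)
    end
    return
end

# x = CUDA.rand(Float32, 1000)
# K = 63 - leading_zeros(length(x))   # floor(log2(length(x)))
# @cuda threads=256 blocks=cld(length(x), 256) magnitude_kernel!(similar(x), x, Val(K))
```

The trade-off is that each distinct `K` triggers a separate kernel compilation, so this exchanges extra compile time for per-launch specialization.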