Is this a duplicate?
Area
Thrust

Is your feature request related to a problem? Please describe.
ccl/cub/cub/block/block_radix_rank.cuh
I assume that we can reduce some shared memory access in function ScanCounters. Will it result in some unexpected behavior?

Describe the solution you'd like
change
to

Describe alternatives you've considered
No response

Additional context
No response
Replies: 1 comment
Thank you for your suggestion, @hlyix!
While an early return might seem appealing for optimization, warp divergence becomes a concern in this context. If threads within the same warp take divergent execution paths (some returning early while others proceed), this serializes instruction execution across the warp, potentially negating any performance gains and even causing regressions.
More importantly, the proposed change would compromise correctness: the `ExclusiveDownsweep` operation isn't just a conditional scan; it also integrates the `exclusive_partial` value into the thread's items.
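
To make the correctness point concrete, here is a minimal host-side C++ sketch of the upsweep / exclusive-sum / downsweep pattern. It is not the CUB implementation: the names `upsweep`, `exclusive_downsweep`, and `scan_counters` are illustrative, each inner vector stands in for one thread's raking segment, and the `skip_empty` flag is a hypothetical early-return condition assumed only for illustration (the exact condition in the proposal is not reproduced here).

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side model: each inner vector is one "thread's" segment of counters.
using Segments = std::vector<std::vector<int>>;

// Upsweep: reduce one thread's segment to a single partial sum.
int upsweep(const std::vector<int>& segment) {
    int sum = 0;
    for (int v : segment) sum += v;
    return sum;
}

// Downsweep: exclusive scan of the segment, seeded with the sum of all
// preceding segments. Every item is overwritten, even when its count is zero.
void exclusive_downsweep(std::vector<int>& segment, int exclusive_partial) {
    assert(exclusive_partial >= 0);  // counters are non-negative in this model
    int running = exclusive_partial;
    for (int& v : segment) {
        int count = v;
        v = running;
        running += count;
    }
}

// Sequential stand-in for the block-wide scan. When skip_empty is true, a
// thread whose own partial is zero returns before the downsweep (an assumed,
// illustrative early-return condition).
Segments scan_counters(Segments segments, bool skip_empty) {
    int exclusive_partial = 0;
    for (auto& segment : segments) {
        int raking_partial = upsweep(segment);
        if (!(skip_empty && raking_partial == 0)) {
            exclusive_downsweep(segment, exclusive_partial);
        }
        exclusive_partial += raking_partial;
    }
    return segments;
}
```

With the input `{{2, 1}, {0, 0}, {3, 0}}`, the full pass produces `{{0, 2}, {3, 3}, {3, 6}}`, while the skipping variant leaves the middle segment at `{0, 0}`: its items never receive the prefix of 3 carried by `exclusive_partial`. Whether a given consumer reads those items depends on the caller, so treat this as an illustration of the data dependency rather than a statement about every code path.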