Block Chain Scan #884
base: improve-workgroup-scan-2
Conversation
namespace scan
{

template&lt;class Config, class BinOp, bool ForwardProgressGuarantees, class device_capabilities=void&gt;
we should make a fake device feature called forwardProgressGuarantees, which is basically always false
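A minimal host-side C++ sketch of what such a trait could look like (all names here are illustrative, not the Nabla API): a "fake" device capability that defaults forwardProgressGuarantees to false, which scan code can branch on at compile time.

```cpp
// Hypothetical sketch, not Nabla's actual device_capabilities type:
// a capability trait whose forwardProgressGuarantees member is basically
// always false, so the non-guaranteed code path is chosen by default.
struct default_device_capabilities
{
    static constexpr bool forwardProgressGuarantees = false;
};

// a scan implementation could select its strategy on the trait at compile time
template<class device_capabilities>
constexpr bool has_forward_progress()
{
    return device_capabilities::forwardProgressGuarantees;
}

static_assert(!has_forward_progress<default_device_capabilities>(), "defaults to false");
```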
template&lt;typename T&gt; // only uint32_t or uint64_t for now?
struct Constants
{
    NBL_CONSTEXPR_STATIC_INLINE T NOT_READY = 0;
    NBL_CONSTEXPR_STATIC_INLINE T LOCAL_COUNT = T(0x1u) &lt;&lt; (sizeof(T)*8-2);
    NBL_CONSTEXPR_STATIC_INLINE T GLOBAL_COUNT = T(0x1u) &lt;&lt; (sizeof(T)*8-1);
    NBL_CONSTEXPR_STATIC_INLINE T STATUS_MASK = LOCAL_COUNT | GLOBAL_COUNT;
};
you can use an enum class; if you update DXC, btw, it also works with : uint16_t or : uint64_t underlying types. Not everything needs to be a crazy template.
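A hedged sketch of the suggestion: the same status constants as an enum class with an explicit 32-bit underlying type (values chosen to match Constants&lt;uint32_t&gt; above; the name Status is illustrative).

```cpp
#include <cstdint>

// Sketch only: status bits as a scoped enum with a fixed underlying type,
// instead of a templated Constants struct. Values mirror Constants<uint32_t>.
enum class Status : uint32_t
{
    NOT_READY    = 0,
    LOCAL_COUNT  = 0x1u << 30, // second most significant bit
    GLOBAL_COUNT = 0x1u << 31, // most significant bit
    STATUS_MASK  = LOCAL_COUNT | GLOBAL_COUNT
};

static_assert(static_cast<uint32_t>(Status::STATUS_MASK) == 0xC0000000u, "top two bits");
```

Because the underlying type is fixed, the enumerators have that integral type inside the enumerator list, so the `LOCAL_COUNT | GLOBAL_COUNT` initializer is legal without casts.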
scalar_t __call(NBL_REF_ARG(DataAccessor) dataAccessor, NBL_REF_ARG(ScratchAccessor) sharedMemScratchAccessor)
{
    const scalar_t localReduction = workgroup_reduce_t::__call&lt;DataAccessor, ScratchAccessor&gt;(dataAccessor, sharedMemScratchAccessor);
    bda::__ptr&lt;T&gt; scratch = dataAccessor.getScratchPtr(); // scratch data should be at least T[NumWorkgroups]
ask for a separate accessor for the workgroup flags
Also, 32-bit atomics are always present and probably the cheapest; there's no harm in packing 16 workgroup flags into the same uint32_t, but please benchmark whether it's a net gain (more likely to sit in the GPU's L2 cache) or a loss (contention)
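To make the packing idea concrete, here is a host-side C++ sketch using std::atomic in place of the SPIR-V atomics (the helper names and layout are assumptions, not from the PR): 16 two-bit status flags per uint32_t, set with an atomic OR (sufficient if a flag only ever gains bits, NOT_READY to LOCAL to GLOBAL) and read with an atomic load.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative sketch of packing 16 two-bit workgroup status flags into one
// uint32_t word; std::atomic stands in for the device-scope SPIR-V atomics.
constexpr uint32_t FLAGS_PER_WORD = 16u; // 2 bits per workgroup flag

inline void setFlag(std::atomic<uint32_t>* flagWords, uint32_t workgroupID, uint32_t twoBitStatus)
{
    const uint32_t word = workgroupID / FLAGS_PER_WORD;
    const uint32_t shift = (workgroupID % FLAGS_PER_WORD) * 2u;
    // OR is enough because a flag only ever transitions upwards (gains bits)
    flagWords[word].fetch_or(twoBitStatus << shift, std::memory_order_release);
}

inline uint32_t getFlag(const std::atomic<uint32_t>* flagWords, uint32_t workgroupID)
{
    const uint32_t word = workgroupID / FLAGS_PER_WORD;
    const uint32_t shift = (workgroupID % FLAGS_PER_WORD) * 2u;
    return (flagWords[word].load(std::memory_order_acquire) >> shift) & 0x3u;
}
```

The benchmark question from the comment is exactly the trade-off visible here: 16 workgroups now contend on one cache line when setting flags, but a reader scanning predecessors touches 16x fewer words.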
aaah, that might not be compatible with using UMax though
template&lt;class ReadOnlyDataAccessor, class ScratchAccessor NBL_FUNC_REQUIRES(workgroup2::ArithmeticReadOnlyDataAccessor&lt;ReadOnlyDataAccessor,scalar_t&gt; &amp;&amp; workgroup2::ArithmeticSharedMemoryAccessor&lt;ScratchAccessor,scalar_t&gt;)
static scalar_t __call(NBL_REF_ARG(ReadOnlyDataAccessor) dataAccessor, NBL_REF_ARG(ScratchAccessor) sharedMemScratchAccessor)
you need a 3rd accessor which is an atomic accessor (to both accumulate the result and figure out when everyone is done)
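A host-side C++ sketch of what that third accessor could expose (interface and names are my assumption, not the Nabla API): one atomic to accumulate the result and one counter to detect when the last workgroup has contributed.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative atomic accessor for a device-scope reduction: accumulates
// partial results and counts finished workgroups, as the comment suggests.
struct AtomicReduceAccessor
{
    std::atomic<uint32_t>* result;        // accumulated reduction
    std::atomic<uint32_t>* finishedCount; // how many workgroups are done

    // returns true for the last workgroup to contribute ("everyone is done")
    bool contribute(uint32_t localReduction, uint32_t workgroupCount)
    {
        result->fetch_add(localReduction, std::memory_order_release);
        return finishedCount->fetch_add(1u, std::memory_order_acq_rel) + 1u == workgroupCount;
    }
};
```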
    // get last item from scratch
    const uint32_t lastWorkgroup = glsl::gl_NumWorkGroups().x - 1;
    bda::__ref&lt;scalar_t&gt; scratchLast = (scratch + lastWorkgroup).deref();
    scalar_t value = constants_t::NOT_READY;
    if (lastInvocation)
    {
        // wait until last workgroup does reduction
        while (!(value &amp; constants_t::GLOBAL_COUNT))
        {
            // value = spirv::atomicLoad(scratchLast.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsAcquireMask);
            value = spirv::atomicIAdd(scratchLast.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsAcquireMask, 0u);
        }
    }
    value = workgroup::Broadcast(value, sharedMemScratchAccessor, Config::WorkgroupSize-1);
    return value &amp; (~constants_t::STATUS_MASK);
This won't work even with forward progress guarantees; you just need to let the workgroup quit
template&lt;class DataAccessor, class ScratchAccessor&gt;
scalar_t __call(NBL_REF_ARG(DataAccessor) dataAccessor, NBL_REF_ARG(ScratchAccessor) sharedMemScratchAccessor)
you have the readonly accessor to get your element, and you have the scratch memory accessor (for the workgroup scans/reductions) but you don't have:
- accessor for the Device-Scope scratch
- accessor for where to store the reduction result (reduction is special compared to a scan, you can't get the result right away)
    if (lastInvocation)
    {
        bda::__ref&lt;scalar_t&gt; scratchId = (scratch + glsl::gl_WorkGroupID().x).deref();
        spirv::atomicUMax(scratchId.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsReleaseMask, localReduction|constants_t::LOCAL_COUNT);
you want to separate the storage of the reduction from the flags I think.
    for (uint32_t i = 1; i &lt;= glsl::gl_WorkGroupID().x; i++)
    {
        const uint32_t prevID = glsl::gl_WorkGroupID().x-i;
don't use gl_WorkGroupID; ask for the virtualWorkgroupIndex in the function call
        // value = spirv::atomicLoad(scratchPrev.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsAcquireMask);
        value = spirv::atomicIAdd(scratchPrev.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsAcquireMask, 0u);
you'll have multiple workgroups doing this and you'll mess up the results; you want to accumulate those locally, here in a register
so while you're walking backwards, you only do prefix += atomicLoad(scratchPrev.__get_spv_ptr(), spv::ScopeDevice, MakeVisible);
Also this requires that you have two different scratch store locations:
- one for local reduction
- one for global reduction
If you keep the global and local on the same address, you get nasty data races: the status is a flag and not a mutex, so you can overwrite a local result with a global one while another workgroup reads, and that reader thinks the value it read is a local result because the flag is not updated yet.
P.S. You probably don't want to be writing out the GLOBAL results and updating status flags here even though you can, because other workgroups before you are obviously "just about" to write out their results, and this way you'll just introduce more uncached memory traffic.
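A single-threaded C++ sketch of the look-back described above (names and layout are assumptions; std::atomic stands in for the device-scope SPIR-V atomic loads): walk backwards over per-workgroup slots, accumulate the prefix in a register, and stop early as soon as a predecessor has published a GLOBAL (inclusive) result.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative status bits, matching the top-two-bit scheme in Constants<uint32_t>
constexpr uint32_t LOCAL_READY  = 0x1u << 30;
constexpr uint32_t GLOBAL_READY = 0x1u << 31;
constexpr uint32_t VALUE_MASK   = ~(LOCAL_READY | GLOBAL_READY);

uint32_t lookBack(const std::atomic<uint32_t>* slots, uint32_t virtualWorkgroupIndex)
{
    uint32_t prefix = 0u; // accumulated locally in a register, no atomic RMW
    for (uint32_t prev = virtualWorkgroupIndex; prev-- > 0u; )
    {
        uint32_t v;
        // spin until the predecessor published at least its LOCAL reduction
        do { v = slots[prev].load(std::memory_order_acquire); } while (v == 0u);
        prefix += v & VALUE_MASK;
        if (v & GLOBAL_READY) // that value is already an inclusive prefix, stop
            break;
    }
    return prefix;
}
```

Note the loop only ever does acquire loads on predecessors, which is the "accumulate in a register" point from the comment; nothing is written back to their slots.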
    if (lastInvocation) // don't make whole block work and do busy stuff
    {
        // for (uint32_t prevID = glsl::gl_WorkGroupID().x-1; prevID &gt;= 0u; prevID--) // won't run properly: prevID is unsigned, so prevID &gt;= 0u is always true and the loop never terminates (device lost)
        for (uint32_t i = 1; i &lt;= glsl::gl_WorkGroupID().x; i++)
actually, using the whole workgroup or at least a subgroup (benchmark it) would be much faster here: each invocation checks a workgroup, and you can use workgroup2::reduce with a Max binop to find the first preceding workgroup with a ready GLOBAL scan value
You'd also be able to accumulate the prefix faster over the ones which have LOCAL ready
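A serial C++ stand-in for the suggested parallel probe (workgroup2::reduce itself is not reproduced; the loop plays the role of one invocation per slot, and the Max combine is explicit): find the closest preceding workgroup whose flag says its GLOBAL scan value is ready.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr uint32_t GLOBAL_READY = 0x1u << 31; // illustrative flag bit
constexpr uint32_t kInvalid = 0xFFFFFFFFu;    // "no ready workgroup found"

// Each loop iteration stands in for one invocation inspecting one slot;
// combining candidate indices with Max mimics workgroup2::reduce with a Max binop.
uint32_t closestGlobalReady(const std::vector<uint32_t>& slots, uint32_t windowBegin, uint32_t windowEnd)
{
    uint32_t best = kInvalid;
    for (uint32_t i = windowBegin; i < windowEnd; ++i)
        if (slots[i] & GLOBAL_READY)
            best = (best == kInvalid) ? i : std::max(best, i);
    return best;
}
```

Everything after that index only needs its LOCAL result, so those contributions can be summed in the same pass, which is the "accumulate the prefix faster" part of the comment.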
Description
Implementation of Block Chain Scan
Testing
Example XXX
TODO list:
We'll let @kpentaris review.