Block Chain Scan #884


Open. devshgraphicsprogramming wants to merge 17 commits into base: improve-workgroup-scan-2.

Conversation

devshgraphicsprogramming (Member):

Description

Implementation of Block Chain Scan

Testing

Example XXX

TODO list:

We'll let @kpentaris review.

```hlsl
namespace scan
{

template<class Config, class BinOp, bool ForwardProgressGuarantees, class device_capabilities=void>
```
devshgraphicsprogramming (Member Author):

we should make a fake device feature called forwardProgressGuarantees which is basically always false
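For illustration, a minimal sketch of what such a stub could look like (the struct name and usage are hypothetical; only the forwardProgressGuarantees flag itself is from this comment):

```hlsl
// Hypothetical stub: exposes forwardProgressGuarantees like a real device
// feature, but hardcoded to false because no GPU/API actually promises
// independent forward progress between workgroups.
struct FakeDeviceCapabilities
{
    NBL_CONSTEXPR_STATIC_INLINE bool forwardProgressGuarantees = false;
};

// instantiations would then read the flag off the capability struct, e.g.
// SomeScan<Config, BinOp, FakeDeviceCapabilities::forwardProgressGuarantees, FakeDeviceCapabilities>
```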

Comment on lines +32 to +39

```hlsl
template<typename T> // only uint32_t or uint64_t for now?
struct Constants
{
    NBL_CONSTEXPR_STATIC_INLINE T NOT_READY = 0;
    NBL_CONSTEXPR_STATIC_INLINE T LOCAL_COUNT = T(0x1u) << (sizeof(T)*8-2);
    NBL_CONSTEXPR_STATIC_INLINE T GLOBAL_COUNT = T(0x1u) << (sizeof(T)*8-1);
    NBL_CONSTEXPR_STATIC_INLINE T STATUS_MASK = LOCAL_COUNT | GLOBAL_COUNT;
};
```
devshgraphicsprogramming (Member Author):

you can use enum class if you update DXC btw, with : uint16_t or : uint64_t underlying types too

not everything needs to be a crazy template
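For illustration, the enum class version of the Constants above might look like this for the 32-bit case (a sketch, assuming a DXC new enough to accept scoped enums with explicit underlying types):

```hlsl
// same bit layout as Constants<uint32_t>, without the template machinery
enum class Status : uint32_t
{
    NOT_READY    = 0,
    LOCAL_COUNT  = 0x1u << 30, // bit sizeof(uint32_t)*8-2
    GLOBAL_COUNT = 0x1u << 31, // bit sizeof(uint32_t)*8-1
    STATUS_MASK  = LOCAL_COUNT | GLOBAL_COUNT
};
```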

```hlsl
scalar_t __call(NBL_REF_ARG(DataAccessor) dataAccessor, NBL_REF_ARG(ScratchAccessor) sharedMemScratchAccessor)
{
    const scalar_t localReduction = workgroup_reduce_t::__call<DataAccessor, ScratchAccessor>(dataAccessor, sharedMemScratchAccessor);
    bda::__ptr<T> scratch = dataAccessor.getScratchPtr(); // scratch data should be at least T[NumWorkgroups]
```
devshgraphicsprogramming (Member Author):

ask for a separate accessor for the workgroup flags

Also, 32-bit atomics are always present and probably the cheapest. There's no harm in packing 16 workgroup flags into the same uint32_t, but please benchmark whether it's a net gain (more likely to sit in the GPU's L2 cache) or a loss (contention).

devshgraphicsprogramming (Member Author):

aaah that might not be compatible with using UMax though
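If the flags were split out from the values (as suggested further down), the packing could use an OR instead of UMax. A rough sketch, assuming a spirv::atomicOr wrapper exists alongside the atomicIAdd/atomicUMax ones used in this PR, with all other names hypothetical:

```hlsl
// 16 workgroups share one uint32_t, 2 status bits each; setting a bit is an
// OR, so the value|flag UMax trick is no longer needed
NBL_CONSTEXPR_STATIC_INLINE uint32_t FlagsPerWord = 16u;

void markLocalReady(bda::__ptr<uint32_t> flags, const uint32_t virtualWorkgroupIndex)
{
    const uint32_t word = virtualWorkgroupIndex / FlagsPerWord;
    const uint32_t shift = (virtualWorkgroupIndex % FlagsPerWord) * 2u; // 0b01 local, 0b10 global
    spirv::atomicOr((flags + word).deref().__get_spv_ptr(),
        spv::ScopeDevice, spv::MemorySemanticsReleaseMask, 0x1u << shift);
}
```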

Comment on lines 21 to 22

```hlsl
template<class ReadOnlyDataAccessor, class ScratchAccessor NBL_FUNC_REQUIRES(workgroup2::ArithmeticReadOnlyDataAccessor<ReadOnlyDataAccessor,scalar_t> && workgroup2::ArithmeticSharedMemoryAccessor<ScratchAccessor,scalar_t>)
static scalar_t __call(NBL_REF_ARG(ReadOnlyDataAccessor) dataAccessor, NBL_REF_ARG(ScratchAccessor) sharedMemScratchAccessor)
```
devshgraphicsprogramming (Member Author), Jun 11, 2025:

you need a 3rd accessor which is an atomic accessor (to both accumulate the result and figure out when everyone is done)
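A rough shape for that third accessor (a hypothetical interface, not Nabla's actual concept):

```hlsl
// Hypothetical device-scope atomic accessor: one fetch-add serves both to
// accumulate partial results and to count how many workgroups have finished
struct AtomicScratchAccessor
{
    bda::__ptr<uint32_t> base;

    uint32_t fetchAdd(const uint32_t index, const uint32_t value)
    {
        return spirv::atomicIAdd((base + index).deref().__get_spv_ptr(),
            spv::ScopeDevice, spv::MemorySemanticsAcquireReleaseMask, value);
    }
};
```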

Comment on lines 153 to 167

```hlsl
// get last item from scratch
const uint32_t lastWorkgroup = glsl::gl_NumWorkGroups().x - 1;
bda::__ref<scalar_t> scratchLast = (scratch + lastWorkgroup).deref();
scalar_t value = constants_t::NOT_READY;
if (lastInvocation)
{
    // wait until last workgroup does reduction
    while (!(value & constants_t::GLOBAL_COUNT))
    {
        // value = spirv::atomicLoad(scratchLast.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsAcquireMask);
        value = spirv::atomicIAdd(scratchLast.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsAcquireMask, 0u);
    }
}
value = workgroup::Broadcast(value, sharedMemScratchAccessor, Config::WorkgroupSize-1);
return value & (~constants_t::STATUS_MASK);
```
devshgraphicsprogramming (Member Author):

This won't work even with forward progress guarantees; you just need to let the workgroup quit.

Comment on lines +52 to +53

```hlsl
template<class DataAccessor, class ScratchAccessor>
scalar_t __call(NBL_REF_ARG(DataAccessor) dataAccessor, NBL_REF_ARG(ScratchAccessor) sharedMemScratchAccessor)
```
devshgraphicsprogramming (Member Author):

you have the read-only accessor to get your element, and you have the scratch memory accessor (for the workgroup scans/reductions), but you don't have:

  • an accessor for the Device-Scope scratch
  • an accessor for where to store the reduction result (reduction is special compared to a scan; you can't get the result right away), as the sketch after this list shows
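Sketched out, the full signature could look like this (accessor and parameter names are illustrative, not the actual concepts):

```hlsl
// Hypothetical: all four accessors spelled out
template<class ReadOnlyDataAccessor, class ScratchAccessor,
    class DeviceScratchAccessor, class ReductionResultAccessor>
scalar_t __call(
    NBL_REF_ARG(ReadOnlyDataAccessor) dataAccessor,                // element reads
    NBL_REF_ARG(ScratchAccessor) sharedMemScratchAccessor,         // workgroup scan/reduce scratch
    NBL_REF_ARG(DeviceScratchAccessor) deviceScratchAccessor,      // Device-Scope lookback scratch
    NBL_REF_ARG(ReductionResultAccessor) reductionResultAccessor); // where the final result lands
```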

```hlsl
if (lastInvocation)
{
    bda::__ref<scalar_t> scratchId = (scratch + glsl::gl_WorkGroupID().x).deref();
    spirv::atomicUMax(scratchId.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsReleaseMask, localReduction|constants_t::LOCAL_COUNT);
```
devshgraphicsprogramming (Member Author):

you want to separate the storage of the reduction from the flags, I think.
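A sketch of that split, assuming values and flags live at separate addresses and a spirv::atomicStore wrapper exists alongside the atomicLoad one commented out elsewhere in this PR (reductions, flags, and virtualWorkgroupIndex are illustrative names):

```hlsl
if (lastInvocation)
{
    // store the reduction itself at its own address...
    spirv::atomicStore((reductions + virtualWorkgroupIndex).deref().__get_spv_ptr(),
        spv::ScopeDevice, spv::MemorySemanticsReleaseMask, localReduction);
    // ...then publish the status separately; with no value|flag packing,
    // a plain release store replaces the atomicUMax
    spirv::atomicStore((flags + virtualWorkgroupIndex).deref().__get_spv_ptr(),
        spv::ScopeDevice, spv::MemorySemanticsReleaseMask, constants_t::LOCAL_COUNT);
}
```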

Comment on lines +84 to +86

```hlsl
for (uint32_t i = 1; i <= glsl::gl_WorkGroupID().x; i++)
{
    const uint32_t prevID = glsl::gl_WorkGroupID().x-i;
```
devshgraphicsprogramming (Member Author):

don't use gl_WorkGroupID; ask for the virtualWorkgroupIndex in the function call
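I.e. something along these lines (sketch):

```hlsl
// the caller decides how virtual workgroups map to dispatched ones
// (e.g. for persistent workgroups) instead of the scan reading gl_WorkGroupID
template<class DataAccessor, class ScratchAccessor>
scalar_t __call(const uint32_t virtualWorkgroupIndex,
    NBL_REF_ARG(DataAccessor) dataAccessor,
    NBL_REF_ARG(ScratchAccessor) sharedMemScratchAccessor);
```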

Comment on lines 93 to 94

```hlsl
// value = spirv::atomicLoad(scratchPrev.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsAcquireMask);
value = spirv::atomicIAdd(scratchPrev.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsAcquireMask, 0u);
```
devshgraphicsprogramming (Member Author), Jun 11, 2025:

you'll have multiple workgroups doing this and you'll mess up the results; you want to accumulate those locally, here, in a register

so while you're walking backwards, you only do prefix += atomicLoad(scratchPrev.__get_spv_ptr(), spv::ScopeDevice, MakeVisible);

Also this requires that you have two different scratch store locations:

  • one for local reduction
  • one for global reduction

If you keep the global and local results at the same address, you get nasty data races: the status is a flag, not a mutex, so you can overwrite a local result with a global one while another workgroup is reading, and it will think the value it read is a local result because the flag is not updated yet.

P.S. You probably don't want to be writing out the GLOBAL results and updating status flags here even though you can, because other workgroups before you are obviously "just about" to write out their results, and this way you'll just introduce more uncached memory traffic.
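Putting those points together, the backward walk could look roughly like this. A sketch only: it assumes the split local/global scratch described above, a spirv::atomicLoad wrapper, and a virtualWorkgroupIndex parameter; the memory-semantics flags are simplified and the flags/locals/globals names are illustrative:

```hlsl
scalar_t prefix = scalar_t(0); // accumulate in a register, not in memory
for (uint32_t i = 1; i <= virtualWorkgroupIndex; i++)
{
    const uint32_t prevID = virtualWorkgroupIndex - i;
    uint32_t status;
    do
    {
        // spin only on the flag word
        status = spirv::atomicLoad((flags + prevID).deref().__get_spv_ptr(),
            spv::ScopeDevice, spv::MemorySemanticsAcquireMask);
    } while (status == constants_t::NOT_READY);
    if (status & constants_t::GLOBAL_COUNT)
    {
        // prev already folded in everything before it; take it and stop
        prefix = prefix + spirv::atomicLoad((globals + prevID).deref().__get_spv_ptr(),
            spv::ScopeDevice, spv::MemorySemanticsAcquireMask);
        break;
    }
    // otherwise fold in just prev's LOCAL reduction and keep walking back
    prefix = prefix + spirv::atomicLoad((locals + prevID).deref().__get_spv_ptr(),
        spv::ScopeDevice, spv::MemorySemanticsAcquireMask);
}
```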

Comment on lines +81 to +84

```hlsl
if (lastInvocation) // don't make whole block work and do busy stuff
{
    // for (uint32_t prevID = glsl::gl_WorkGroupID().x-1; prevID >= 0u; prevID--) // won't run properly this way for some reason, results in device lost
    // (reason: prevID is unsigned, so prevID >= 0u is always true and the loop never exits)
    for (uint32_t i = 1; i <= glsl::gl_WorkGroupID().x; i++)
```
devshgraphicsprogramming (Member Author):

actually, using the whole workgroup or at least a subgroup (benchmark it) would be much faster here: each invocation checks one workgroup, and you can use workgroup2::reduce with a Max binop to find the first preceding workgroup with a ready GLOBAL scan value

You'd also be able to accumulate the prefix faster over the ones which have LOCAL ready
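One window of that could be sketched with plain HLSL wave intrinsics standing in for workgroup2::reduce (this elides spinning on not-yet-ready predecessors and walking further back when no GLOBAL value sits in the window; flags/locals/virtualWorkgroupIndex are the illustrative names from above):

```hlsl
// lane i inspects predecessor virtualWorkgroupIndex-1-i; WaveActiveMax plays
// the role of the Max-binop reduce to find the closest GLOBAL-ready workgroup
const uint32_t lane = WaveGetLaneIndex();
const uint32_t window = min(virtualWorkgroupIndex, WaveGetLaneCount());
uint32_t globalReadyAt = 0u; // 0 is the "none in this window" sentinel
scalar_t lanePartial = scalar_t(0);
if (lane < window)
{
    const uint32_t prevID = virtualWorkgroupIndex - 1u - lane;
    const uint32_t status = spirv::atomicLoad((flags + prevID).deref().__get_spv_ptr(),
        spv::ScopeDevice, spv::MemorySemanticsAcquireMask);
    if (status & constants_t::GLOBAL_COUNT)
        globalReadyAt = prevID + 1u; // +1 keeps 0 as the sentinel
    else if (status & constants_t::LOCAL_COUNT)
        lanePartial = spirv::atomicLoad((locals + prevID).deref().__get_spv_ptr(),
            spv::ScopeDevice, spv::MemorySemanticsAcquireMask);
}
const uint32_t stopAt = WaveActiveMax(globalReadyAt);
// lanes covering workgroups at or after stopAt contribute their partials;
// if stopAt == 0, sum the whole window and continue with the next one
```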
