Conversation

@vpietila-amd (Contributor)

Proposed changes

Added concepts and builder functionality for building forward convolutions from the CK library. A limitation of the current implementation is that the convolution specialization is hard-coded to DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3; the implementation will be generalized to other specializations in a later PR. Added unit tests to verify that the builder builds valid instances. Most of the parameter checking is done at compile time.

The implementation in this PR builds upon the prototype from this branch: https://github.com/ROCm/composable_kernel/tree/jshumway/convolution-builder
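
For orientation, here is a minimal usage sketch, assuming a compile-time signature struct whose members mirror the ConvSignatureDescriptor concept quoted later in this review. The enumerator spellings (ConvDirection::FORWARD, the 2D layout value) and the builder entry point build_conv_fwd are illustrative assumptions, not the merged API.

```cpp
// Illustrative sketch only -- member names follow the ConvSignatureDescriptor
// concept discussed in the review below; enumerator spellings are assumptions.
struct ConvSignature
{
    static constexpr unsigned int spatial_dim = 2;
    static constexpr auto direction = ConvDirection::FORWARD;               // assumed enumerator name
    static constexpr auto layout    = GroupConvLayout2D::GNHWC_GKYXC_GNHWK; // a 2D layout named in the commits
    static constexpr auto data_type = DataType::FP16;
};

// The builder turns such a compile-time description into a concrete
// DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3 instance, e.g. (hypothetical call):
// auto instance = builder::build_conv_fwd<ConvSignature>();
```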

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, if the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

shumway and others added 24 commits October 17, 2025 03:40
- Add experimental/builder directory with README documentation.
- Create initial test infrastructure with CMakeLists.txt and placeholder test.
- Update root CMakeLists.txt to support CK_EXPERIMENTAL_BUILDER option.
- Update .gitignore to not treat `experimental/builder` as a CMake build directory.

This establishes the directory structure for a high-level builder pattern that will provide a semantically clear interface for constructing CK operations, with an initial focus on convolution kernels for MIOpen integration.
- Add experimental/builder CMakeLists.txt with proper subdirectory structure
- Add placeholder include/ck_tile/builder CMakeLists.txt for header installation
- Fix gtest.cmake to use include_guard to prevent multiple inclusions
- Update root CMakeLists.txt to include full builder directory instead of just tests
case DataType::FP32: return "FP32";
case DataType::BF16: return "BF16";
case DataType::FP8: return "FP8";
case DataType::I8: return "I8";

Collaborator:
Is "S8" more common (signed eight bit integer)?

Contributor Author:
Changed to S8, although we are not yet using the type.

{
switch(layout)
{
case GroupConvLayout::CHANNELS_FIRST: return "Channels-first (NCHW)";

Collaborator:
I've been trying to figure this out. I really like channels-first and channels-last for convolutions, but it looks like grouped convolutions turn this into an alphabet soup. I think the filter in grouped convolutions can still be channels-first or channels-last, but the image and feature-map (input and output) tensors appear to have a lot of different conventions for where the group index fits in the layout. This is OK for now, but we probably want to think about how best to describe these layouts.

Contributor Author:
I revised the implementation and introduced separate conv layout enums GroupConvLayout1D, GroupConvLayout2D, and GroupConvLayout3D. They explicitly indicate the layout of the input, filter, and output tensors. I'm wondering if they should be merged into a single enum, since they carry somewhat redundant dimension information. What do you think?
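
For illustration, a rough sketch of the split, using only the layout names that appear in the commit messages elsewhere in this PR; the actual enumerator lists are likely longer.

```cpp
// Sketch only -- each enumerator spells out the input / filter / output
// tensor layouts, so the dimensionality shows up in both the enum name
// and its values.
enum class GroupConvLayout2D
{
    GNHWC_GKYXC_GNHWK,  // channels-last, group index leading
    NGCHW_GKYXC_NGKHW,  // channels-first
    NGCHW_GKCYX_NGKHW,
};

enum class GroupConvLayout3D
{
    GNDHWC_GKZYXC_GNDHWK,
    NDHWGC_GKZYXC_NDHWGK, // channels-last
};
```

Merging them into a single enum would remove the repeated dimensionality, but keeping them separate lets a mismatch between spatial_dim and the layout be rejected at compile time.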

case ConvFwdSpecialization::FILTER_3x3:
return ck::tensor_operation::device::ConvolutionForwardSpecialization::Filter3x3;
case ConvFwdSpecialization::DEFAULT:
default: return ck::tensor_operation::device::ConvolutionForwardSpecialization::Default;

Collaborator:
Robin suggested removing the "default:", making the function consteval, and throwing an error string for unexpected values. That way we get a compile-time error if we have an unsupported input value. In principle the compiler will also error if we omit "default:" and miss a value, but the consteval + throw pattern is robust if the function is only used at compile time. The downside is that the function can only be used at compile time, but that may be correct for this code.
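
For reference, the pattern would look roughly like this; the sketch is built from the switch quoted above, the function name is illustrative, and the remaining enumerators are elided.

```cpp
// consteval forces compile-time evaluation; reaching the throw is not a valid
// constant expression, so an unhandled enumerator becomes a compile-time error.
consteval auto to_ck_specialization(ConvFwdSpecialization specialization)
{
    switch(specialization)
    {
    case ConvFwdSpecialization::FILTER_3x3:
        return ck::tensor_operation::device::ConvolutionForwardSpecialization::Filter3x3;
    case ConvFwdSpecialization::DEFAULT:
        return ck::tensor_operation::device::ConvolutionForwardSpecialization::Default;
    // ... remaining enumerators ...
    }
    throw "unsupported ConvFwdSpecialization"; // only reachable for unhandled values
}
```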

Contributor Author:
Should we make all factory-internal functions consteval rather than constexpr? At least the ones that use a switch?

vpietila-amd and others added 3 commits October 22, 2025 10:11
  - Add missing 2D layouts: GNHWC_GKYXC_GNHWK, NGCHW_GKCYX_NGKHW
  - Add missing 3D layout: GNDHWC_GKZYXC_GNDHWK
  - Add 1D layouts (NWGC, NGCW, GNWC, NGCW_GKCX) for future support
  - Add 3 tests for new 2D/3D layouts
  - All tests pass (5/5)
  - Add test for 2D NGCHW_GKYXC_NGKHW (channels-first) with Filter1x1Stride1Pad0
  - Add test for 3D NDHWGC_GKZYXC_NDHWGK (channels-last)
  - All 7 tests pass (complete coverage for all 2D/3D forward layouts)
spolifroni-amd previously approved these changes Oct 23, 2025
{ t.src_vector_dim } -> std::convertible_to<size_t>;
{ t.src_scalar_per_vector } -> std::convertible_to<size_t>;
{ t.dest_scalar_per_vector_k1 } -> std::convertible_to<size_t>;
{ t.add_extra } -> std::convertible_to<bool>;

Collaborator:
This should be named lds_padding.

@bartekxk (Contributor) left a comment:
Looks very nice

concept InputVectorTransferDescriptor = requires(T t) {
{ t.src_vector_dim } -> std::convertible_to<size_t>;
{ t.src_scalar_per_vector } -> std::convertible_to<size_t>;
{ t.dest_scalar_per_vector_k1 } -> std::convertible_to<size_t>;

Contributor:
Maybe lds_dst_scalar_per_vector is a better name.

Collaborator:
Indeed more descriptive.

Contributor Author:
Renamed as suggested.

template <typename T>
concept InputVectorTransferDescriptor = requires(T t) {
{ t.src_vector_dim } -> std::convertible_to<size_t>;
{ t.src_scalar_per_vector } -> std::convertible_to<size_t>;

Contributor:
Can we add a boolean IsDirectLoad? It will be introduced in #3082.

Contributor Author:
Added a requirement for the is_direct_load member variable.
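
Putting the feedback from this thread together, the concept would look roughly like this; whether lds_padding belongs to this same descriptor is an assumption based on the snippet quoted earlier.

```cpp
#include <concepts>
#include <cstddef>

template <typename T>
concept InputVectorTransferDescriptor = requires(T t) {
    { t.src_vector_dim }            -> std::convertible_to<std::size_t>;
    { t.src_scalar_per_vector }     -> std::convertible_to<std::size_t>;
    { t.lds_dst_scalar_per_vector } -> std::convertible_to<std::size_t>; // was dest_scalar_per_vector_k1
    { t.lds_padding }               -> std::convertible_to<bool>;        // was add_extra (assumed same descriptor)
    { t.is_direct_load }            -> std::convertible_to<bool>;        // new requirement from this thread
};
```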

{
return ck::tensor_operation::device::ConvolutionForwardSpecialization::Filter1x1Stride1Pad0;
}
else if constexpr(specialization == ConvFwdSpecialization::ODD_C)

Contributor:
We can delete this specialization since it duplicates ODD_C. These instances should be deleted in #2281.

FP16,
BF16,
FP8,
S8

Contributor:
Maybe I8 instead of S8?

Contributor Author:
I initially had I8, but @shumway was thinking that S8 is more standard for a signed 8-bit integer.

Contributor:
I think the general convention is I for signed, U for unsigned.

Here's a similar topic

enum class BlockGemmPipelineVersion
{
V1,
V3,

Contributor:
missed V2

// Fused element-wise operations.
enum class ElementwiseOperation
{
BIAS,

Contributor:
Missed BIAS_BNORM_CLAMP

Comment on lines +45 to +51
concept ConvSignatureDescriptor = requires(T t) {
{ t.spatial_dim } -> std::convertible_to<unsigned int>;
{ t.direction } -> std::convertible_to<ConvDirection>;
requires std::convertible_to<decltype(t.layout), GroupConvLayout1D> ||
std::convertible_to<decltype(t.layout), GroupConvLayout2D> ||
std::convertible_to<decltype(t.layout), GroupConvLayout3D>;
{ t.data_type } -> std::convertible_to<DataType>;

Collaborator:
What about the fused op (or actually elementwise, since it can be applied to the inputs as well)? Should we actually check it?
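
One possible way to check it, sketched below: extend the concept quoted above with a requirement on the elementwise op. The member name elementwise_op is an assumption.

```cpp
template <typename T>
concept ConvSignatureDescriptor = requires(T t) {
    { t.spatial_dim } -> std::convertible_to<unsigned int>;
    { t.direction }   -> std::convertible_to<ConvDirection>;
    requires std::convertible_to<decltype(t.layout), GroupConvLayout1D> ||
             std::convertible_to<decltype(t.layout), GroupConvLayout2D> ||
             std::convertible_to<decltype(t.layout), GroupConvLayout3D>;
    { t.data_type }      -> std::convertible_to<DataType>;
    { t.elementwise_op } -> std::convertible_to<ElementwiseOperation>; // assumed member name
};
```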
