Conversation

@k-artem k-artem commented Jul 24, 2025

This patch delivers several fixes for build issues in the CUDA part of the DeepSpeed library. The percentage of passing unit tests improved (tested on RDNA hardware, gfx110x and gfx12x).

Before:
collected 5298 items / 15 skipped
2773 failed, 862 passed, 1665 skipped, 13 errors
After:
collected 5851 items / 11 skipped
4187 failed, 1373 passed, 292 skipped, 10 errors

Regarding testing of fp_quantizer (DS_BUILD_FP_QUANTIZER) via tests/unit/ops/fp_quantizer/test_fp_quant.py: this test depends on QPyTorch, which must be patched before running on AMD; please apply Tiiiger/QPyTorch#71.

@k-artem k-artem requested a review from hwchen2017 July 25, 2025 16:01
@k-artem k-artem force-pushed the improve_support_of_amd_hardware branch from 5851003 to 1dc6bb7 on July 31, 2025 10:25
k-artem commented Jul 31, 2025

@hwchen2017 kindly asking for another review now that your comments have been addressed.

@hwchen2017 hwchen2017 marked this pull request as draft August 1, 2025 23:21
@k-artem k-artem force-pushed the improve_support_of_amd_hardware branch from 09b1953 to f2dbbb7 on August 3, 2025 15:11
k-artem added 3 commits August 3, 2025 15:15

Signed-off-by: Artem Kuzmitckii <[email protected]>
Signed-off-by: Artem Kuzmitckii <[email protected]>
part 2

Signed-off-by: Artem Kuzmitckii <[email protected]>
@k-artem k-artem force-pushed the improve_support_of_amd_hardware branch from f2dbbb7 to 77a7e06 on August 3, 2025 15:18
@k-artem k-artem marked this pull request as ready for review August 3, 2025 15:19
@k-artem k-artem requested a review from hwchen2017 August 3, 2025 15:19
@k-artem k-artem force-pushed the improve_support_of_amd_hardware branch from 45a01df to 0946828 on August 18, 2025 17:22
loadams commented Sep 2, 2025

@k-artem - is this ready for final review? @hwchen2017 - any remaining review requests?

@hwchen2017 hwchen2017 (Contributor) left a comment

Can you share the error messages you get on AMD GPUs and explain why these changes fix the issues? It would help us better understand this PR. Thanks!

    max(TilingConfig::SMEM_SIZE_B_TILE + SMEM_SIZE_A1_TILE + SMEM_SIZE_A2_TILE,
        TilingConfig::SMEM_SIZE_C_TILE);
    cudaFuncSetAttribute(QUANT_GEMM_Kernel<TilingConfig, OutputDataType>,
    auto kernel = QUANT_GEMM_Kernel<TilingConfig, OutputDataType>;
@hwchen2017 (Contributor):

The change here seems unnecessary because this op only runs on CUDA.

@k-artem k-artem (Author) commented Sep 4, 2025

Yep, currently there is no support, so it can be removed, or we can keep it for a proper cast, since both CUDA and HIP declare the first argument as const void *. Please share your opinion.
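
For context, a minimal sketch of the pattern under discussion (using the names from the snippet above; not a proposed patch): both the CUDA and HIP C APIs take the kernel entry point as const void *, so binding the kernel to a local and casting explicitly keeps the call identical on both toolchains.

    // Sketch: explicit cast of the kernel entry point. cudaFuncSetAttribute
    // and its hipified counterpart hipFuncSetAttribute both accept the
    // function as const void *, so this form works on either toolchain.
    size_t SHMEM_SZ = max(TilingConfig::SMEM_SIZE_B_TILE + SMEM_SIZE_A1_TILE + SMEM_SIZE_A2_TILE,
                          TilingConfig::SMEM_SIZE_C_TILE);
    auto kernel = QUANT_GEMM_Kernel<TilingConfig, OutputDataType>;
    cudaFuncSetAttribute((const void*)kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         SHMEM_SZ);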

@hwchen2017 (Contributor):

Let's remove it for now.

@k-artem (Author):

ok

@hwchen2017 (Contributor):

What's the purpose of changing the file extension here? Usually the .cpp extension is used for the API definition.

@k-artem (Author):

If bf16 support is enabled, these files should be built with nvcc/hipcc, not the regular compiler that cpp_extension.py selects by default for files with the .cpp extension. And actually, as far as I can see, the code is essentially CUDA.

@hwchen2017 (Contributor):

The binding code may or may not contain CUDA code. That's why people use .cpp by default. Let's keep it consistent in the repo.

@k-artem (Author):

Yep, let me add more detail on why I did the rename: with HIP and BF16_AVAILABLE, these files include CUDA code that cannot be compiled by the regular compiler, which torch/utils/cpp_extension.py (_is_cuda_file) selects based on the file's extension. For example:

    In file included from /opt/rocm/include/hip/amd_detail/amd_warp_functions.h:27,
                     from /opt/rocm/include/hip/amd_detail/amd_hip_bf16.h:113,
                     from /opt/rocm/include/hip/hip_bf16.h:29,
                     from /myworkspace/DeepSpeed/csrc/includes/ds_kernel_utils_hip.h:18,
                     from /myworkspace/DeepSpeed/csrc/includes/quantization_hip.h:10,
                     from csrc/quantization/pt_binding_hip.cpp:11

and as a result the __builtin_amdgcn_* intrinsics cannot be found.
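
As a concrete illustration (a hypothetical probe file, not part of the PR): a translation unit like the one below compiles when driven by hipcc but fails under the plain host compiler that cpp_extension.py picks for .cpp files, reproducing the __builtin_amdgcn_* error above.

    // bf16_probe.cpp (hypothetical file name): builds with
    //   hipcc -DBF16_AVAILABLE -c bf16_probe.cpp
    // but fails with a plain host compiler such as g++, because
    // hip_bf16.h ultimately expands to __builtin_amdgcn_* intrinsics
    // that only the device-aware compiler provides.
    #ifdef BF16_AVAILABLE
    #include <hip/hip_bf16.h>
    #endif

    int main()
    {
    #ifdef BF16_AVAILABLE
        __hip_bfloat16 x;  // bf16 type from the HIP header
        (void)x;
    #endif
        return 0;
    }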

    # Op builder change under review: enable bf16 when building against ROCm PyTorch.
    name = self.NAME if name is None else name
    super().__init__(name=name)
    if self.is_rocm_pytorch():
        self.enable_bf16 = True
@hwchen2017 (Contributor) left a comment

@k-artem (Author):

ok

Signed-off-by: Artem Kuzmitckii <[email protected]>
Signed-off-by: Artem Kuzmitckii <[email protected]>
k-artem commented Oct 13, 2025

Hi @hwchen2017 @loadams, apologies for the delay on this PR. I've updated the code according to the last set of comments; please review the changes. I've enabled bf16 library-wide; however, I've disabled it for transformer_inference, because there we would need to explicitly declare the bf16 type inside pt_binding.cpp, which is currently not possible due to the limitation I mentioned earlier. For some extensions, such as fp_quantizer, I was able to resolve the issue using a forward declaration. Thank you in advance for your review. I'd also appreciate any ideas about acceptable ways to enable bf16 for transformer_inference on the library side.
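
To make the forward-declaration workaround concrete, here is a minimal sketch (hypothetical names, not the actual fp_quantizer code): when the host-compiled binding file only passes bf16 buffers around as pointers, declaring the type is enough, and the bf16 headers that require nvcc/hipcc stay confined to the .cu/.hip files.

    // pt_binding-style sketch (hypothetical): the host-compiled file never
    // dereferences bf16 values, so a forward declaration suffices and
    // <hip/hip_bf16.h> is only included in the hipcc-compiled kernel file.
    struct __hip_bfloat16;  // forward declaration, no bf16 header needed here

    // Implemented in a .cu/.hip file compiled by nvcc/hipcc.
    void launch_quant_kernel(__hip_bfloat16* data, int n);

    void quantize_entry(void* raw, int n)
    {
        // Opaque cast: the host compiler only ever sees a pointer type.
        launch_quant_kernel(reinterpret_cast<__hip_bfloat16*>(raw), n);
    }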

@k-artem k-artem requested a review from hwchen2017 October 13, 2025 15:46