
Conversation

@Micky774 (Contributor) commented on Oct 28, 2025

Description

Feature-update PR containing several iterative changes for client-driven optimization targets. It includes both API changes for CK/AITER and changes to the internal integration. See the list of changes for specifics.

Note that this will not be ready to merge until ROCm/aiter#1212 is merged and this PR's AITER commit is updated.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Integrated support for native padding kernels in fwd/bwd
  • Added BSHD + Padding --> THD + Padding conversion mechanism
  • Streamlined memory allocation logic
  • Added runtime max_seqlen calculation gated by new env var NVTE_CK_RUNTIME_MAX_SEQLEN
  • Added v3_api_check support (temporary)
  • Implemented the new AITER/CK API
  • Updated MQA post-processing kernels
  • Removed pad_between_seqs (a follow-up PR is needed to clean up the test suite for the old pad_between_seqs edge cases)
  • Added NVTE_CK_RUNTIME_NUM_SEGMENTS to guard the runtime calculation of the number of segments in the JAX integration (see the usage sketch after this list)
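
A minimal usage sketch for the two new environment variables follows. The names come from this PR, but the accepted values ("0"/"1") and the exact point at which the backend reads them are assumptions; as noted in the review discussion below, enabling these runtime paths is also expected to break CUDA graph capture:

import os

# Assumption: "1" opts in to the runtime calculation, and the flags are read when the
# fused-attention call is dispatched, so they are set before building/running the model.

# Compute max_seqlen at runtime in the CK fused-attention path.
os.environ["NVTE_CK_RUNTIME_MAX_SEQLEN"] = "1"

# Compute the number of segments at runtime in the JAX integration.
os.environ["NVTE_CK_RUNTIME_NUM_SEGMENTS"] = "1"

# ... import transformer_engine and build/run the model as usual afterwards.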

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@wangye805 (Collaborator) left a comment

Generally, I think we can try to remove all memsets except for dq and dq_acc. We can confirm with the AITER/CK people.
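
For illustration only, a PyTorch-level sketch of what keeping zero-initialization for just dq/dq_acc might look like; the real allocations live in the C++/HIP backend, and the buffer names, shapes, and the assumption about which outputs the kernels fully overwrite are hypothetical:

import torch

def alloc_bwd_buffers(b, s, h, d, dtype=torch.bfloat16, device="cuda"):
    # dq (and its fp32 accumulator) are accumulated across key/value blocks,
    # so they presumably still need zero-initialization.
    dq = torch.zeros(b, s, h, d, dtype=dtype, device=device)
    dq_acc = torch.zeros(b, s, h, d, dtype=torch.float32, device=device)
    # dk/dv are assumed to be fully overwritten by the backward kernel,
    # so uninitialized allocations avoid the extra memsets.
    dk = torch.empty(b, s, h, d, dtype=dtype, device=device)
    dv = torch.empty(b, s, h, d, dtype=dtype, device=device)
    return dq, dq_acc, dk, dv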

@wangye805 (Collaborator) commented

Let's also add how to use the runtime segment/max seqlen options to the README under https://github.com/ROCm/TransformerEngine?tab=readme-ov-file#fused-attention-backends-on-rocm, and remind our customers that this will break CUDA graphs.

@Micky774 (Contributor, Author) commented

> Let's also add how to use the runtime segment/max seqlen options to the README under https://github.com/ROCm/TransformerEngine?tab=readme-ov-file#fused-attention-backends-on-rocm, and remind our customers that this will break CUDA graphs.

@wangye805 I've now updated the readme, but let me know if you have specific thoughts on it.

@wangye805 (Collaborator) left a comment

Please take a look at the several previously unresolved conversations.

- Updated debug message for BSHD-->THD conversion
- Added env variable to gate FWD output memset for padding
- Removed guards on memsets for d{Q,K,V} matrices
@wenchenvincent (Collaborator) commented

@Micky774 Could you rebase/merge the latest dev to incorporate the hotfixes for the sgpu tests?

@wangye805 (Collaborator) commented

PyTorch test_numerics also shows some fused-attn-related failures:
FAILED tests/pytorch/test_numerics.py::test_kv_cache_accuracy[False-FusedAttention-TransformerLayer-sbhd-False-126m-1-dtype1] - AssertionError: Outputs not close enough in tensor at idx=0. Maximum difference at location [0, 650] with -0.90625 vs 0.5654296875 (diff 1.4716796875).

Not sure whether this is related to our decision to remove the memsets.

@Micky774 (Contributor, Author) commented

> PyTorch test_numerics also shows some fused-attn-related failures: FAILED tests/pytorch/test_numerics.py::test_kv_cache_accuracy[False-FusedAttention-TransformerLayer-sbhd-False-126m-1-dtype1] - AssertionError: Outputs not close enough in tensor at idx=0. Maximum difference at location [0, 650] with -0.90625 vs 0.5654296875 (diff 1.4716796875).
>
> Not sure whether this is related to our decision to remove the memsets.

Those failures were caused by a combination of not dispatching correctly to the is_SBHD workflow when handling SBHD_2BSHD formats and miscalculating strides for that same format. Resolved now.
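
For context on the stride half of that fix, a generic PyTorch illustration (not the TE code) of how the same logical tensor carries different strides under sbhd versus bshd storage, which is why routing an SBHD_2BSHD case down the wrong path produces wrong offsets rather than a crash:

import torch

b, s, h, d = 2, 8, 4, 16

bshd = torch.randn(b, s, h, d)  # batch-major storage
sbhd = torch.randn(s, b, h, d)  # sequence-major storage

print(bshd.stride())  # (s*h*d, h*d, d, 1) == (512, 64, 16, 1)
print(sbhd.stride())  # (b*h*d, h*d, d, 1) == (128, 64, 16, 1)

# The head and hidden-dim strides match, but the batch and sequence strides swap,
# so the wrong dispatch reads valid memory at the wrong offsets and fails only
# numerically, as in the test_kv_cache_accuracy mismatch above.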

@wangye805 (Collaborator) left a comment

For the newly added hybrid QKV formats in upstream (NVTE_SBHD_2BSHD, NVTE_BSHD_2SBHD, NVTE_THD_2BSHD, and NVTE_THD_2SBHD): in addition to the SBHD_2BSHD pytest failures, are we able to correctly handle the other three? Or are there only SBHD_2BSHD pytests right now?

NV upstream separates the format and is_ragged flags for q and kv and does the subsequent processing accordingly:

NVTE_QKV_Format q_format = nvte_get_q_format(layout);
NVTE_QKV_Format kv_format = nvte_get_kv_format(layout);
bool is_ragged_q = (q_format == NVTE_QKV_Format::NVTE_THD);
bool is_ragged_kv = (kv_format == NVTE_QKV_Format::NVTE_THD);

Maybe we can try a similar technique. If I recall correctly, we need padding/unpadding for just q in SBHD_2BSHD and for just k/v in BSHD_2SBHD.

Or it's okay if you want to leave this for another PR.
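
A hedged Python mirror of that idea, assuming the *_2* naming means "q format followed by kv format"; the function names are illustrative and not part of the actual TE API:

def split_qkv_format(layout: str) -> tuple[str, str]:
    """'SBHD_2BSHD' -> ('SBHD', 'BSHD'); a plain format applies to both q and kv."""
    if "_2" in layout:
        q_fmt, kv_fmt = layout.split("_2", 1)
        return q_fmt, kv_fmt
    return layout, layout

def ragged_flags(layout: str) -> tuple[bool, bool]:
    """Return (is_ragged_q, is_ragged_kv), mirroring the upstream C++ above."""
    q_fmt, kv_fmt = split_qkv_format(layout)
    return q_fmt == "THD", kv_fmt == "THD"

assert ragged_flags("THD_2BSHD") == (True, False)    # only q needs ragged handling
assert ragged_flags("BSHD_2SBHD") == (False, False)  # neither is ragged; only strides differ
assert ragged_flags("THD") == (True, True)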

By the way, there is an "extra line" comment you may have overlooked :-)

@wangye805 (Collaborator) commented

In fact, I saw some level 3 PyTorch CP pytest failures when running level 3 CI locally:

=========================== short test summary info ============================
FAILED tests/pytorch/fused_attn/test_fused_attn_with_cp.py::test_cp_with_fused_attention[False-p2p-thd-cp_1_0-bf16] - subprocess.CalledProcessError: Command '['python3', '-m', 'torch.distributed.launch', '--nproc-per-node=2', '/workspace/te_native_bshd_thd/tests/pytorch/fused_attn/run_fused_attn_with_cp.py', 'dtype=bf16', 'model=cp_1_0', 'qkv_format=thd', 'kernel_backend=FusedAttention', 'cp_comm_type=p2p', 'fp8_mha=False']' returned non-zero exit status 1.
FAILED tests/pytorch/fused_attn/test_fused_attn_with_cp.py::test_cp_with_fused_attention[False-p2p-thd-cp_1_1-bf16] - subprocess.CalledProcessError: Command '['python3', '-m', 'torch.distributed.launch', '--nproc-per-node=2', '/workspace/te_native_bshd_thd/tests/pytorch/fused_attn/run_fused_attn_with_cp.py', 'dtype=bf16', 'model=cp_1_1', 'qkv_format=thd', 'kernel_backend=FusedAttention', 'cp_comm_type=p2p', 'fp8_mha=False']' returned non-zero exit status 1.
FAILED tests/pytorch/fused_attn/test_fused_attn_with_cp.py::test_cp_with_fused_attention[False-p2p-thd-cp_2_0-bf16] - subprocess.CalledProcessError: Command '['python3', '-m', 'torch.distributed.launch', '--nproc-per-node=2', '/workspace/te_native_bshd_thd/tests/pytorch/fused_attn/run_fused_attn_with_cp.py', 'dtype=bf16', 'model=cp_2_0', 'qkv_format=thd', 'kernel_backend=FusedAttention', 'cp_comm_type=p2p', 'fp8_mha=False']' returned non-zero exit status 1.
FAILED tests/pytorch/fused_attn/test_fused_attn_with_cp.py::test_cp_with_fused_attention[False-p2p-thd-cp_2_1-bf16] - subprocess.CalledProcessError: Command '['python3', '-m', 'torch.distributed.launch', '--nproc-per-node=2', '/workspace/te_native_bshd_thd/tests/pytorch/fused_attn/run_fused_attn_with_cp.py', 'dtype=bf16', 'model=cp_2_1', 'qkv_format=thd', 'kernel_backend=FusedAttention', 'cp_comm_type=p2p', 'fp8_mha=False']' returned non-zero exit status 1.
SKIPPED [48] tests/pytorch/fused_attn/test_fused_attn_with_cp.py:68: CP implementation with KV P2P does not support sliding window yet!
SKIPPED [16] tests/pytorch/fused_attn/test_fused_attn_with_cp.py:70: CP implementation with KV all-gather does not support THD format yet!
SKIPPED [24] tests/pytorch/fused_attn/test_fused_attn_with_cp.py:74: CP implementation with QKVO A2A does not support THD format yet!
SKIPPED [240] tests/pytorch/fused_attn/test_fused_attn_with_cp.py:133: FP8 attention has not been supported on ROCm yet!
SKIPPED [40] tests/pytorch/fused_attn/test_fused_attn_with_cp.py:153: CP implementation with KV P2P does not support sliding window yet!
SKIPPED [64] tests/pytorch/fused_attn/test_fused_attn_with_cp.py:137: THD format does not support post_scale_bias yet!
SKIPPED [32] tests/pytorch/fused_attn/test_fused_attn_with_cp.py:155: CP implementation with KV all-gather does not support bias yet!
SKIPPED [24] tests/pytorch/fused_attn/test_fused_attn_with_cp.py:139: CP implementation with KV all-gather does not support THD format yet!
SKIPPED [64] tests/pytorch/fused_attn/test_fused_attn_with_cp.py:157: CP implementation with QKVO A2A does not support bias yet!
SKIPPED [48] tests/pytorch/fused_attn/test_fused_attn_with_cp.py:141: CP implementation with QKVO A2A does not support THD format yet!
SKIPPED [104] tests/pytorch/fused_attn/test_fused_attn_with_cp.py:164: Only fp8 works with fp8_mha=True!
===== 4 failed, 204 passed, 704 skipped, 2 warnings in 3065.08s (0:51:05) ======
Error in test [ck] fused_attn/test_fused_attn_with_cp.py
Done [ck] fused_attn/test_fused_attn_with_cp.py
Got 1 test errors during run at level 3

The detailed log is attached:
torch_mgpu.txt
