fix the problem that v3's performance is worse than ck's #1237

minmengdie · 2025-10-22T08:43:50Z

Motivation

Fix the problem that the fwd v3 kernel performs worse than the CK kernel.

Technical Details

Group mode now uses a different grid dimension layout to launch the kernel.
Update the fwd v3 kernel .co file.
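
For illustration only, here is a minimal Python sketch of what "a different grid dimension layout" could look like. The axis ordering in both functions, the tile size of 256, and the names `GridDim`, `ceil_div`, `batch_mode_grid`, and `group_mode_grid` are assumptions made for this example and are not taken from this PR's `codegen.py`; the actual launcher may arrange the grid differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridDim:
    """Hypothetical container for the three launch grid dimensions."""
    x: int
    y: int
    z: int


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive integers."""
    return (a + b - 1) // b


def batch_mode_grid(batch: int, nhead: int, seqlen_q: int, tile_q: int) -> GridDim:
    # Assumed layout: query tiles on x, heads on y, batches on z.
    return GridDim(ceil_div(seqlen_q, tile_q), nhead, batch)


def group_mode_grid(batch: int, nhead: int, max_seqlen_q: int, tile_q: int) -> GridDim:
    # Assumed layout: heads on x, query tiles on y, batches on z.
    # Sequences in group mode can differ in length, so the tile count is
    # sized by the longest one (max_seqlen_q).
    return GridDim(nhead, ceil_div(max_seqlen_q, tile_q), batch)


if __name__ == "__main__":
    # Shape from the test plan: b=32, h=128, s=1024; tile_q=256 is an assumption.
    print(batch_mode_grid(32, 128, 1024, 256))
    print(group_mode_grid(32, 128, 1024, 256))
```

Both layouts launch the same number of workgroups for a given shape; only the mapping of heads and query tiles to grid axes changes, which can affect how workgroups are scheduled.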

Test Plan

MI300/MI308/MI355:
./benchmark_mha_fwd -mode=1 -b=32 -h=128 -h_k=8 -s=1024 -d=128 -iperm=0 -operm=0 -prec=bf16 -lse=1 -kname=1 -mask=2 -v=0
bash smoke_test_fwd_v3.sh

Test Result

perf result:
MI300:

MI308:

MI355:

smoke test:
MI300:

MI308:

MI355:

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull Request Overview

This PR fixes a performance regression in the v3 FMHA (Fused Multi-Head Attention) implementation by introducing a new group-based kernel launch strategy. The changes replace the standard dispatcher with a group dispatcher that uses different grid/block dimensions to improve kernel launch efficiency.

Key changes:

Added launch_kernel_group method with optimized grid dimension calculations
Introduced fmha_fwd_v3_group_dispatcher function with modified tuning parameters
Updated all kernel launch call sites to use the new group dispatcher

Reviewed Changes

Copilot reviewed 2 out of 18 changed files in this pull request and generated 3 comments.

File	Description
hsa/gfx950/fmha_v3_fwd/codegen.py	Adds group-based kernel launcher for gfx950 with tune_opt=5 and architecture-specific block dimensions
hsa/gfx942/fmha_v3_fwd/codegen.py	Adds group-based kernel launcher for gfx942 with tune_opt=4, then hardcoded to 0

valarLip

LGTM

Copilot AI review requested due to automatic review settings October 22, 2025 08:43

Copilot AI reviewed Oct 22, 2025

hsa/gfx942/fmha_v3_fwd/codegen.py (resolved)

hsa/gfx950/fmha_v3_fwd/codegen.py (resolved)

hsa/gfx942/fmha_v3_fwd/codegen.py (resolved)

minmengdie force-pushed the mmd/dev/fix_fwd_perf branch from 81b943c to 8c30bed on October 23, 2025 01:23

fix fwd v3 kernel perf and opt err

e439d24

minmengdie force-pushed the mmd/dev/fix_fwd_perf branch from 8c30bed to e439d24 on October 23, 2025 01:39

minmengdie added 2 commits October 23, 2025 06:41

fix opt err

d0d9a9e

fix gfx950

620cc7c

valarLip previously approved these changes Oct 23, 2025

fix opt 5 err

ece6a7b

minmengdie dismissed valarLip’s stale review via ece6a7b October 24, 2025 05:47

fix gfx950 opt5

990ed39

valarLip approved these changes Oct 25, 2025
