Fix v3's performance being worse than CK's #1237
base: main
Pull Request Overview
This PR fixes a performance regression in the v3 FMHA (Fused Multi-Head Attention) implementation by introducing a new group-based kernel launch strategy. The changes replace the standard dispatcher with a group dispatcher that uses different grid/block dimensions to improve kernel launch efficiency.
Key changes:
- Added `launch_kernel_group` method with optimized grid dimension calculations
- Introduced `fmha_fwd_v3_group_dispatcher` function with modified tuning parameters
- Updated all kernel launch call sites to use the new group dispatcher
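For readers unfamiliar with how grid dimensions are typically derived for an attention kernel, here is a minimal sketch of one common pattern: ceil-dividing the query sequence length by a tile size and spreading heads and batches across the remaining grid axes. It is not the code in this PR. The function name `group_launch_dims`, the tile size of 256 rows, and the 512 threads per block are all hypothetical placeholders chosen for illustration.

```python
from typing import NamedTuple, Tuple


class LaunchDims(NamedTuple):
    grid: Tuple[int, int, int]
    block: Tuple[int, int, int]


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return (a + b - 1) // b


def group_launch_dims(
    max_seqlen_q: int,
    num_heads: int,
    batch: int,
    tile_m: int = 256,             # hypothetical rows of Q handled per workgroup
    threads_per_block: int = 512,  # hypothetical workgroup size
) -> LaunchDims:
    """Illustrative grid/block calculation; NOT the PR's launch_kernel_group."""
    # One workgroup per Q tile, per head, per batch entry.
    grid = (ceil_div(max_seqlen_q, tile_m), num_heads, batch)
    block = (threads_per_block, 1, 1)
    return LaunchDims(grid, block)


dims = group_launch_dims(max_seqlen_q=1024, num_heads=128, batch=32)
print(dims.grid, dims.block)
```

In group (variable-length) mode the longest sequence usually sizes the first grid axis, and workgroups that fall past the end of a shorter sequence exit early. The actual ordering of the grid axes and the tile sizes used by the generated kernels may differ.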
Reviewed Changes
Copilot reviewed 2 out of 18 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| hsa/gfx950/fmha_v3_fwd/codegen.py | Adds group-based kernel launcher for gfx950 with tune_opt=5 and architecture-specific block dimensions |
| hsa/gfx942/fmha_v3_fwd/codegen.py | Adds group-based kernel launcher for gfx942 with tune_opt=4, then hardcoded to 0 |
81b943c to 8c30bed
8c30bed to e439d24
LGTM
Motivation
Fix v3's performance being worse than CK's.
Technical Details
Test Plan
MI300/MI308/MI355:
./benchmark_mha_fwd -mode=1 -b=32 -h=128 -h_k=8 -s=1024 -d=128 -iperm=0 -operm=0 -prec=bf16 -lse=1 -kname=1 -mask=2 -v=0
bash smoke_test_fwd_v3.sh
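To compare v3 against CK on equal terms, it can help to convert a measured kernel time into achieved TFLOPS. The sketch below uses the standard attention-forward estimate of `4 * batch * heads * seqlen_q * seqlen_k * head_dim` FLOPs (two matrix multiplies), halved for a causal mask. It assumes `-b`, `-h`, `-s`, and `-d` in the command above are batch size, query heads, sequence length, and head dimension, and that `-mask=2` selects a causal variant. The helper names are hypothetical, and the benchmark binary may count FLOPs differently.

```python
def attention_fwd_flops(batch: int, heads_q: int, seqlen_q: int,
                        seqlen_k: int, head_dim: int,
                        causal: bool = False) -> int:
    """Approximate forward FLOPs: Q @ K^T plus P @ V, 2 FLOPs per multiply-add."""
    flops = 4 * batch * heads_q * seqlen_q * seqlen_k * head_dim
    # A causal mask skips roughly half of the score matrix (approximation).
    return flops // 2 if causal else flops


def achieved_tflops(flops: int, elapsed_ms: float) -> float:
    """Convert a FLOP count and a time in milliseconds to TFLOPS."""
    return flops / (elapsed_ms * 1e-3) / 1e12


# Shape taken from the benchmark command; the number of KV heads (-h_k=8)
# does not change the FLOP count, only the amount of K/V data read.
flops = attention_fwd_flops(batch=32, heads_q=128, seqlen_q=1024,
                            seqlen_k=1024, head_dim=128, causal=True)
print(f"estimated forward FLOPs: {flops:.3e}")
```

Passing a measured time for each implementation to `achieved_tflops(flops, elapsed_ms)` gives numbers that can be compared directly across MI300, MI308, and MI355.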
Test Result
perf result:
MI300:
MI308:
MI355:
smoke test:
MI300:
MI308:
MI355:

Submission Checklist