
Contributor

@minmengdie minmengdie commented Oct 22, 2025

Motivation

Fix a performance gap: the v3 forward kernel was slower than the CK kernel.

Technical Details

  1. Group mode launches the kernel with a different grid dimension layout.
  2. The fwd v3 kernel co file is updated.
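
To make the first point concrete, here is a minimal Python sketch of how the same work can be mapped onto two different grid layouts. Everything in it is an assumption for illustration: the tile size `TILE_M`, the function names `grid_batch_mode` and `grid_group_mode`, and both axis orderings are hypothetical and are not taken from this PR's `codegen.py`. It also assumes that group mode means variable-length sequences, as in CK's FMHA example.

```python
from math import ceil

TILE_M = 256  # hypothetical number of query rows handled by one workgroup


def grid_batch_mode(batch, nheads, seqlen_q, tile_m=TILE_M):
    """Hypothetical batch-mode layout: (q tiles, heads, batch)."""
    return (ceil(seqlen_q / tile_m), nheads, batch)


def grid_group_mode(batch, nheads, max_seqlen_q, tile_m=TILE_M):
    """Hypothetical group-mode layout: (heads, q tiles, batch).

    Sequences may differ in length in group mode, so the tile count is
    sized from the longest sequence in this sketch.
    """
    return (nheads, ceil(max_seqlen_q / tile_m), batch)


def num_workgroups(grid):
    """Total number of workgroups launched for a 3D grid."""
    x, y, z = grid
    return x * y * z


# Shape taken from the benchmark command in the test plan (-b, -h, -s).
b, h, s = 32, 128, 1024
batch_grid = grid_batch_mode(b, h, s)
group_grid = grid_group_mode(b, h, s)
print(batch_grid, num_workgroups(batch_grid))
print(group_grid, num_workgroups(group_grid))
```

In this sketch both layouts launch the same number of workgroups; only the axis order differs, which can change which workgroups are scheduled close together. Whether that matches the layout chosen in this PR has to be checked against the generated launcher code.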

Test Plan

MI300/MI308/MI355:
./benchmark_mha_fwd -mode=1 -b=32 -h=128 -h_k=8 -s=1024 -d=128 -iperm=0 -operm=0 -prec=bf16 -lse=1 -kname=1 -mask=2 -v=0
bash smoke_test_fwd_v3.sh
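
For repeated runs across the three targets it can help to keep the benchmark arguments in one place. The sketch below only rebuilds the command line shown above from a dictionary; the helper name `build_benchmark_cmd` is hypothetical, and the flag values are copied from the test plan without interpreting them.

```python
import shlex

# Flag values copied verbatim from the test plan above.
BENCH_ARGS = {
    "mode": 1, "b": 32, "h": 128, "h_k": 8, "s": 1024, "d": 128,
    "iperm": 0, "operm": 0, "prec": "bf16", "lse": 1, "kname": 1,
    "mask": 2, "v": 0,
}


def build_benchmark_cmd(args, binary="./benchmark_mha_fwd"):
    """Return the argv list for the benchmark binary (hypothetical helper)."""
    return [binary] + [f"-{key}={value}" for key, value in args.items()]


cmd = build_benchmark_cmd(BENCH_ARGS)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the benchmark binary has been built.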

Test Result

Perf result: screenshots attached for MI300, MI308, and MI355 (images not included here).

Smoke test: screenshots attached for MI300, MI308, and MI355 (images not included here).

Submission Checklist

@Copilot Copilot AI review requested due to automatic review settings October 22, 2025 08:43
Contributor

Copilot AI left a comment


Pull Request Overview

This PR fixes a performance regression in the v3 FMHA (Fused Multi-Head Attention) implementation by introducing a new group-based kernel launch strategy. The changes replace the standard dispatcher with a group dispatcher that uses different grid/block dimensions to improve kernel launch efficiency.

Key changes:

  • Added launch_kernel_group method with optimized grid dimension calculations
  • Introduced fmha_fwd_v3_group_dispatcher function with modified tuning parameters
  • Updated all kernel launch call sites to use the new group dispatcher

Reviewed Changes

Copilot reviewed 2 out of 18 changed files in this pull request and generated 3 comments.

File | Description
--- | ---
hsa/gfx950/fmha_v3_fwd/codegen.py | Adds group-based kernel launcher for gfx950 with tune_opt=5 and architecture-specific block dimensions
hsa/gfx942/fmha_v3_fwd/codegen.py | Adds group-based kernel launcher for gfx942 with tune_opt=4, then hardcoded to 0


@minmengdie minmengdie force-pushed the mmd/dev/fix_fwd_perf branch from 81b943c to 8c30bed Compare October 23, 2025 01:23
@minmengdie minmengdie force-pushed the mmd/dev/fix_fwd_perf branch from 8c30bed to e439d24 Compare October 23, 2025 01:39
valarLip previously approved these changes Oct 23, 2025
Collaborator

@valarLip valarLip left a comment


LGTM
