Use FusedMoEQuantConfig everywhere #19921

bnellnm · 2025-06-20T20:49:10Z

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Test Plan

Test Result

(Optional) Documentation Update

Signed-off-by: Bill Nell <[email protected]>

github-actions · 2025-06-20T20:49:20Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Summary of Changes

Hello @bnellnm, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request undertakes a significant refactoring of the Mixture of Experts (MoE) implementation to centralize and standardize how quantization configurations are handled. By introducing a dedicated FusedMoEQuantConfig class, it streamlines parameter passing across various MoE kernels and their modular components. This change enhances code maintainability, improves testability, and lays the groundwork for more robust quantization support within the MoE system.

Highlights

Quantization Configuration Centralization: Introduced a new FusedMoEQuantConfig class to centralize and standardize quantization parameters across all Mixture of Experts (MoE) kernels. This replaces numerous individual boolean flags (e.g., use_fp8_w8a8, per_channel_quant) with a single, structured configuration object.
Modular MoE Architecture Enhancement: The modular MoE kernel architecture has been significantly refactored to consistently utilize FusedMoEQuantConfig. This includes updates to FusedMoEPrepareAndFinalize and FusedMoEPermuteExpertsUnpermute base classes, and their concrete implementations (e.g., TritonExperts, DeepGemmExperts, PplxPrepareAndFinalize, DeepEPPrepareAndFinalize).
Improved Quantization Test Coverage: New dedicated test files (tests/kernels/moe/test_block_fp8.py and tests/kernels/moe/test_block_int8.py) have been added to thoroughly test block-wise FP8 and INT8 quantization for fused MoE operations, including integration with DeepGemm kernels.
Standardized MoE Test Utilities: A new utility file (tests/kernels/moe/utils.py) has been introduced, providing helper functions like make_quantized_test_activations, make_test_weights, triton_moe, and batched_moe to streamline and standardize MoE kernel testing with quantization.
Refactored Quantization Helpers: Core quantization helper functions, such as moe_kernel_quantize_input and native_w8a8_block_matmul, have been updated and/or moved to tests/kernels/quant_utils.py to support the new FusedMoEQuantConfig and improve code organization.
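
The consolidation described in the first highlight can be sketched roughly as follows. This is a minimal illustration and not the class from the PR: the name `QuantConfigSketch`, its fields, its properties, and the `from_legacy_flags` helper are all assumptions made for this example, and dtypes are plain strings so the snippet runs without PyTorch. Only the flag names `use_fp8_w8a8` and `per_channel_quant` come from the summary above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class QuantConfigSketch:
    """Hypothetical stand-in for one structured quantization config object."""

    # Quantized dtype as a string, e.g. "fp8"; None means no quantization.
    quant_dtype: Optional[str] = None
    # Whether scales are per channel rather than per tensor.
    per_channel_quant: bool = False
    # Block dimensions for block-wise quantization, or None if not block-wise.
    block_shape: Optional[Tuple[int, int]] = None

    @property
    def is_quantized(self) -> bool:
        return self.quant_dtype is not None

    @property
    def is_block_quantized(self) -> bool:
        return self.block_shape is not None

    @classmethod
    def from_legacy_flags(
        cls,
        use_fp8_w8a8: bool = False,
        per_channel_quant: bool = False,
        block_shape: Optional[Sequence[int]] = None,
    ) -> "QuantConfigSketch":
        """Translate loose keyword flags into a single config object."""
        return cls(
            quant_dtype="fp8" if use_fp8_w8a8 else None,
            per_channel_quant=per_channel_quant,
            block_shape=tuple(block_shape) if block_shape is not None else None,
        )
```

A kernel entry point written this way would take one `QuantConfigSketch` argument instead of several independent booleans, so adding a new quantization mode changes the config type rather than every call signature.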

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

The pull request focuses on using FusedMoEQuantConfig everywhere, which involves removing redundant parameters and ensuring consistency in quantization configurations across different MoE implementations. The changes include modifications in test files to accommodate the new configuration and ensure correct functionality with various quantization settings. The code review identified areas for improved documentation, consistency, and clarity, particularly regarding the use of environment variables and the rationale behind certain design choices.

gemini-code-assist · 2025-06-20T20:52:50Z

tests/kernels/moe/test_batched_moe.py

+    in_dtype: torch.dtype
+    quant_dtype: Optional[torch.dtype]
+    out_dtype: torch.dtype


Consider adding docstrings to explain the purpose of each field in the BatchedMMConfig dataclass.
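
One possible shape for the requested documentation is sketched below. The three field names and types come from the diff above; the descriptions are assumptions inferred from those names and would need to be confirmed against the test code. The annotations are written as strings and `torch` is imported only for type checkers, a choice made here so the sketch runs without PyTorch installed.

```python
from dataclasses import dataclass
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:  # only needed by type checkers in this sketch
    import torch


@dataclass
class BatchedMMConfig:
    """Dtype settings for one batched matmul test case.

    Attributes (descriptions are assumed, not taken from the PR):
        in_dtype: dtype of the input tensors handed to the test.
        quant_dtype: dtype the inputs are quantized to, or None when the
            case runs without quantization.
        out_dtype: dtype of the result tensor.
    """

    in_dtype: "torch.dtype"
    quant_dtype: "Optional[torch.dtype]"
    out_dtype: "torch.dtype"
```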

gemini-code-assist · 2025-06-20T20:52:50Z

tests/kernels/moe/test_batched_moe.py

+            "BLOCK_SIZE_K": 16 if dtype.itemsize > 1 else 32
+        },


The block size is hardcoded here. Consider adding it as a parameter to the test function.
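
A rough sketch of what this suggestion could look like follows. The helper name `make_block_config` and its `block_size_k` parameter are hypothetical, and the real config dict in the test presumably carries more keys than the one shown in the diff. The default branch reproduces the expression from the diff, using the element size in bytes in place of `dtype.itemsize`.

```python
from typing import Dict, Optional


def make_block_config(itemsize: int, block_size_k: Optional[int] = None) -> Dict[str, int]:
    """Build the BLOCK_SIZE_K entry, letting a test override the default.

    itemsize is the element size in bytes (dtype.itemsize in the diff).
    """
    if block_size_k is None:
        # Same default as the diff: 16 for multi-byte dtypes, 32 for 1-byte dtypes.
        block_size_k = 16 if itemsize > 1 else 32
    return {"BLOCK_SIZE_K": block_size_k}
```

A test could then pass `block_size_k` through as one of its parameters and fall back to the current behaviour when the argument is omitted.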

gemini-code-assist · 2025-06-20T20:52:50Z

tests/kernels/moe/test_cutlass_moe.py

+    # These sizes trigger wrong answers.
+    #(7232, 2048, 5120),
+    #(40000, 2048, 5120),


Consider adding a comment to explain why these sizes trigger wrong answers.

bnellnm added 30 commits June 16, 2025 17:39

3d288bf  turn try_get_optimal_moe_config into an op so it can be torch.compiled
385e0c5  lint
c98ffbe  torch.compile tests
c1c362a  add tests
776ad95  add compiler + cudagraph tests
961b5e8  tests
bd9bd37  reduce number of compile/cudagraph tests
23f26c9  lint
5d564f6  fix lint
06b4583  fix lint
463ccaa  replace import that lint removed
4ab6af7  fixes
695203d  lint
287a204  opify at a higher level
1c9fd39  de-opify deepgemm kernels
79a1962  remove cruft
07d3aae  MoE refactoring
847ec16  make FusedMoEModularKernel a Leaf
5859222  make FusedMoEModularKernel a Leaf
10137bb  fix format
c73d6ba  config stuff + add more tests
230a1fe  fixes
782c3a0  wip test
1bae03b  fix mergea
7a95679  disable buggy fp8 tests
5e22409  fixes
12e42ea  more lint
0b2f817  more lint
4fdeb70  merge
6b4e406  fix merge

bnellnm added 8 commits June 18, 2025 21:50

f1572d1  fix deep gemm test
4c35a6c  add supports_expert_map method + cleanup select_gemm_impl methods
69f878b  lint
df3a90e  revert random linter changes
b9046e7  fix comments + lint
875a9c4  remove some logging
ebb9e13  remove unused method
e79b40a  try to fix lint

mergify bot added the rocm (Related to AMD ROCm) label Jun 20, 2025

gemini-code-assist bot reviewed Jun 20, 2025

bnellnm added 2 commits June 20, 2025 21:27

b5d7cba  add some asserts to make lint happy
1b57e9d  Use FusedMoEQuantConfig everywhere

bnellnm force-pushed the moe-quant-config branch from e86122b to 1b57e9d on June 20, 2025 21:31
