[Kernel] Masked act_mul and fp8-quant Kernels for Batched MoE #19721

varun-sundar-rabindranath · 2025-06-17T02:26:06Z

Purpose

Optimization to reduce unnecessary compute.

For the batched MoEs we allocate tensors of shape [num_experts, max_tokens_per_expert, hidden_size]. On main we process all num_experts x max_tokens_per_expert x hidden_size elements, even though not every one of the max_tokens_per_expert rows of an expert holds a valid token, so the padded rows can be skipped. To this end, this PR adds masked (batched) versions of the silu_mul and per-token fp8 quant kernels.
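The masking idea can be sketched in plain Python as a reference for the silu_mul case. This is an illustrative sketch only, not the kernel added in this PR: the function name `masked_silu_and_mul`, the `expert_num_tokens` argument (number of valid rows per expert), and the choice to leave padded rows as zeros are all assumptions made for the example. It assumes the usual silu_mul convention where the last dimension holds the gate half followed by the up half.

```python
import math


def silu(x: float) -> float:
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


def masked_silu_and_mul(x, expert_num_tokens):
    """Reference masked silu_mul over nested lists.

    x: [num_experts][max_tokens_per_expert][2 * d]
    expert_num_tokens: hypothetical per-expert count of valid rows.
    Returns [num_experts][max_tokens_per_expert][d]; padded rows are
    left as zeros here (an assumption of this sketch).
    """
    out = []
    for expert_rows, n_valid in zip(x, expert_num_tokens):
        d = len(expert_rows[0]) // 2 if expert_rows else 0
        expert_out = [[0.0] * d for _ in expert_rows]
        # Only the first n_valid rows are computed; padded rows are skipped.
        for t in range(n_valid):
            row = expert_rows[t]
            expert_out[t] = [silu(row[i]) * row[d + i] for i in range(d)]
        out.append(expert_out)
    return out


# Two experts, two token slots each, 2 * d = 4.
x = [
    [[0.0, 1.0, 2.0, 3.0], [9.0, 9.0, 9.0, 9.0]],  # expert 0: 1 valid row
    [[9.0, 9.0, 9.0, 9.0], [9.0, 9.0, 9.0, 9.0]],  # expert 1: 0 valid rows
]
out = masked_silu_and_mul(x, expert_num_tokens=[1, 0])
```

With `expert_num_tokens=[1, 0]`, only one of the four rows is touched, which is the compute saving the masked kernels aim for.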

Test Plan

Unit tests

Local E2E testing

Commands:

DeepSeek V2 Lite:

VLLM_ALL2ALL_BACKEND="pplx" vllm serve deepseek-ai/DeepSeek-V2-Lite --trust-remote-code --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel --port 9020 --no-enable-prefix-caching

lm_eval --model local-completions --tasks gsm8k --model_args model=deepseek-ai/DeepSeek-V2-Lite,base_url=http://127.0.0.1:9020/v1/completions,num_concurrent=30,max_retries=1,tokenized_requests=False --limit 100 --seed 42

Qwen FP8:

VLLM_ALL2ALL_BACKEND="deepep_low_latency" VLLM_USE_DEEP_GEMM=1 vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel --port 9020 --no-enable-prefix-caching

lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://127.0.0.1:9020/v1/completions,num_concurrent=1,max_retries=1,tokenized_requests=False --limit 100 --seed 42

Test Result

DeepSeek V2 Lite

PR

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.27|±|0.0446|
| | |strict-match|5|exact_match|↑|0.27|±|0.0446|

main

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.27|±|0.0446|
| | |strict-match|5|exact_match|↑|0.27|±|0.0446|

Qwen FP8

PR

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.84|±|0.0368|
| | |strict-match|5|exact_match|↑|0.88|±|0.0327|

main

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.88|±|0.0327|
| | |strict-match|5|exact_match|↑|0.92|±|0.0273|

Note:

The lm_eval results become inconsistent when I use large num_concurrent values. This also happens on main. I have set it to 1 here to produce more consistent output.

(Optional) Documentation Update

Signed-off-by: Varun Sundar Rabindranath <[email protected]>


varun-sundar-rabindranath · 2025-06-19T16:25:14Z

Marking this as a draft. These kernels are not a priority at the moment, given that a masked-fused-act-mul-quant kernel exists in https://github.com/vllm-project/vllm/tree/ll_deepgemm_opt . We can revive this when needed.

Varun Sundar Rabindranath added 14 commits June 13, 2025 19:20

add batched silu mul (4d518d1)

Refactor per_token_group_quant (ea96ddd)

add batched per token quant (abcf846)

batched -> masked (50162ac)

batched_utils -> masked_kernels (9f13fb0)

batched -> masked (b82cbe5)

add masked silu-mul test (b2b365d)

fixes and add batched per-token-quant tests (57dc316)

relax silu mul tolerance (dcace53)

plugin masked kernels (7ba8335)

fix D blocking (06d28b2)

better testing (8f9cb3d)

make out_q optional (c98c2e2)

fixes (c20487e)

varun-sundar-rabindranath requested review from tlrmchlsmth, WoosukKwon, mgoin and robertgshaw2-redhat as code owners June 17, 2025 02:26

Varun Sundar Rabindranath added 5 commits June 16, 2025 19:49

add batched cuda silu and mul (2fb3d5f)

fixes (97bda02)

fixes (00ccbd4)

add batched impl tests (fc5bc04)

use cuda silu mul (67e76b5)

mergify bot added the qwen Related to Qwen models label Jun 18, 2025

varun-sundar-rabindranath mentioned this pull request Jun 18, 2025

[Misc] DeepSeek Decode Optimizations #19807

Closed

Varun Sundar Rabindranath added 2 commits June 19, 2025 08:40

update batched silu mul kernel (727fbe9)

update (ab6379e)

varun-sundar-rabindranath marked this pull request as draft June 19, 2025 16:25
