
Conversation

@wujingyue (Collaborator)

No description provided.

@github-actions

Description

  • Add multi-GPU support for MoE testing

  • Implement custom grouped matrix multiplication with FP4 quantization (a reference sketch follows this list)

  • Introduce distributed parallelism strategies for grouped linear layers

  • Port test_moe.py to support multi-device execution
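
For reviewers unfamiliar with the term, "grouped matrix multiplication" here means applying a different expert weight to each contiguous group of rows in a token tensor that has been packed by expert. Below is a minimal pure-PyTorch reference sketch of that semantics; the names, shapes, and the convention that offsets holds cumulative end rows per group are illustrative assumptions, not the PR's code.

    import torch

    def reference_grouped_mm(
        activation: torch.Tensor,  # [total_tokens, in_features], rows packed by expert
        weights: torch.Tensor,     # [num_groups, in_features, out_features]
        offsets: torch.Tensor,     # [num_groups], cumulative end row of each group
    ) -> torch.Tensor:
        out = activation.new_empty(activation.shape[0], weights.shape[-1])
        start = 0
        for g in range(weights.shape[0]):
            end = int(offsets[g])
            # Each expert's weight applies only to its own slice of tokens.
            out[start:end] = activation[start:end] @ weights[g]
            start = end
        return out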


Changes walkthrough 📝

Relevant files

Enhancement

conftest.py (tests/python/multidevice/conftest.py): Relax import check for multi-GPU testing (+3/-1)

  • Disable the nvfuser import assertion to allow dual imports in test_moe.py
  • Allow both nvfuser and nvfuser_direct to be imported simultaneously

Tests

test_moe.py (tests/python/multidevice/test_moe.py): Add multi-GPU MoE test with FP4 quantization (+580/-0)

  • Add a full multi-GPU MoE test implementation with distributed tensor support
  • Implement an FP4 quantized grouped matrix multiplication custom op
  • Create GroupedLinear modules with DTensor parallelism strategies (see the sketch after this walkthrough)
  • Add a Llama 4 Maverick MoE model with thunderfx testing
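
As a rough illustration of the DTensor parallelism strategy mentioned in the walkthrough, an expert-parallel weight can be sharded along its group dimension with torch.distributed.tensor (public since roughly torch 2.5; earlier releases expose it as torch.distributed._tensor). The mesh setup, shapes, and variable names below are assumptions for illustration, not the PR's actual GroupedLinear code.

    import os
    import torch
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import Shard, distribute_tensor

    # Assumes a torchrun launch with one process per GPU.
    world_size = int(os.environ["WORLD_SIZE"])
    mesh = init_device_mesh("cuda", (world_size,))

    num_groups, in_features, out_features = 16, 5120, 8192
    weight = torch.randn(num_groups, in_features, out_features, dtype=torch.bfloat16)

    # Shard the group (expert) dimension across the mesh: each rank stores a
    # contiguous slice of the experts' weights.
    dweight = distribute_tensor(weight, mesh, placements=[Shard(0)])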

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

Disabled Safety Check

The assertion preventing dual import of nvfuser and nvfuser_direct has been disabled, which could lead to module state conflicts or undefined behavior in multi-GPU setups.

    # The following check is disabled because test_moe.py imports both.
    # It imports nvfuser via Thunder and nvfuser_direct directly.
    # assert "nvfuser" not in sys.modules, "nvfuser is already imported"

Placeholder Implementation

The custom op nvfuser_f16a_nvfp4weight_scaled_grouped_mm currently returns the result of a fallback implementation using grouped_mm instead of performing the actual FP4-weighted computation, so the functionality is still incomplete.

    return grouped_mm(activation, dropme, offsets)
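
For context, a placeholder custom op of this shape is commonly registered through torch.library (torch >= 2.4) with the fallback as its eager body. The sketch below follows that pattern under stated assumptions: the "mylib" namespace and the grouped_mm helper are hypothetical stand-ins, not the test's actual registration.

    import torch

    def grouped_mm(activation, weight, offsets):
        # Reference per-group matmul; offsets are assumed to be cumulative end
        # rows per group, as in the sketch under Description.
        out = activation.new_empty(activation.shape[0], weight.shape[-1])
        start = 0
        for g in range(weight.shape[0]):
            end = int(offsets[g])
            out[start:end] = activation[start:end] @ weight[g]
            start = end
        return out

    # Hypothetical namespace; the real test may register the op differently.
    @torch.library.custom_op(
        "mylib::nvfuser_f16a_nvfp4weight_scaled_grouped_mm", mutates_args=()
    )
    def nvfuser_f16a_nvfp4weight_scaled_grouped_mm(
        activation: torch.Tensor,
        fp4_weight: torch.Tensor,
        weight_scaling_factor: torch.Tensor,
        global_scale: torch.Tensor,
        offsets: torch.Tensor,
        blockscale_offsets: torch.Tensor,
        problem_sizes: torch.Tensor,
        dropme: torch.Tensor,
    ) -> torch.Tensor:
        # Placeholder body: ignore the FP4 weight and its scales and fall back
        # to the unquantized weight ("dropme") until the real kernel lands.
        return grouped_mm(activation, dropme, offsets)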

Unused Parameters

Several parameters of nvfuser_f16a_nvfp4weight_scaled_grouped_mm and of its fake implementation are not used in the current logic, suggesting incomplete integration or missing implementation steps.

    def nvfuser_f16a_nvfp4weight_scaled_grouped_mm(
        activation: torch.Tensor,
        fp4_weight: torch.Tensor,
        weight_scaling_factor: torch.Tensor,
        global_scale: torch.Tensor,
        offsets: torch.Tensor,
        blockscale_offsets: torch.Tensor,
        problem_sizes: torch.Tensor,
        dropme: torch.Tensor,
    ) -> torch.Tensor:
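
Continuing the sketch above, the paired fake (meta) kernel typically consumes only the shape-bearing arguments, which is one reason several parameters look unused; whether the PR's fake implementation works this way is an assumption.

    # Fake (meta) kernel: lets torch.compile / tracing infer the output shape
    # and dtype without running the real computation. Only `activation` and
    # `dropme` are consulted here; the remaining arguments are intentionally
    # ignored.
    @nvfuser_f16a_nvfp4weight_scaled_grouped_mm.register_fake
    def _(
        activation,
        fp4_weight,
        weight_scaling_factor,
        global_scale,
        offsets,
        blockscale_offsets,
        problem_sizes,
        dropme,
    ):
        return activation.new_empty(activation.shape[0], dropme.shape[-1])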
