
Conversation

@wujingyue (Collaborator)

No description provided.

@github-actions

Description

  • Add multi-GPU support for MoE testing

  • Implement custom grouped matrix multiplication with FP4 quantization (a reference sketch follows this list)

  • Introduce distributed parallelism strategies for grouped linear layers

  • Port test_moe.py to support multi-device execution
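
For reviewers unfamiliar with the term, "grouped matrix multiplication" here means applying a different expert weight to each contiguous group of rows in a token tensor that has been packed by expert. Below is a minimal pure-PyTorch reference sketch of that semantics; the names, shapes, and the convention that offsets holds cumulative end rows per group are illustrative assumptions, not the PR's code.

    import torch

    def reference_grouped_mm(
        activation: torch.Tensor,  # [total_tokens, in_features], rows packed by expert
        weights: torch.Tensor,     # [num_groups, in_features, out_features]
        offsets: torch.Tensor,     # [num_groups], cumulative end row of each group
    ) -> torch.Tensor:
        out = activation.new_empty(activation.shape[0], weights.shape[-1])
        start = 0
        for g in range(weights.shape[0]):
            end = int(offsets[g])
            # Each expert's weight applies only to its own slice of tokens.
            out[start:end] = activation[start:end] @ weights[g]
            start = end
        return out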


Changes walkthrough 📝

Relevant files

Enhancement

conftest.py (tests/python/multidevice/conftest.py): Relax import check for multi-GPU testing (+3/-1)

  • Disable the nvfuser import assertion to allow dual imports in test_moe.py
  • Allow both nvfuser and nvfuser_direct to be imported simultaneously

Tests

test_moe.py (tests/python/multidevice/test_moe.py): Add multi-GPU MoE test with FP4 quantization (+580/-0)

  • Add a full multi-GPU MoE test implementation with distributed tensor support
  • Implement an FP4 quantized grouped matrix multiplication custom op
  • Create GroupedLinear modules with DTensor parallelism strategies (see the sketch after this walkthrough)
  • Add a Llama 4 Maverick MoE model with thunderfx testing
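
As a rough illustration of the DTensor parallelism strategy mentioned in the walkthrough, an expert-parallel weight can be sharded along its group dimension with torch.distributed.tensor (public since roughly torch 2.5; earlier releases expose it as torch.distributed._tensor). The mesh setup, shapes, and variable names below are assumptions for illustration, not the PR's actual GroupedLinear code.

    import os
    import torch
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import Shard, distribute_tensor

    # Assumes a torchrun launch with one process per GPU.
    world_size = int(os.environ["WORLD_SIZE"])
    mesh = init_device_mesh("cuda", (world_size,))

    num_groups, in_features, out_features = 16, 5120, 8192
    weight = torch.randn(num_groups, in_features, out_features, dtype=torch.bfloat16)

    # Shard the group (expert) dimension across the mesh: each rank stores a
    # contiguous slice of the experts' weights.
    dweight = distribute_tensor(weight, mesh, placements=[Shard(0)])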

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

Disabled Safety Check

The assertion preventing dual import of nvfuser and nvfuser_direct has been disabled, which could lead to module state conflicts or undefined behavior in multi-GPU setups.

    # The following check is disabled because test_moe.py imports both.
    # It imports nvfuser via Thunder and nvfuser_direct directly.
    # assert "nvfuser" not in sys.modules, "nvfuser is already imported"

Placeholder Implementation

The custom op nvfuser_f16a_nvfp4weight_scaled_grouped_mm currently returns the result of a fallback implementation using grouped_mm instead of performing the actual FP4-weighted computation, so the functionality is still incomplete.

    return grouped_mm(activation, dropme, offsets)
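
For context, a placeholder custom op of this shape is commonly registered through torch.library (torch >= 2.4) with the fallback as its eager body. The sketch below follows that pattern under stated assumptions: the "mylib" namespace and the grouped_mm helper are hypothetical stand-ins, not the test's actual registration.

    import torch

    def grouped_mm(activation, weight, offsets):
        # Reference per-group matmul; offsets are assumed to be cumulative end
        # rows per group, as in the sketch under Description.
        out = activation.new_empty(activation.shape[0], weight.shape[-1])
        start = 0
        for g in range(weight.shape[0]):
            end = int(offsets[g])
            out[start:end] = activation[start:end] @ weight[g]
            start = end
        return out

    # Hypothetical namespace; the real test may register the op differently.
    @torch.library.custom_op(
        "mylib::nvfuser_f16a_nvfp4weight_scaled_grouped_mm", mutates_args=()
    )
    def nvfuser_f16a_nvfp4weight_scaled_grouped_mm(
        activation: torch.Tensor,
        fp4_weight: torch.Tensor,
        weight_scaling_factor: torch.Tensor,
        global_scale: torch.Tensor,
        offsets: torch.Tensor,
        blockscale_offsets: torch.Tensor,
        problem_sizes: torch.Tensor,
        dropme: torch.Tensor,
    ) -> torch.Tensor:
        # Placeholder body: ignore the FP4 weight and its scales and fall back
        # to the unquantized weight ("dropme") until the real kernel lands.
        return grouped_mm(activation, dropme, offsets)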

Unused Parameters

Several parameters of nvfuser_f16a_nvfp4weight_scaled_grouped_mm and of its fake implementation are not used in the current logic, suggesting incomplete integration or missing implementation steps.

    def nvfuser_f16a_nvfp4weight_scaled_grouped_mm(
        activation: torch.Tensor,
        fp4_weight: torch.Tensor,
        weight_scaling_factor: torch.Tensor,
        global_scale: torch.Tensor,
        offsets: torch.Tensor,
        blockscale_offsets: torch.Tensor,
        problem_sizes: torch.Tensor,
        dropme: torch.Tensor,
    ) -> torch.Tensor:
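
Continuing the sketch above, the paired fake (meta) kernel typically consumes only the shape-bearing arguments, which is one reason several parameters look unused; whether the PR's fake implementation works this way is an assumption.

    # Fake (meta) kernel: lets torch.compile / tracing infer the output shape
    # and dtype without running the real computation. Only `activation` and
    # `dropme` are consulted here; the remaining arguments are intentionally
    # ignored.
    @nvfuser_f16a_nvfp4weight_scaled_grouped_mm.register_fake
    def _(
        activation,
        fp4_weight,
        weight_scaling_factor,
        global_scale,
        offsets,
        blockscale_offsets,
        problem_sizes,
        dropme,
    ):
        return activation.new_empty(activation.shape[0], dropme.shape[-1])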
