
Conversation

crcrpar (Collaborator) commented Oct 13, 2025

  • Use register_sharding for sharding propagation of the fp4 grouped matmul (see the sketch after this list)
  • Use distribute_tensor(..., src_data_rank=None) so the local tensor is used as the DTensor's data, since the packed fp4 dtype doesn't seem quite compatible with DTensor
  • Update the desired input layouts accordingly
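
A minimal sketch of the register_sharding piece, following PyTorch's documented pattern for torch.distributed.tensor.experimental.register_sharding; the op handle and the argument names (activations, fp4_weight, b_sf, offsets) are placeholders for illustration, not the actual custom op signature in the test.

from torch.distributed.tensor import Replicate
from torch.distributed.tensor.experimental import register_sharding

grouped_mm_op = ...  # OpOverload of the fp4 grouped mm custom op (placeholder)


@register_sharding(grouped_mm_op)
def grouped_mm_sharding(activations, fp4_weight, b_sf, offsets):
    # Each acceptable strategy is (output placements, per-input placements);
    # non-tensor arguments would take None instead of a placement. The packed
    # fp4 weight and its scale factors are kept replicated, per the notes above.
    all_replicate = (
        [Replicate()],
        [Replicate(), Replicate(), Replicate(), Replicate()],
    )
    return [all_replicate]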

@github-actions

Description

  • Register sharding strategy for fp4 grouped matmul

  • Use Replicate instead of Shard for fp4 weights

  • Update input layouts for grouped linear layers

  • Fix distributed tensor handling for packed fp4 data


Changes walkthrough 📝

Relevant files

Enhancement: tests/python/multidevice/test_moe.py (+111/-21)
Add sharding registration and fix fp4 tensor distribution

  • Import register_sharding for custom op sharding registration
  • Define sharding strategies for fp4 grouped mm with Replicate for fp4 tensors
  • Update _partition_fn to use Replicate() and src_data_rank=None for fp4_weight and b_sf (see the sketch after this list)
  • Extend input layouts to support 4 inputs with proper sharding annotations
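
A minimal sketch of the _partition_fn change described in the walkthrough, assuming a module that owns fp4_weight and b_sf and a 1-D device_mesh; requires_grad=False is an assumption for the non-differentiable packed dtype, and the full function in the PR does more.

import torch.nn as nn
from torch.distributed.tensor import Replicate, distribute_tensor


def _partition_fn(name, module, device_mesh):
    # Keep the packed fp4 weight and its scale factors replicated.
    # src_data_rank=None wraps each rank's existing local tensor instead of
    # broadcasting from rank 0, sidestepping DTensor data movement for the
    # packed fp4 dtype.
    module.fp4_weight = nn.Parameter(
        distribute_tensor(
            module.fp4_weight, device_mesh, [Replicate()], src_data_rank=None
        ),
        requires_grad=False,  # assumption: the packed fp4 tensor is not trained
    )
    module.b_sf = nn.Parameter(
        distribute_tensor(
            module.b_sf, device_mesh, [Replicate()], src_data_rank=None
        ),
        requires_grad=False,  # assumption, as above
    )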

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review

    Possible Issue

    The sharding strategy for dropme in nvfuser_grouped_mm_sharding uses Shard(2), which assumes the tensor has at least three dimensions; if dropme has fewer, this placement will fail at runtime.

    Shard(2),  # dropme sharded on output feature dimension (last dim)
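
    If the intent is "the last dimension", deriving the shard dim from the tensor's rank would avoid the hard-coded assumption; a hedged one-line alternative (names mirror the snippet above):

    Shard(dropme.ndim - 1),  # last dim; equal to Shard(2) only when dropme is 3-D
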
    Inconsistent Layout Update

    The input_layouts and desired_input_layouts are updated to include four inputs, but it should be verified that all call sites and consumers of these layouts are compatible with the new four-element tuple structure.

    self.input_layouts = input_layouts or (
        Shard(-1),
        Replicate(),
        Replicate(),
        Replicate(),
    )
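
    A hypothetical sanity check (not in the PR) that would catch a mismatch early, e.g. at the end of __init__ alongside the tuples above:

    assert len(self.input_layouts) == len(self.desired_input_layouts) == 4, (
        "input_layouts and desired_input_layouts must both describe 4 inputs"
    )
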
    Redundant Replication

    The _partition_fn methods replicate fp4_weight and b_sf using distribute_tensor with Replicate() and src_data_rank=None, but it should be confirmed whether this replication is necessary given that the data is already present on each rank and not being sharded.

    module.fp4_weight = nn.Parameter(
        distribute_tensor(
            module.fp4_weight, device_mesh, [Replicate()], src_data_rank=None
        ),
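
    For reference on the question above: with src_data_rank=None, distribute_tensor skips the broadcast from a source rank and simply wraps each rank's existing local tensor, so the call annotates the data as replicated rather than moving it. A rough sketch of that equivalence, reusing the names from the snippet (DTensor.from_local is shown only for illustration, it is not what the PR uses):

    from torch.distributed.tensor import DTensor, Replicate, distribute_tensor

    # No broadcast from rank 0; the tensor already on this rank becomes its
    # replica of the resulting DTensor.
    replicated = distribute_tensor(
        module.fp4_weight, device_mesh, [Replicate()], src_data_rank=None
    )

    # Roughly equivalent: wrap the local tensor directly, with no cross-rank
    # communication or consistency check.
    also_replicated = DTensor.from_local(
        module.fp4_weight, device_mesh, [Replicate()], run_check=False
    )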
