[mxfp8 MoE training] Support mxfp8 all to all in expert parallel #1765
Conversation
For activation checkpointing with mxfp8 a2a, should we change this save list to mimic what happens in bf16?
https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/llama4/infra/parallelize.py#L42
Actually I have a question:
In the save list we don't have the quantized matmul. Wouldn't that have created an unfair comparison between bf16 and quantized runs before?
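For reference, a minimal sketch of what extending a selective-AC save list could look like, assuming a bf16 save list shaped like the one in the linked parallelize.py; the exact op names (in particular the scaled-matmul op) are assumptions, not taken from this PR:

```python
import torch

# Hypothetical selective activation checkpointing save list, modeled on the
# bf16 path linked above: outputs of these ops are saved instead of being
# recomputed in the backward pass.
_save_list = {
    torch.ops.aten.mm.default,  # bf16/fp32 matmul output
    torch.ops._c10d_functional.all_to_all_single.default,  # a2a comm output
}

# To keep bf16 and quantized runs comparable, the quantized matmul op would
# be saved as well (op name is an assumption; verify against the actual
# quantized matmul used by the mxfp8 path):
_save_list.add(torch.ops.aten._scaled_mm.default)
```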
Note that this is still an experimental feature.
"""

expert_parallel_a2a_dispatch_impl: Literal["default", "mxfp8"] = "default"
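For context, a rough sketch of how a Literal-typed knob like this could sit in the parallelism config dataclass; the dataclass name and the companion combine field mirror the constructor shown later in this thread, but are assumptions about the final shape:

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class Parallelism:
    # Which all-to-all implementation expert parallel uses for token dispatch
    # and combine: "mxfp8" routes the a2a through torchao's mxfp8 kernel,
    # "default" keeps the bf16 functional collective.
    expert_parallel_a2a_dispatch_impl: Literal["default", "mxfp8"] = "default"
    expert_parallel_a2a_combine_impl: Literal["default", "mxfp8"] = "default"
```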
To be consistent with how low precision is configured elsewhere in torchtitan, let's put these under job_config.quantize.
Won't that cause a conflict with the EP a2a impl in Parallelism once you add the NVSHMEM impl? We could have the quantize a2a impl override the Parallelism a2a impl, but that may be unclear to users. What do you think?
| """ | ||
|
|
||
| def __init__( | ||
| self, a2a_dispatch_impl: str = "default", a2a_combine_impl: str = "default" |
Instead of adding configs to the constructor, a more object-oriented way would be:
- Refactor ExpertParallel to have self._all_to_all_dispatch_fn and self._all_to_all_combine_fn, both with default value all_to_all_single_autograd.
- Let MXExpertParallel inherit ExpertParallel, whose constructor sets those variables to use mxfp8 depending on the mxfp8 config. The class should sit under the quantize/mx folder.
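A minimal sketch of that shape, assuming the attribute and class names from the comment above; the real ExpertParallel in torchtitan does much more than this, and the torchao import path shown is a guess:

```python
from torch.distributed._functional_collectives import all_to_all_single_autograd


class ExpertParallel:
    def __init__(self):
        # Differentiable a2a used for token dispatch and combine; subclasses
        # may override these with quantized variants.
        self._all_to_all_dispatch_fn = all_to_all_single_autograd
        self._all_to_all_combine_fn = all_to_all_single_autograd


class MXExpertParallel(ExpertParallel):
    """Would live under the quantize/mx folder and swap in the mxfp8 a2a."""

    def __init__(self):
        super().__init__()
        # Import path is an assumption; check torchao for the actual location.
        from torchao.prototype.moe_training.kernels import to_mxfp8_a2a_dequant

        self._all_to_all_dispatch_fn = to_mxfp8_a2a_dequant
        self._all_to_all_combine_fn = to_mxfp8_a2a_dequant
```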
@tianyu-l I've been using AC=none for my benchmarks, but yes we need to update this list for mxfp8 MoE training.

Closing in favor of #1912 to avoid nasty merge conflicts during the refactor.
Summary
"default"or"mxfp8"impl"mxfp8"impl uses torchao's newto_mxfp8_a2a_dequant, which has the exact same API as functional collectiveall_to_all_single_autogradand is differentiable, so it can be used as a drop-in replacement for the default a2a impl.to_mxfp8_a2a_dequantworks as follows:Performance
Performance
Single node benchmarks with 4xB200
Llama4 16e default configs; FSDP=4, EP=4; AC=none; compile=True; seq_len=8192; local_bs=8
Reduced num layers from 48 -> 2 to avoid OOM in single node setting
Debug model config:
Additional context on design/implementation choices
Additional background on motivation
- 30% of llama4 model profiled runtime is all2all comms
- 47% avg runtime devoted to MoE comms in profiled OSS models