Conversation

@xmfan (Member) commented Nov 4, 2025

FIXES #1935

Stacked PRs:

tlparse: https://fburl.com/sqxd6c0w


Workaround AC HOP mutation issue when tracing token dispatch

```bash
TORCH_COMPILE_FORCE_DISABLE_CACHES=1 HF_TOKEN=<token> HF_HUB_DISABLE_XET=1 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" with-proxy ./run_train.sh --model.name simple_fsdp.deepseek_v3
```

This is a problem for SimpleFSDP, where we want to capture the entire model in a single full graph; these "mutations" cause graph breaks.

It is less of a problem outside SimpleFSDP, because we don't currently compile token dispatch there.
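
For context, a minimal sketch of the failure pattern and the shape of the workaround (the `Dispatch` module, buffer, and body below are illustrative assumptions, not the actual torchtitan code): mutating captured state inside a checkpointed region runs into the AC HOP's mutation restrictions under `fullgraph=True`, while computing the value inside the region and mutating outside keeps the region functional.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Dispatch(torch.nn.Module):
    # Hypothetical stand-in for a token-dispatch module.
    def __init__(self):
        super().__init__()
        self.register_buffer("counts", torch.zeros(8, dtype=torch.int64))

    def forward(self, x, indices):
        def region(x):
            # Mutating the captured buffer here, e.g.
            #   self.counts.add_(torch.bincount(indices, minlength=8))
            # hits the AC HOP mutation restriction under fullgraph=True.
            # Functional alternative: compute and return instead.
            counts = torch.bincount(indices, minlength=8)
            return x * 2, counts

        out, counts = checkpoint(region, x, use_reentrant=False)
        self.counts += counts  # mutation hoisted outside the HOP
        return out
```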

@meta-cla bot added the CLA Signed label Nov 4, 2025
@xmfan marked this pull request as ready for review November 5, 2025 04:03
@xmfan requested a review from @ruisizhang123 November 5, 2025 04:03
Comment on lines +146 to +149:

```python
input_shape,
permuted_indices,
input_splits,
output_splits,
```
Contributor commented:
These shouldn't be exposed to single-device model code. Plus, I don't think it will work if EP is not used.

If it's getting too hard, maybe we should use local_map / to_local to re-implement MoE.
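
For reference, a rough sketch of the `local_map` direction (placements, names, and the function body are illustrative assumptions, not the actual MoE implementation):

```python
import torch
from torch.distributed.tensor import Shard
from torch.distributed.tensor.experimental import local_map

# Hypothetical token-dispatch body that operates on plain local tensors,
# so mutations and collectives stay below the DTensor layer.
def _token_dispatch(x, indices):
    # ... permutation / all-to-all on local shards would go here ...
    return x

# local_map unwraps DTensor inputs to their local shards, runs the body,
# and re-wraps the output with the given placements (assumed Shard(0) here;
# indices is assumed to be a plain tensor, hence the None placement).
token_dispatch = local_map(
    _token_dispatch,
    out_placements=(Shard(0),),
    in_placements=((Shard(0),), None),
)
```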

@ruisizhang123 (Contributor) commented:
Thank you for the fix! Do you think it would require fewer user-side changes if we reimplemented `apply_ac` as a graph pass?

```diff
 max_norm = 1.0 # grad norm clipping
 steps = 1000
-dataset = "c4" # supported datasets: c4_test (2K), c4 (177M)
+dataset = "c4_test" # supported datasets: c4_test (2K), c4 (177M)
```
Contributor commented:

minor: toml file config changed.

@xmfan (Member, Author) commented Nov 7, 2025

This change is only needed if you use `compile(torch.utils.checkpoint(...))`, so a graph pass wouldn't need it. But if you support both eager and graph-based AC, you will need this change again.
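
Concretely, the case in question is eager activation checkpointing traced by the compiler, roughly (a minimal sketch, assuming a toy `block`):

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    return torch.relu(x @ x.T)

@torch.compile(fullgraph=True)
def fwd(x):
    # Eager torch.utils.checkpoint captured as an AC HOP by the compiler;
    # this is the path the workaround targets.
    return checkpoint(block, x, use_reentrant=False)

fwd(torch.randn(4, 4, requires_grad=True))
```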

@xmfan marked this pull request as draft November 7, 2025 22:04
@ruisizhang123 (Contributor) commented Nov 7, 2025

> This change is only needed if you use `compile(torch.utils.checkpoint(...))`, so a graph pass wouldn't need it. But if you support both eager and graph-based AC, you will need this change again.

Yes, what I meant is that if we're going for a compiler-based approach to distributed parallelism in SimpleFSDP, it would make sense to have a specialized `apply_ac` function that's also compiler-based (and users would not be allowed to use eager checkpoint to implement AC).
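
For illustration, one possible shape of a graph-pass `apply_ac` (an FX sketch under assumed names; an actual torchtitan pass, if built, would operate on its own graph representation): rewrite selected `call_module` nodes to go through `checkpoint`, so users never wrap modules themselves.

```python
import torch
import torch.fx as fx
from torch.utils.checkpoint import checkpoint

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(8, 8)

    def forward(self, x):
        return torch.relu(self.lin(x))

def apply_ac_pass(gm: fx.GraphModule, targets=("lin",)) -> fx.GraphModule:
    # Rewrite matching call_module nodes into checkpointed calls.
    for node in list(gm.graph.nodes):
        if node.op == "call_module" and node.target in targets:
            with gm.graph.inserting_before(node):
                submod = gm.graph.get_attr(node.target)
                new_node = gm.graph.call_function(
                    checkpoint,
                    (submod, *node.args),
                    {"use_reentrant": False},
                )
            node.replace_all_uses_with(new_node)
            gm.graph.erase_node(node)
    gm.recompile()
    return gm

gm = apply_ac_pass(fx.symbolic_trace(Model()))
out = gm(torch.randn(2, 8, requires_grad=True))
```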

Successfully merging this pull request may close these issues: SimpleFSDP AC HOP mutation issue when tracing token dispatch