[Full DTensor] Add full_dtensor flag #2002
base: gh/fegin/27/base
Conversation
When `full_dtensor` is True, the `compute_placement` will be preserved. This means that `to_local()` won't be called for the FSDP-only case. The nD parallelism case (FSDP + TP) will error out, as we have not implemented that case yet. This argument doesn't affect the current simple_fsdp. We have verified the `full_dtensor=True` case with the full DTensor skeleton PR, which will be published once it is ready. ghstack-source-id: 43384a7 Pull-Request: #2002
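For intuition, here is a minimal sketch of what the flag controls, assuming a made-up `shard_param` helper and the public `torch.distributed.tensor` API; the actual simple_fsdp implementation is structured differently:

```python
# Hypothetical illustration only (not the actual simple_fsdp code): shows how a
# full_dtensor flag could gate the to_local() call when sharding a parameter.
import torch
from torch.distributed.device_mesh import DeviceMesh
from torch.distributed.tensor import DTensor, Shard, distribute_tensor


def shard_param(
    param: torch.Tensor,
    fsdp_mesh: DeviceMesh,
    tp_mesh: DeviceMesh | None = None,
    full_dtensor: bool = False,
) -> torch.Tensor | DTensor:
    if full_dtensor and tp_mesh is not None:
        # nD parallelism (FSDP + TP) with full_dtensor is not implemented yet.
        raise NotImplementedError("full_dtensor is not supported with FSDP + TP")

    dtensor = distribute_tensor(param, fsdp_mesh, [Shard(0)])
    if full_dtensor:
        # Preserve the compute placement: keep the parameter as a DTensor.
        return dtensor
    # Existing path: hand the plain local shard to the compute graph.
    return dtensor.to_local()
```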
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom): * #2002 * #2001 * #1995 * __->__ #1985 We are adding more actions to convert the raw inputs and labels. 1. The new CP can do the input/label/BlockMask sharding in this method. 2. The experimental full DTensor model can simply override this method without changing much Trainer code. This method is extracted from #1857. Making this a standalone PR allows us to continue the two projects above without one blocking the other.
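As a rough illustration of the pattern (the method and class names below are made up, not the ones added in the PR), the extracted hook lets a Trainer subclass reshape raw inputs and labels without touching the rest of the training loop:

```python
# Hypothetical names for illustration; the actual method extracted in #1985 may differ.
class Trainer:
    def post_process_inputs(self, inputs, labels):
        # Default behavior: pass raw inputs and labels through unchanged.
        return inputs, labels

    def train_step(self, inputs, labels):
        inputs, labels = self.post_process_inputs(inputs, labels)
        ...  # forward / backward / optimizer step elided


class FullDTensorTrainer(Trainer):
    def post_process_inputs(self, inputs, labels):
        # An experimental full-DTensor trainer could wrap inputs/labels as
        # DTensors here (or a CP trainer could shard the inputs/BlockMask),
        # without changing any other Trainer code.
        return inputs, labels
```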
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom): * #2002 * #2001 * __->__ #1995 People are creating different train.py files and duplicating the `main` function, but in reality they just want to use different Trainer subclasses. This PR creates a `main()` in torchtitan/train.py to deduplicate the code.
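A sketch of what such a shared entry point could look like; the class and helper names here are stand-ins, not the real torchtitan API:

```python
# Hypothetical sketch only: the shared entry point takes a Trainer class, so a
# downstream train.py passes its own subclass instead of re-implementing main().
from typing import Type


class BaseTrainer:
    """Stand-in for torchtitan's Trainer; the real class lives in torchtitan/train.py."""

    def train(self) -> None:
        print("running the default training loop")

    def close(self) -> None:
        print("cleaning up")


def main(trainer_cls: Type[BaseTrainer] = BaseTrainer) -> None:
    trainer = trainer_cls()
    try:
        trainer.train()
    finally:
        trainer.close()


# A custom train.py would only need:
#   class MyTrainer(BaseTrainer): ...
#   main(MyTrainer)
if __name__ == "__main__":
    main()
```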
I thought the full DTensor version is in the compiler toolkit, isn't it? cc @SherlockNoMad @yiming0416
Do we have a plan to migrate full DTensor to the simple_fsdp folder?
@ruisizhang123 Not really, DTensorizing inputs in the compiler toolkit only applies to the … Also, currently we directly import the …
    mp_policy: MixedPrecisionPolicy | None,
    reshard_after_forward: bool,
    reduction_divide_factor: float | None,
    full_dtensor: bool = False,
Should we rename `full_dtensor` to something more explicit (e.g., `is_input_dtensor`)?
`is_input_dtensor` sounds like we are using full DTensor because the input is a DTensor, but the idea should be the other way around: we are using full DTensor, so the input should be a full DTensor, as well as the params. So I think `full_dtensor` or `use_full_dtensor` is OK for now. Eventually I think we should deprecate the non-full-DTensor paths.