
Conversation

@fegin (Contributor) commented Nov 7, 2025

Stack from ghstack (oldest at bottom):

When `full_dtensor` is True, the compute placements will be preserved. This means that `to_local()` won't be called in the FSDP-only case. The nD parallelism case (FSDP + TP) will error out, as we have not implemented that case yet.

This argument doesn't affect the current simple_fsdp. We have verified the `full_dtensor=True` case with the full DTensor skeleton PR, which will be published once it is ready.
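For reviewers skimming the diff, here is a minimal sketch of the intended gating; the helper name and structure are illustrative only, not the actual simple_fsdp code:

```python
import torch
from torch.distributed.tensor import DTensor


def _prepare_for_compute(
    param: DTensor, tp_enabled: bool, full_dtensor: bool = False
) -> torch.Tensor | DTensor:
    # Illustrative helper (not the real simple_fsdp code): shows how the new
    # flag is meant to gate the to_local() call described above.
    if not full_dtensor:
        # Current behavior: compute on the plain local tensor.
        return param.to_local()
    if tp_enabled:
        # nD parallelism (FSDP + TP) with full_dtensor is not implemented yet.
        raise NotImplementedError(
            "full_dtensor=True is not supported with FSDP + TP yet"
        )
    # FSDP-only: keep the DTensor so the compute placements are preserved.
    return param
```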

fegin added a commit that referenced this pull request Nov 7, 2025

ghstack-source-id: 43384a7
Pull-Request: #2002
meta-cla bot added the CLA Signed label Nov 7, 2025
fegin added a commit that referenced this pull request Nov 7, 2025

ghstack-source-id: b8a288a
Pull-Request: #2002
fegin added a commit that referenced this pull request Nov 7, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2002
* #2001
* #1995
* __->__ #1985

We are adding more actions to convert the raw inputs and labels.

1. The new CP can do the input/label/BlockMask sharding in this method.
2. The experimental full dtensor model can simply override this method without changing much of the Trainer code (see the sketch below).

This method is extracted from
#1857

Making this a standalone PR allows us to continue the two projects above without one blocking the other.
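As a rough illustration of the override point described above, the hook could look like this; the class and method names here are hypothetical stand-ins, not the exact API added in #1985:

```python
import torch


class Trainer:
    def convert_inputs_and_labels(
        self, inputs: torch.Tensor, labels: torch.Tensor
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Hypothetical extracted hook: the default Trainer returns the raw
        # batch unchanged, and the rest of the training loop never needs to
        # know how a subclass reshapes or reshards the batch.
        return inputs, labels


class ContextParallelTrainer(Trainer):
    def convert_inputs_and_labels(self, inputs, labels):
        # A CP subclass would shard inputs/labels (and the BlockMask) along
        # the sequence dimension here before returning them.
        return inputs, labels
```

Both the CP work and the full-dtensor experiment can then override this single method instead of forking the Trainer.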
fegin added a commit that referenced this pull request Nov 7, 2025

ghstack-source-id: 9f9efce
Pull-Request: #2002
fegin added a commit that referenced this pull request Nov 7, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2002
* #2001
* __->__ #1995

People are creating different train.py files and duplicating the `main`
function, but in reality they just want to use different Trainer
subclasses. This PR creates a main() in torchtitan/train.py to
deduplicate the code.
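A self-contained sketch of the deduplication idea; the config and Trainer below are stand-ins for torchtitan's real classes, shown only to illustrate the shape of the shared entry point:

```python
from dataclasses import dataclass


@dataclass
class JobConfig:
    # Stand-in for torchtitan's real job config.
    steps: int = 10


class Trainer:
    def __init__(self, config: JobConfig):
        self.config = config

    def train(self) -> None:
        for _ in range(self.config.steps):
            pass  # the real training step lives here


def main(trainer_cls: type[Trainer] = Trainer) -> None:
    # Shared entry point: experiments pass their Trainer subclass instead of
    # copy-pasting their own train.py/main().
    config = JobConfig()
    trainer = trainer_cls(config)
    trainer.train()


if __name__ == "__main__":
    main()
```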
@ruisizhang123 (Contributor) left a comment
I thought the full dtensor version is in compiler toolkit, isn't it? cc. @SherlockNoMad @yiming0416

Do we have a plan to migrate full dtensor to simplefsdp folder?

@yiming0416 (Contributor) replied:

> I thought the full dtensor version is in compiler toolkit, isn't it? cc. @SherlockNoMad @yiming0416
>
> Do we have a plan to migrate full dtensor to simplefsdp folder?

@ruisizhang123 Not really, DTensorizing inputs in compiler toolkit only applies to the tp submesh (code pointer: https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/compiler_toolkit/common_utils.py#L27-L33)

Also, we currently import the parallelize_fn directly from simple_fsdp in the compiler_toolkit experiments folder, so we are always using the same SimpleFSDP as the simple_fsdp experiments folder.
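For readers without the code pointer open, "DTensorizing inputs only on the tp submesh" amounts to something like the following sketch using the public DTensor API; the mesh names and sizes are assumptions, not the actual compiler_toolkit code:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Replicate

# Assumes launch via torchrun with 4 ranks: a 2 (dp) x 2 (tp) device mesh.
mesh = init_device_mesh("cpu", (2, 2), mesh_dim_names=("dp", "tp"))
tp_mesh = mesh["tp"]  # only this submesh is used for the inputs

local_inputs = torch.randn(8, 16)
# Wrap the per-rank batch as a DTensor replicated over the tp submesh only;
# the dp dimension of the mesh is not involved at all.
dtensor_inputs = DTensor.from_local(local_inputs, tp_mesh, [Replicate()])
print(dtensor_inputs.placements)
```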

```python
mp_policy: MixedPrecisionPolicy | None,
reshard_after_forward: bool,
reduction_divide_factor: float | None,
full_dtensor: bool = False,
```
@ruisizhang123 (Contributor) commented Nov 7, 2025:
Should we rename full_dtensor to something more explicit (e.g., is_input_dtensor)?

A contributor replied:
is_input_dtensor sounds like we are using full dtensor because the input is a DTensor, but the idea is the reverse: we are using full dtensor, so the input should be a full DTensor, as should the params. So I think full_dtensor or use_full_dtensor is OK for now. Eventually I think we should deprecate the non-full-dtensor paths.
