[Full DTensor] Add full_dtensor flag #2002
base: gh/fegin/27/base
Conversation
When `full_dtensor` is True, the `compute_placement` will be preserved. This means that `to_local()` won't be called for the FSDP-only case. The nD parallelism case (FSDP + TP) will error out, as we have not implemented that case yet. This argument doesn't affect the current simple_fsdp. We have verified the `full_dtensor=True` case with the full DTensor skeleton PR, which will be published once it is ready. ghstack-source-id: 43384a7 Pull-Request: #2002
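For intuition, here is a minimal sketch of what the flag controls, assuming a made-up `shard_param` helper and the public `torch.distributed.tensor` API; the actual simple_fsdp implementation is structured differently:

```python
# Hypothetical illustration only (not the actual simple_fsdp code): shows how a
# full_dtensor flag could gate the to_local() call when sharding a parameter.
import torch
from torch.distributed.device_mesh import DeviceMesh
from torch.distributed.tensor import DTensor, Shard, distribute_tensor


def shard_param(
    param: torch.Tensor,
    fsdp_mesh: DeviceMesh,
    tp_mesh: DeviceMesh | None = None,
    full_dtensor: bool = False,
) -> torch.Tensor | DTensor:
    if full_dtensor and tp_mesh is not None:
        # nD parallelism (FSDP + TP) with full_dtensor is not implemented yet.
        raise NotImplementedError("full_dtensor is not supported with FSDP + TP")

    dtensor = distribute_tensor(param, fsdp_mesh, [Shard(0)])
    if full_dtensor:
        # Preserve the compute placement: keep the parameter as a DTensor.
        return dtensor
    # Existing path: hand the plain local shard to the compute graph.
    return dtensor.to_local()
```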
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom): * #2002 * #2001 * #1995 * __->__ #1985 We are adding more actions to convert the raw inputs and labels. 1. The new CP can do the input/label/BlockMask sharding in this method. 2. The experimental full DTensor model can simply override this method without changing much Trainer code. This method is extracted from #1857. Making this a standalone PR allows us to continue the two projects above without one blocking the other.
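As a rough illustration of the pattern (the method and class names below are made up, not the ones added in the PR), the extracted hook lets a Trainer subclass reshape raw inputs and labels without touching the rest of the training loop:

```python
# Hypothetical names for illustration; the actual method extracted in #1985 may differ.
class Trainer:
    def post_process_inputs(self, inputs, labels):
        # Default behavior: pass raw inputs and labels through unchanged.
        return inputs, labels

    def train_step(self, inputs, labels):
        inputs, labels = self.post_process_inputs(inputs, labels)
        ...  # forward / backward / optimizer step elided


class FullDTensorTrainer(Trainer):
    def post_process_inputs(self, inputs, labels):
        # An experimental full-DTensor trainer could wrap inputs/labels as
        # DTensors here (or a CP trainer could shard the inputs/BlockMask),
        # without changing any other Trainer code.
        return inputs, labels
```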
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom): * #2002 * #2001 * __->__ #1995 People are creating different train.py files and duplicating the `main` function, but in reality they just want to use different Trainer subclasses. This PR creates a `main()` in torchtitan/train.py to deduplicate the code.
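A sketch of what such a shared entry point could look like; the class and helper names here are stand-ins, not the real torchtitan API:

```python
# Hypothetical sketch only: the shared entry point takes a Trainer class, so a
# downstream train.py passes its own subclass instead of re-implementing main().
from typing import Type


class BaseTrainer:
    """Stand-in for torchtitan's Trainer; the real class lives in torchtitan/train.py."""

    def train(self) -> None:
        print("running the default training loop")

    def close(self) -> None:
        print("cleaning up")


def main(trainer_cls: Type[BaseTrainer] = BaseTrainer) -> None:
    trainer = trainer_cls()
    try:
        trainer.train()
    finally:
        trainer.close()


# A custom train.py would only need:
#   class MyTrainer(BaseTrainer): ...
#   main(MyTrainer)
if __name__ == "__main__":
    main()
```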
I thought the full DTensor version is in the compiler toolkit, isn't it? cc @SherlockNoMad @yiming0416
Do we have a plan to migrate full DTensor to the simple_fsdp folder?
@ruisizhang123 Not really, DTensorizing inputs in the compiler toolkit only applies to the … Also, currently we directly import the …
    mp_policy: MixedPrecisionPolicy | None,
    reshard_after_forward: bool,
    reduction_divide_factor: float | None,
    full_dtensor: bool = False,
Should we rename `full_dtensor` to something more explicit (e.g., `is_input_dtensor`)?
`is_input_dtensor` sounds like we are using full DTensor because the input is a DTensor, but the idea should be the other way around: we are using full DTensor, so the input should be a full DTensor, as well as the params. So I think `full_dtensor` or `use_full_dtensor` is OK for now. Eventually I think we should deprecate the non-full-DTensor paths.