Allow per-variable optimizer, add DispatchOptimizer. #21196
base: master
Conversation
Codecov Report
Additional details and impacted files:
@@            Coverage Diff             @@
## master #21196 +/- ##
===========================================
- Coverage 82.59% 61.64% -20.95%
===========================================
Files 564 565 +1
Lines 54408 54592 +184
Branches 8449 8487 +38
===========================================
- Hits 44937 33652 -11285
- Misses 7396 18860 +11464
- Partials 2075 2080 +5
- Adds a property `variable.optimizer` that defaults to `None`.
- Adds a `DispatchOptimizer` that scans the list of trainable variables during build, collects all unique per-variable optimizers, then dispatches the apply/stateless_apply call to the correct optimizer where applicable.
- Modifies the `trainer` so that, during the optimizer build stage, it checks whether any variables have a custom optimizer attached and, if so, inserts a `DispatchOptimizer` to handle them. This keeps the mechanism hidden from the user.

Context: for large embedding tables, we need special optimizers so that the tables can be updated in place rather than returning large gradients. The layer will handle setting the custom optimizers, but we need the trainer to be aware of them and dispatch the embedding tables to the appropriate optimizers.
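For concreteness, here is a rough sketch of what a `DispatchOptimizer` along these lines could look like. This is not the PR's actual implementation; the constructor arguments, grouping logic, and reliance on a `variable.optimizer` attribute are assumptions based on the description above.

```python
import keras


class DispatchOptimizer(keras.optimizers.Optimizer):
    """Sketch: route each variable's update to its attached optimizer.

    Variables carrying a non-None `optimizer` attribute (the property this
    PR proposes) are updated by that optimizer; everything else falls back
    to `default_optimizer`.
    """

    def __init__(self, default_optimizer, name="dispatch_optimizer", **kwargs):
        # The base class requires a learning rate; it is unused here since
        # all updates are delegated to the wrapped optimizers.
        super().__init__(learning_rate=0.0, name=name, **kwargs)
        self.default_optimizer = default_optimizer

    def build(self, variables):
        # Group variables by their per-variable optimizer (None -> default).
        self._groups = {}
        for v in variables:
            opt = getattr(v, "optimizer", None) or self.default_optimizer
            self._groups.setdefault(id(opt), (opt, []))[1].append(v)
        for opt, group in self._groups.values():
            opt.build(group)
        super().build(variables)

    def apply(self, grads, trainable_variables=None):
        # Simplification: assumes the variables are passed in explicitly and
        # skips the bookkeeping (iteration counter, gradient accumulation)
        # that the base apply() normally performs.
        grad_of = {id(v): g for g, v in zip(grads, trainable_variables)}
        for opt, group in self._groups.values():
            opt.apply([grad_of[id(v)] for v in group], group)
```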
Thanks for the PR. Could we avoid the changes in `trainer.py` and instead limit the impact to the optimizer? The optimizer could check whether its variables have an overridden optimizer, for instance.
We could also require the user to pass a custom optimizer to `compile()` for distributed embeddings to work. Maybe that's better?
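If the explicit route were taken, the user-facing side might look something like the snippet below, reusing the hypothetical `DispatchOptimizer` sketched earlier; none of this is an existing Keras API.

```python
import keras

# Hypothetical opt-in at compile() time: the user wraps their regular
# optimizer so that per-variable optimizers get dispatched correctly.
# DispatchOptimizer is the class sketched above, not a Keras built-in.
model = keras.Sequential([keras.layers.Dense(1)])
model.compile(
    optimizer=DispatchOptimizer(default_optimizer=keras.optimizers.Adam()),
    loss="mse",
)
```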
The optimizer can't replace itself, so we can't "insert" a `DispatchOptimizer` from the optimizer side. I was originally going to modify the base optimizer instead.
Yes, we could, but it requires the user to know about it. The base optimizer could throw an error if it encounters a variable with an optimizer attached, telling the user to use a `DispatchOptimizer`.
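A hedged sketch of what that guard rail in the base optimizer's `build()` might look like (not actual Keras code; the helper name and error message are illustrative):

```python
def _check_for_per_variable_optimizers(variables):
    # Variables carrying the proposed `optimizer` attribute can't be handled
    # by a plain optimizer, so fail loudly and point at DispatchOptimizer.
    offending = [v.path for v in variables if getattr(v, "optimizer", None) is not None]
    if offending:
        raise ValueError(
            "These variables have a per-variable optimizer attached and "
            f"cannot be applied by this optimizer: {offending}. "
            "Wrap your optimizer in a DispatchOptimizer instead."
        )
```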
Understood. How about:
If possible at all, we should also consider just automating the above -- when a model gets compiled, we check whether it has distributed embeddings, and if so we replace the optimizer with a `DispatchOptimizer`.
It's possible, but I wanted to keep it a bit more general than that, since there are other contexts in which we need to treat specific variables differently. We also don't have a concept of "distributed embedding" in core Keras, only in Keras RS.
This is exactly what I tried in this PR, except we can't detect "distributed embeddings" in core Keras.

Thinking out loud: maybe "optimizer" is the wrong word. What we really need is a way to treat "gradients" as auxiliary data rather than true gradients, and to let optimizers dispatch the variable update to a custom "updater" that takes in that auxiliary data. We also don't want that auxiliary data modified in any way, i.e. we want to avoid scaling by loss scaling or learning rates. And we don't want optimizers to create their own extra optimizer variables for these either.

Edit: though we may need the "iteration" (or step count) from the optimizer, since the "updater" may need this information to update internal state, unless we track that ourselves for every update call.
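To make the "updater" idea above concrete, here is one possible shape for such an interface. All names are illustrative assumptions; the key points are that the auxiliary data reaches the updater unscaled and that the optimizer's step count is passed along.

```python
from typing import Any, Protocol

import numpy as np

import keras


class VariableUpdater(Protocol):
    """Hypothetical interface: applies an in-place update from auxiliary data.

    The auxiliary data is whatever the layer's backward pass produced; it is
    not scaled by the learning rate or by loss scaling, and the updater
    creates no optimizer slot variables of its own. `step` is available for
    updaters that need to track internal state across calls.
    """

    def update(self, variable, auxiliary_data: Any, step: int) -> None:
        ...


class InPlaceRowUpdater:
    """Toy example: update only the touched rows of a large table."""

    def update(self, variable, auxiliary_data, step):
        rows, deltas = auxiliary_data  # e.g. row indices and precomputed deltas
        table = keras.ops.convert_to_numpy(variable)
        table[np.asarray(rows)] -= np.asarray(deltas)
        variable.assign(table)
```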
I like the generality, we can definitely add a generic variable attribute for this. Is "optimizer" the right abstraction though? Two issues I see:
I agree. It's not really that the variable owns the optimizer; this is just the most convenient way I could think of to attach the information, and there seemed to be precedent for attaching per-variable handling hints like this.

It's the layer that knows that certain variables it owns need special handling. In this case, the layer knows that these large embedding tables can't use a traditional optimizer. The layer needs some way to tell the model how to handle them, specifically that the table variables need special optimizers. Adding the optimizer as an attribute on the variable was the communication mechanism for this.

The other way this could be done is with a map `layer.variable_path_to_optimizer_map: dict[str, keras.optimizers.Optimizer]` that the model could query in order to build up a set of optimizers. If empty or `None`, the model would fall back to the compiled optimizer for all variables.
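A sketch of that map-based alternative, assuming a hypothetical layer that advertises its special-cased variables via `variable_path_to_optimizer_map` and a trainer-side helper that collects them (none of this exists in Keras today):

```python
import keras


class LargeEmbedding(keras.layers.Layer):
    """Hypothetical layer whose table must not be updated by the compiled optimizer."""

    def __init__(self, vocab_size, dim, table_optimizer, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.dim = dim
        self.table_optimizer = table_optimizer

    def build(self, input_shape):
        self.table = self.add_weight(
            shape=(self.vocab_size, self.dim),
            initializer="random_uniform",
            name="table",
        )
        # Advertise which of this layer's variables need a special optimizer.
        self.variable_path_to_optimizer_map = {self.table.path: self.table_optimizer}

    def call(self, ids):
        return keras.ops.take(self.table, ids, axis=0)


def collect_per_variable_optimizers(model):
    # Trainer-side helper: gather every advertised mapping (top-level layers
    # only, for brevity). An empty result means "use the compiled optimizer".
    mapping = {}
    for layer in model.layers:
        mapping.update(getattr(layer, "variable_path_to_optimizer_map", {}))
    return mapping
```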
Could we consider some design where this is handled at the custom-layer level? Go for a similar vibe to add_loss (https://keras.io/api/losses/#the-addloss-api), where an optimizer is attached during build or init. Or give a custom layer the ability to define its own custom apply step somehow. No idea if these are good ideas :). Just drive-by thoughts.
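For illustration, the `add_loss`-style idea might look like this in a custom layer's `build()`, using the `variable.optimizer` attribute this PR proposes (the layer, shapes, and optimizer choice are placeholders):

```python
import keras


class EmbeddingWithCustomUpdate(keras.layers.Layer):
    def build(self, input_shape):
        self.table = self.add_weight(shape=(1000, 64), name="table")
        # Analogous in spirit to self.add_loss(...): the layer declares, at
        # build time, how this particular variable should be updated.
        # `variable.optimizer` is the attribute proposed in this PR.
        self.table.optimizer = keras.optimizers.Adagrad(learning_rate=0.05)

    def call(self, ids):
        return keras.ops.take(self.table, ids, axis=0)
```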
We would still need to mark the variable as requiring special handling somehow. There apparently used to be a layer-level mechanism for custom updates in older versions of Keras, but it's all a lot more complicated than simply allowing us to specify a custom optimizer for specific variables, which essentially accomplishes the same thing.
IMO...