FEAT: Add LOMO optimizer #2695
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
This is a very good start, left some design comments that are more in tune with what I'd expect for Accelerate :)
src/accelerate/accelerator.py (Outdated)
if is_lomo_available():
    from lomo_optim import AdaLomo, Lomo
This should be done at the top of the file
Doing that leads to an error due to circular imports 😢 This is because lomo_optim imports from transformers, which itself imports from accelerate. Added a comment there.
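For illustration, a minimal sketch of the lazy-import workaround being described (it assumes the `is_lomo_available` helper this PR adds to `accelerate.utils`; the standalone function name is purely illustrative):

```python
# Sketch only: defer the lomo_optim import until it is actually needed.
# A module-level import would trigger lomo_optim -> transformers -> accelerate
# and fail with a circular import.
from accelerate.utils import is_lomo_available  # availability check assumed from this PR


def optimizer_is_lomo(optimizer) -> bool:
    """Hypothetical helper: True if `optimizer` is a Lomo/AdaLomo instance."""
    if not is_lomo_available():
        return False
    # Imported lazily, inside the function, to avoid the circular import above.
    from lomo_optim import AdaLomo, Lomo

    return isinstance(optimizer, (Lomo, AdaLomo))
```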
src/accelerate/accelerator.py (Outdated)
elif learning_rate is not None and self._has_lomo_optimizer:
    self._lomo_backward(loss, learning_rate)
Does this need to interact with self.scaler for AMP? That may be important!
Hmm, not sure here actually 🤯 I need to investigate a bit.
if is_lomo_available():
    from lomo_optim import AdaLomo, Lomo
Same thing as earlier, but let's move the detection for Lomo to happen in here rather than in the Accelerator. Then we can make an attribute self.is_lomo_optimizer.
see: #2695 (comment)
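A hedged sketch of how that detection could sit on the optimizer wrapper rather than on the Accelerator (a minimal stand-in class, not Accelerate's actual AcceleratedOptimizer; the attribute name follows the suggestion above):

```python
import torch

from accelerate.utils import is_lomo_available  # assumed helper from this PR


class OptimizerWrapperSketch:
    """Minimal stand-in for AcceleratedOptimizer showing where LOMO detection could live."""

    def __init__(self, optimizer: torch.optim.Optimizer):
        self.optimizer = optimizer
        self.is_lomo_optimizer = False
        if is_lomo_available():
            # Lazy import, same circular-import reasoning as above.
            from lomo_optim import AdaLomo, Lomo

            self.is_lomo_optimizer = isinstance(optimizer, (Lomo, AdaLomo))
```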
src/accelerate/accelerator.py (Outdated)
@@ -2031,6 +2041,8 @@ def backward(self, loss, **kwargs):
             return
         elif self.scaler is not None:
             self.scaler.scale(loss).backward(**kwargs)
+        elif learning_rate is not None and self._has_lomo_optimizer:
Eventually, if we have more optimizers needing this, we may want to abstract this to a backward_func.
Yes makes sense!
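Purely for illustration, one possible shape of such a backward_func abstraction (names and structure are assumptions, not the PR's code):

```python
# Sketch only: choose the backward behaviour once, instead of growing the
# if/elif chain inside Accelerator.backward() for every special optimizer.
from typing import Callable, Optional


def make_backward_func(scaler=None, lomo_backward: Optional[Callable] = None) -> Callable:
    if lomo_backward is not None:
        # LOMO fuses the parameter update into the backward pass, so the
        # learning rate must be threaded through to it.
        return lambda loss, learning_rate=None, **kwargs: lomo_backward(loss, learning_rate)
    if scaler is not None:
        # AMP path: scale the loss before backpropagating.
        return lambda loss, **kwargs: scaler.scale(loss).backward(**kwargs)
    # Default path: plain backward.
    return lambda loss, **kwargs: loss.backward(**kwargs)
```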
src/accelerate/accelerator.py (Outdated)
@@ -2031,6 +2041,8 @@ def backward(self, loss, **kwargs):
             return
         elif self.scaler is not None:
             self.scaler.scale(loss).backward(**kwargs)
+        elif learning_rate is not None and self._has_lomo_optimizer:
+            self._lomo_backward(loss, learning_rate)
Let's make this lomo_backwards instead of hiding it. Also, I'd rather see us raise an error if a learning rate isn't part of kwargs and this is True, rather than silently failing.
Makes total sense, refactored things a bit in 3a518dc, LMK what you think!
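A rough sketch of the explicit failure being asked for (a guess at the shape, not the code from 3a518dc):

```python
def check_lomo_backward_args(has_lomo_optimizer: bool, learning_rate) -> None:
    """Hypothetical guard: refuse to silently skip the LOMO update."""
    if has_lomo_optimizer and learning_rate is None:
        raise ValueError(
            "A LOMO optimizer was prepared, but no `learning_rate` was passed to "
            "`backward()`. LOMO applies its parameter update during the backward "
            "pass and needs the learning rate to do so."
        )
```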
src/accelerate/accelerator.py (Outdated)
if is_lomo_available():
    from lomo_optim import AdaLomo, Lomo
else:
    raise ValueError("`lomo_optim` package is needed to call backward on LOMO optimizers")
This should already be caught in the prepare_optimizer portion, so it should be fine.
Makes sense, removed it!
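For reference, a hedged sketch of the kind of check that would live in the prepare step instead (illustrative only; the function and its argument are assumptions):

```python
# Sketch only: if a LOMO optimizer is requested but lomo_optim is missing,
# the prepare step is the natural place to fail rather than backward().
from accelerate.utils import is_lomo_available  # assumed helper from this PR


def ensure_lomo_installed_if_requested(optimizer_name: str) -> None:
    if optimizer_name.lower() in ("lomo", "adalomo") and not is_lomo_available():
        raise ImportError(
            "You asked for a LOMO optimizer but the `lomo_optim` package is not "
            "installed. Install it with `pip install lomo-optim`."
        )
```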
Nice! This is looking fantastic! The last bit I'd like is a quick example in examples/ using LOMO, but for the sake of today's release we can get this in.
Nice, thanks so much! OK, I will work on an example that uses accelerate + LOMO and open a follow-up PR!
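While that follow-up example is not part of this PR, here is a hedged sketch of what a minimal accelerate + LOMO loop might look like. It assumes the `learning_rate` keyword this PR adds to `Accelerator.backward`, and that `Lomo` takes the model itself (since it fuses the update into backward); neither is taken from the merged example script.

```python
# Sketch only: minimal accelerate + LOMO loop, not the promised examples/ script.
import torch
from torch.utils.data import DataLoader, TensorDataset

from accelerate import Accelerator
from lomo_optim import Lomo  # assumed constructor: Lomo(model, lr=...)

accelerator = Accelerator()
model = torch.nn.Linear(16, 1)
optimizer = Lomo(model, lr=1e-3)  # assumption: LOMO wraps the model, not model.parameters()

dataset = TensorDataset(torch.randn(128, 16), torch.randn(128, 1))
dataloader = DataLoader(dataset, batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    loss = torch.nn.functional.mse_loss(model(x), y)
    # LOMO applies its parameter update during the backward pass, so the
    # learning rate is passed here; no separate optimizer.step() is shown.
    accelerator.backward(loss, learning_rate=1e-3)
```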
# transformers & accelerate
from lomo_optim import AdaLomo, Lomo

self.has_lomo_optimizer = isinstance(optimizer, (Lomo, AdaLomo))
For my understanding, we assume that the accelerator could have multiple self._optimizers, with some of them LOMO and others not. Would that not create the issue that self.has_lomo_optimizer takes the value based on whatever the last optimizer is? Would we not have to set: self.has_lomo_optimizer |= isinstance(optimizer, (Lomo, AdaLomo))?
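To make the concern concrete, a small self-contained sketch of the difference (illustrative, not the PR's code):

```python
# Sketch only: with plain assignment the flag reflects whichever optimizer was
# processed last; accumulating with |= keeps it True if any optimizer is LOMO.
import torch

try:
    from lomo_optim import AdaLomo, Lomo
    lomo_classes = (Lomo, AdaLomo)
except ImportError:  # keep the sketch importable without lomo_optim installed
    lomo_classes = ()

model = torch.nn.Linear(4, 1)
optimizers = [torch.optim.AdamW(model.parameters(), lr=1e-3)]  # imagine a Lomo instance here too

has_lomo_optimizer = False
for optimizer in optimizers:
    # last-wins:  has_lomo_optimizer = isinstance(optimizer, lomo_classes)
    has_lomo_optimizer |= isinstance(optimizer, lomo_classes)  # any-wins
```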
WIP - on par with huggingface/transformers#30178