Fix _save_tpu: use _maybe_convert_to_cpu instead of to cpu. #31264
Conversation
Thanks! A quick comment I'd like addressed, please.
@@ -3423,28 +3421,25 @@ def _save_tpu(self, output_dir: Optional[str] = None):
     self.accelerator.unwrap_model(model).save_pretrained(
         output_dir,
         is_main_process=self.args.should_save,
-        state_dict=model.state_dict(),
+        state_dict=xm._maybe_convert_to_cpu(model.state_dict()),
I'm concerned that we need to rely on a private _maybe_convert_to_cpu function here. In which version of torch_xla was this introduced? Recently? (We need to be careful with that, even if saving is still broken for users not on that xla version.)
This func has always been present in torch_xla, and it should have been available as early as version 1.5.0.
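For illustration only, a hypothetical guard (not part of this PR) that would sidestep the private-API concern by falling back to a manual per-tensor copy if the helper were ever missing:

import torch_xla.core.xla_model as xm

def state_dict_to_cpu(state_dict):
    # Prefer torch_xla's helper, which handles nested structures and
    # leaves tensors already on CPU untouched.
    if hasattr(xm, "_maybe_convert_to_cpu"):
        return xm._maybe_convert_to_cpu(state_dict)
    # Fallback: copy each tensor to CPU by hand.
    return {k: v.cpu() for k, v in state_dict.items()}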
Great! Overall LGTM then. Can you rebase from main to hopefully fix those failing tests? cc @amyeroberts
@@ -3407,8 +3407,6 @@ def _save_tpu(self, output_dir: Optional[str] = None):
     logger.info(f"Saving model checkpoint to {output_dir}")
     model = self.model
     xm.mark_step()
-    if self.args.save_safetensors:
Could you check whether your change works with args.save_safetensors=False?
Tests pass with save_safetensors=False. Actually, xm.save will also call _maybe_convert_to_cpu, and when the tensors are already on CPU this function does nothing. Therefore, when save_safetensors=True it's necessary to first convert the state_dict to CPU, and when save_safetensors=False, converting it to CPU before xm.save does not introduce additional cost either.
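A small sketch of that idempotence argument, assuming a torch_xla environment (the checkpoint path is illustrative):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
state_dict = {"weight": torch.randn(2, 2, device=device)}

# The first conversion copies the XLA tensors to CPU...
cpu_state_dict = xm._maybe_convert_to_cpu(state_dict)
# ...and a second conversion (as xm.save performs internally) is a no-op
# on tensors that are already on CPU, so converting up front adds no cost
# on the save_safetensors=False path.
cpu_state_dict = xm._maybe_convert_to_cpu(cpu_state_dict)
assert all(t.device.type == "cpu" for t in cpu_state_dict.values())
xm.save(cpu_state_dict, "/tmp/ckpt.pt")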
LGTM! Could you have a look @shub-kris, as you created the issue in the first place?
Thanks for digging into this and fixing!
Fix _save_tpu: use _maybe_convert_to_cpu instead of to cpu. (huggingface#31264) * Fix _save_tpu: use _maybe_convert_to_cpu instead of to cpu. * fix lint
What does this PR do?
When saving a checkpoint, we execute a model.to('cpu') transition and subsequently revert it with model.to('xla'). While these operations do not directly affect the model, the optimizer remains connected to the model parameters as they were before the move to 'cpu'. Consequently, after saving a checkpoint there is a misalignment between the optimizer's parameters and those within the model. Following optimizer.step(), executing zero_grad() sets the gradients to None; because the optimizer is no longer associated with the model's parameters, those parameters never receive gradient updates again and their gradients stay None. When performing all_reduce, this can produce an all_reduce([]) call on an empty list, leading to a core dump in torch_xla. This commit uses _maybe_convert_to_cpu to replace the to('cpu') operations.

Fixes pytorch/xla#6620
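For illustration, a minimal sketch (assuming torch_xla; this is not the exact Trainer code) of the pattern this PR moves to: copy the state_dict to CPU instead of round-tripping the model through to('cpu') / to('xla').

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = nn.Linear(8, 8).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Old pattern (problematic): model.to('cpu'); save; model.to('xla').
# The round trip can leave the optimizer referencing the pre-move
# parameter tensors, so later steps update nothing and the all_reduce
# of an empty gradient list crashes torch_xla.

# New pattern: copy the tensors to CPU without touching the model.
cpu_state_dict = xm._maybe_convert_to_cpu(model.state_dict())
torch.save(cpu_state_dict, "/tmp/checkpoint.pt")

# Training continues against the same on-device parameters.
loss = model(torch.randn(4, 8, device=device)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()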
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.