Fix gather when collecting 'num_input_tokens_seen' #31974
Conversation
Note that pacman no longer works at HF, so you can ping me in the future :)
This makes sense, since we need to eventually gather. cc @SunMarc
Can you run
Nice! Thanks for fixing!
@muellerzr Thanks for the pointer! I ran the code formatter as suggested.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for fixing!
* Move token count to device before gathering
* Run 'make style; make quality'
What does this PR do?
Following from #29099, the all-gather for `num_input_tokens_seen` still gets stuck in distributed training, because the token-count tensors are created on CPU and have not yet been moved to the device in `_prepare_inputs`. This still results in torch.distributed errors.

This PR simply moves the token count to `self.args.device` before gathering, and then moves it back to CPU afterwards.

This problem was also mentioned in the discussion of issue #28791, but that issue was closed when the padding was fixed.
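The fix described above can be sketched as follows. This is a minimal single-process illustration of the pattern (CPU tensor → device → gather → sum → back to CPU), not the exact `Trainer` code; `gather_token_count` and `fake_gather` are hypothetical names introduced here, and `gather_fn` stands in for something like `accelerator.gather`.

```python
import torch


def gather_token_count(local_tokens: torch.Tensor, device: torch.device, gather_fn) -> int:
    """Move the per-step token count to `device` before gathering,
    then sum and bring the result back to CPU as a plain int.

    `gather_fn` is a stand-in for the distributed all-gather
    (e.g. `accelerator.gather` in the real Trainer)."""
    gathered = gather_fn(local_tokens.to(device))
    return int(gathered.sum().cpu().item())


def fake_gather(t: torch.Tensor) -> torch.Tensor:
    # Pretend two ranks each saw the same batch; a real all-gather would
    # collect one tensor per process instead.
    return torch.stack([t, t.clone()])


count = torch.tensor(128)  # token count produced on CPU in _prepare_inputs
total = gather_token_count(count, torch.device("cpu"), fake_gather)
print(total)  # 2 "ranks" x 128 tokens
```

Gathering the tensor while it is still on CPU is what caused the hang: `torch.distributed` collectives with NCCL expect device tensors, so the move to `self.args.device` has to happen before the gather, and the move back to CPU only afterwards.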
Who can review?
@pacman100
@muellerzr