Try to fix training loss inconsistency after resuming from an old checkpoint #25872
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
cc @muellerzr
Hi @dumpmemory thanks! Can you do
I will check it again.
Done |
@dumpmemory what does the following show:
Thanks! Looks good to me and my tests all pass locally
@amyeroberts feel free to merge if it looks good to you
I'm OK with this PR 😁. Thanks for your support.
Thanks for working on fixing this! Overall change looks OK, however the logic should be simplified.
@muellerzr what was the reason for removing this logic originally?
Originally we had thought Accelerate handled this, but it turns out it does not
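For context, a minimal sketch (not code from this PR) of why the sampler's random state matters on resume: a RandomSampler only reproduces the same shuffle order when its generator is seeded the same way, which is what lets "skip the already-seen batches" land on the right data.

import torch
from torch.utils.data import RandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(10))

def epoch_order(seed: int):
    # Iterating the sampler yields dataset indices in shuffled order.
    generator = torch.Generator()
    generator.manual_seed(seed)
    return list(RandomSampler(dataset, generator=generator))

# Same seed -> same order, so skipping the first N batches after a resume
# revisits exactly the remaining samples of the interrupted epoch.
assert epoch_order(42) == epoch_order(42)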
@amyeroberts, please help me check the current version.
@amyeroberts, can the current version be merged? Is there anything else I need to change? Please just tell me.
@dumpmemory please have a bit of patience, our team works across multiple timezones and has many other PRs and responsibilities to get to aside from this one. We'll get to this when we can, please don't spam :) Thanks
Thanks for iterating - the code is looking good!
Just a comment on the utility function - we want functions to be as atomic as possible. Once updated we'll be good to merge.
src/transformers/trainer_pt_utils.py
Outdated
def check_dataloader_randomsampler(dataloader):
    if hasattr(dataloader, "sampler") and isinstance(dataloader.sampler, RandomSampler):
        return dataloader.sampler, True
    if hasattr(dataloader, "batch_sampler"):
        return check_dataloader_randomsampler(dataloader.batch_sampler)
    return dataloader.sampler, False
This should just return the sampler, and then the user can choose what they do with the output e.g. check if it's a random sampler. This ensures the function is as versatile as possible and can be used / extended without issue.
Suggested change (replacing check_dataloader_randomsampler above):

def get_dataloader_sampler(dataloader):
    if hasattr(dataloader, "sampler"):
        return dataloader.sampler
    if hasattr(dataloader, "batch_sampler"):
        return get_dataloader_sampler(dataloader.batch_sampler)
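As a usage note on the reviewer's point, the caller would then decide what to do with the returned sampler. A minimal sketch of a possible call site (train_dataloader is an assumed variable, not part of the diff):

from torch.utils.data import RandomSampler

sampler = get_dataloader_sampler(train_dataloader)
if isinstance(sampler, RandomSampler):
    # e.g. restore or re-seed sampler.generator from the checkpoint state here
    pass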
As I found in #25862, hasattr(dataloader, "sampler") might not be enough: after accelerate's prepare function, dataloader.sampler changes from a RandomSampler to torch.utils.data.sampler.SequentialSampler. I will modify the code to just return the sampler.
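For illustration, one way to observe the behaviour described here (a sketch; the exact wrapper classes depend on the accelerate version, so the printed types are indicative only):

import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(10))
dataloader = DataLoader(dataset, batch_size=2, sampler=RandomSampler(dataset))

prepared = Accelerator().prepare(dataloader)

print(type(getattr(dataloader, "sampler", None)))      # RandomSampler
print(type(getattr(prepared, "sampler", None)))        # may report SequentialSampler after prepare
print(type(getattr(prepared, "batch_sampler", None)))  # the original RandomSampler ends up nested under here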
Thanks for iterating on this. Just one last comment on the structure of get_dataloader_sampler
@amyeroberts How about the current version? I have checked the sampler in the final if statement.
@dumpmemory Could you explain in some more detail why the suggested implementation of get_dataloader_sampler isn't the one being used? For the current diff, it's not clear why some of the additional logic, e.g. checking isinstance, is added.
thanks @amyeroberts Co-authored-by: amyeroberts <[email protected]>
Thanks for your reviews. I think it is ready now. Thanks for your kind help.
Thanks for iterating!
Great work!
@dumpmemory There's a currently failing test (which I believe is unrelated to your PR). Could you rebase on main to include any recent updates on this branch and trigger a re-run of the CI?
OK, I will do that.
huggingface#25872) * fix loss inconsistent after resume huggingface#25340 * fix typo * clean code * reformatted code * adjust code according to comments * adjust check_dataloader_randomsampler location * return sampler only * handle sampler is None * Update src/transformers/trainer_pt_utils.py thanks @amyeroberts Co-authored-by: amyeroberts <[email protected]> --------- Co-authored-by: amyeroberts <[email protected]>
What does this PR do?
Fixes #25340 (issue)
From my side, it seems related to the RandomSampler. I just re-copied the logic from 4.29.2.
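For a rough idea of the kind of logic this refers to, here is a simplified, hypothetical sketch (not the exact diff; restore_sampler_rng and checkpoint_seed are made-up names, and get_dataloader_sampler is the helper discussed in this PR):

import torch
from torch.utils.data import RandomSampler

def restore_sampler_rng(train_dataloader, checkpoint_seed: int):
    # If the underlying sampler is a RandomSampler, give it a generator seeded from
    # the checkpoint so the resumed epoch shuffles the same way; the trainer can then
    # skip the batches that were already consumed before the interruption.
    sampler = get_dataloader_sampler(train_dataloader)
    if isinstance(sampler, RandomSampler):
        generator = torch.Generator()
        generator.manual_seed(checkpoint_seed)
        sampler.generator = generator
    return sampler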
Before submitting
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.