Try to fix training loss inconsistency after resuming from an old checkpoint #25872
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
cc @muellerzr
Hi @dumpmemory thanks! Can you do
I will check it again.
Done |
@dumpmemory what does the following show:
Thanks! Looks good to me and my tests all pass locally
@amyeroberts feel free to merge if it looks good to you
I'm OK with this PR 😁. Thanks for your support.
Thanks for working on fixing this! Overall change looks OK, however the logic should be simplified.
@muellerzr what was the reason for removing this logic originally?
Originally we had thought Accelerate handled this, but it turns out it does not
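For context, a minimal sketch (not code from this PR) of why the sampler's random state matters on resume: a RandomSampler only reproduces the same shuffle order when its generator is seeded the same way, which is what lets "skip the already-seen batches" land on the right data.

import torch
from torch.utils.data import RandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(10))

def epoch_order(seed: int):
    # Iterating the sampler yields dataset indices in shuffled order.
    generator = torch.Generator()
    generator.manual_seed(seed)
    return list(RandomSampler(dataset, generator=generator))

# Same seed -> same order, so skipping the first N batches after a resume
# revisits exactly the remaining samples of the interrupted epoch.
assert epoch_order(42) == epoch_order(42)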
@amyeroberts, please help me check the current version.
@amyeroberts, can the current version be merged? Is there anything else I need to change? Please just tell me.
@dumpmemory please have a bit of patience, our team works across multiple timezones and has many other PRs and responsibilities to get to aside from this one. We'll get to this when we can, please don't spam :) Thanks
Thanks for iterating - the code is looking good!
Just a comment on the utility function - we want functions to be as atomic as possible. Once updated we'll be good to merge.
src/transformers/trainer_pt_utils.py
Outdated
def check_dataloader_randomsampler(dataloader):
    if hasattr(dataloader, "sampler") and isinstance(dataloader.sampler, RandomSampler):
        return dataloader.sampler, True
    if hasattr(dataloader, "batch_sampler"):
        return check_dataloader_randomsampler(dataloader.batch_sampler)
    return dataloader.sampler, False
This should just return the sampler, and then the user can choose what they do with the output e.g. check if it's a random sampler. This ensures the function is as versatile as possible and can be used / extended without issue.
Suggested change (replacing check_dataloader_randomsampler above):

def get_dataloader_sampler(dataloader):
    if hasattr(dataloader, "sampler"):
        return dataloader.sampler
    if hasattr(dataloader, "batch_sampler"):
        return get_dataloader_sampler(dataloader.batch_sampler)
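As a usage note on the reviewer's point, the caller would then decide what to do with the returned sampler. A minimal sketch of a possible call site (train_dataloader is an assumed variable, not part of the diff):

from torch.utils.data import RandomSampler

sampler = get_dataloader_sampler(train_dataloader)
if isinstance(sampler, RandomSampler):
    # e.g. restore or re-seed sampler.generator from the checkpoint state here
    pass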
As I found in #25862, hasattr(dataloader, "sampler") might not be enough: after accelerate's prepare function, dataloader.sampler changes from a RandomSampler to torch.utils.data.sampler.SequentialSampler. I will modify the code to just return the sampler.
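For illustration, one way to observe the behaviour described here (a sketch; the exact wrapper classes depend on the accelerate version, so the printed types are indicative only):

import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(10))
dataloader = DataLoader(dataset, batch_size=2, sampler=RandomSampler(dataset))

prepared = Accelerator().prepare(dataloader)

print(type(getattr(dataloader, "sampler", None)))      # RandomSampler
print(type(getattr(prepared, "sampler", None)))        # may report SequentialSampler after prepare
print(type(getattr(prepared, "batch_sampler", None)))  # the original RandomSampler ends up nested under here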
Thanks for iterating on this. Just one last comment on the structure of get_dataloader_sampler
@amyeroberts How about the current version? I have checked the sampler in the final if statement.
@dumpmemory Could you explain in some more detail why the suggested implementation of get_dataloader_sampler isn't the one being used? For the current diff, it's not clear why some of the additional logic, e.g. checking isinstance, is added.
thanks @amyeroberts Co-authored-by: amyeroberts <[email protected]>
Thanks for your reviews. I think it is ready now. Thanks for your kind help.
Thanks for iterating!
Great work!
@dumpmemory There's a currently failing test (which I believe is unrelated to your PR). Could you rebase on main to include any recent updates on this branch and trigger a re-run of the CI?
OK, I will do that.
huggingface#25872) * fix loss inconsistent after resume huggingface#25340 * fix typo * clean code * reformatted code * adjust code according to comments * adjust check_dataloader_randomsampler location * return sampler only * handle sampler is None * Update src/transformers/trainer_pt_utils.py thanks @amyeroberts Co-authored-by: amyeroberts <[email protected]> --------- Co-authored-by: amyeroberts <[email protected]>
What does this PR do?
Fixes #25340 (issue)
From my side, it seems related to the RandomSampler. I just re-copied the logic from 4.29.2.
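For a rough idea of the kind of logic this refers to, here is a simplified, hypothetical sketch (not the exact diff; restore_sampler_rng and checkpoint_seed are made-up names, and get_dataloader_sampler is the helper discussed in this PR):

import torch
from torch.utils.data import RandomSampler

def restore_sampler_rng(train_dataloader, checkpoint_seed: int):
    # If the underlying sampler is a RandomSampler, give it a generator seeded from
    # the checkpoint so the resumed epoch shuffles the same way; the trainer can then
    # skip the batches that were already consumed before the interruption.
    sampler = get_dataloader_sampler(train_dataloader)
    if isinstance(sampler, RandomSampler):
        generator = torch.Generator()
        generator.manual_seed(checkpoint_seed)
        sampler.generator = generator
    return sampler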
Before submitting
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.