
[PEFT] make the trainer support resume checkpoint from a named adapter #28531 #28547

Closed

Conversation

chenbin11200

What does this PR do?

Fixes #28531
In peft>=0.5.0, when one initializes the PeftModel with an adapter name, like

peft_model = get_peft_model(model=base_model,
                            peft_config=peft_config,
                            adapter_name='my_lora_model_name')

the adapter_config.json and adapter_model.bin files are saved in /my_output_dir/checkpoint-300/my_lora_model_name instead of directly in /my_output_dir/checkpoint-300. That raises a ValueError when trying to resume training from such a checkpoint.
This PR fixes the issue by joining the adapter name into the checkpoint path and loading the adapter from the right subfolder, but only when necessary: if no adapter_name is given, the weight and config files are not saved into a subfolder, and that case is also handled.
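As a rough illustration of the idea only (resolve_adapter_dir is a hypothetical helper, not the code in this PR; the actual change lives in the Trainer's checkpoint-loading logic):

import os

ADAPTER_CONFIG_NAME = "adapter_config.json"

def resolve_adapter_dir(checkpoint_dir, adapter_name=None):
    # With peft >= 0.5.0 and a named adapter, the adapter files are saved in
    # <checkpoint_dir>/<adapter_name>; without a name they sit directly in
    # <checkpoint_dir>.
    if adapter_name is not None:
        candidate = os.path.join(checkpoint_dir, adapter_name)
        if os.path.isfile(os.path.join(candidate, ADAPTER_CONFIG_NAME)):
            return candidate
    return checkpoint_dir

# resolve_adapter_dir("/my_output_dir/checkpoint-300", "my_lora_model_name")
# -> "/my_output_dir/checkpoint-300/my_lora_model_name"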

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum?
    The bug is reported there but not yet discussed. https://github.com/huggingface/transformers/issues/28531
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?
    No unit test, only tested locally for this small change.

Who can review?

@muellerzr and @pacman100

@amyeroberts
Collaborator

cc @younesbelkada

@younesbelkada
Contributor

Makes sense, thank you! Can you also run the styling checks? make fixup

@chenbin11200
Author

Makes sense, thank you! Can you also run the styling checks? make fixup

Hi @younesbelkada, I have run the styling checks. If there is still something missing, please inform me :D

@younesbelkada
Contributor

Thanks a lot for fixing!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@chenbin11200
Author

Thanks a lot for fixing!

My pleasure :D
I guess it has to be approved by @amyeroberts before being merged?

@younesbelkada
Contributor

Yes @chenbin11200 - it will get reviewed as soon as possible and we'll merge the fix into main!

@amyeroberts
Collaborator

Thanks for adding this!

Could you add a test to make sure we can resume from a checkpoint?

@amyeroberts amyeroberts changed the title make the trainer support resume checkpoint from a named adapter #28531 [PEFT] make the trainer support resume checkpoint from a named adapter #28531 Jan 23, 2024
@younesbelkada
Contributor

@chenbin11200 the test would ideally go here: https://github.com/huggingface/transformers/blob/main/tests/peft_integration/test_peft_integration.py ; let me know if you need help designing the test!

@chenbin11200
Author

chenbin11200 commented Jan 24, 2024

@chenbin11200 the test would ideally go here: https://github.com/huggingface/transformers/blob/main/tests/peft_integration/test_peft_integration.py ; let me know if you need help designing the test!

Hi @younesbelkada @amyeroberts,
Thanks for guiding me. Does this "resuming from checkpoint" test (or anything similar) already exist? It would save me some time if I could adapt an existing one :)

@younesbelkada
Contributor

Hi @chenbin11200
After thinking about it a bit, to avoid over-complicating things, you can write a new test similar to this one: https://github.com/huggingface/transformers/blob/main/tests/trainer/test_trainer.py#L874 by adding the require_peft decorator and making sure training runs fine when resuming from a checkpoint with a named adapter.
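A rough sketch of what such a test could look like; the tiny model id, dataset setup, LoraConfig arguments, and test name here are illustrative assumptions, not the test eventually added to this PR:

import tempfile

from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from transformers.testing_utils import require_peft


@require_peft
def test_resume_training_with_named_adapter():
    model_id = "hf-internal-testing/tiny-random-OPTForCausalLM"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    base_model = AutoModelForCausalLM.from_pretrained(model_id)

    # A named adapter makes peft save its files in a subfolder of each checkpoint.
    peft_model = get_peft_model(
        base_model, LoraConfig(task_type="CAUSAL_LM"), adapter_name="my_lora_model_name"
    )

    # Tiny dummy language-modelling dataset: labels are just the input ids.
    dataset = Dataset.from_dict(tokenizer(["hello world"] * 16)).map(
        lambda row: {"labels": row["input_ids"]}
    )

    with tempfile.TemporaryDirectory() as tmp_dir:
        args = TrainingArguments(
            output_dir=tmp_dir,
            max_steps=4,
            save_steps=2,
            per_device_train_batch_size=2,
            report_to="none",
        )
        Trainer(model=peft_model, args=args, train_dataset=dataset).train()
        # Before the fix, resuming raised a ValueError because the adapter weights
        # are stored under checkpoint-*/my_lora_model_name rather than checkpoint-*/.
        Trainer(model=peft_model, args=args, train_dataset=dataset).train(
            resume_from_checkpoint=True
        )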

@chenbin11200
Author

Hi @chenbin11200 After thinking about it a bit, to avoid over-complicating things, you can write a new test similar to this one: https://github.com/huggingface/transformers/blob/main/tests/trainer/test_trainer.py#L874 by adding the require_peft decorator and making sure training runs fine when resuming from a checkpoint with a named adapter.

Thank you @younesbelkada, I will try that.

@chenbin11200 chenbin11200 force-pushed the fix_resume_named_peft branch from c55c79c to 9cf826a Compare March 5, 2024 07:39
@chenbin11200
Author

Hi @younesbelkada,
Sorry to disturb you, but I really need a hand with this unit test CI issue.
I have written a unit test for my changes, and it passes locally. However, it looks like it breaks almost all of the CI checks, and I have no idea how to inspect the errors in detail. It looks weird to me, since the code changes without the test pass them smoothly. Would you please take a look when you have time?
Thanks in advance :)

@younesbelkada
Contributor

No problem at all!
Hmm, that's strange. Can you merge your branch with upstream main?

@chenbin11200
Author

No problem at all! Hmm, that's strange. Can you merge your branch with upstream main?

@younesbelkada I see... After merging the main branch, the failures are gone. Cool, I didn't realize the CI checks run against the upstream main branch.
Thank you for the hint. Then I guess this PR is ready to merge? Please check :D

@amyeroberts
Collaborator

Thanks for iterating and adding tests!

Just a few small comments

Comment on lines +3510 to +3513
compute_metrics = kwargs.pop("compute_metrics", None)
optimizers = kwargs.pop("optimizers", (None, None))
output_dir = kwargs.pop("output_dir", "./regression")
preprocess_logits_for_metrics = kwargs.pop("preprocess_logits_for_metrics", None)
Collaborator

It's best to keep logic for tests as simple and explicit as possible. Here, rather than popping from kwargs, these should be kwargs with default values in get_regression_adapter_trainer

Author

Basically you are right. I imitated the coding style of other test cases in test_trainer.py, which mix some args into kwargs and pop them later. What do you think? Should I keep the coding style consistent with them, or separate the values into method args? That's not a big effort.

Collaborator

I'd rather they were explicitly set, or, tbh because this method is only used in one place, just hardcoded
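For illustration, the suggestion amounts to something like the following sketch (get_regression_adapter_trainer is the helper named in the review comment above; the body is elided and only the four values being discussed are shown):

def get_regression_adapter_trainer(
    compute_metrics=None,
    optimizers=(None, None),
    output_dir="./regression",
    preprocess_logits_for_metrics=None,
    **kwargs,
):
    # Explicit keyword arguments with defaults, rather than popping the same
    # values out of **kwargs inside the function body.
    ...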

@chenbin11200
Author

@amyeroberts Hi, I have resolved most of the comments and left a single point to discuss with you. :)

@huggingface huggingface deleted a comment from github-actions bot Apr 8, 2024
@younesbelkada
Contributor

Hi @chenbin11200
Thanks for all your efforts! What is the status of this PR? Is it ready for review? Can you also rebase with main?

@huggingface huggingface deleted a comment from github-actions bot May 3, 2024
github-actions bot

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Jun 5, 2024
@chenbin11200
Author

Hi @chenbin11200 Thanks for all your efforts! What is the status of this PR? Is it ready for review? Can you also rebase with main?

Hi @younesbelkada, sorry for the delay, it has been a crazy few days... This issue seems to have been fixed in #30505, so this PR is safe to close. Thank you for your support and enthusiasm, and I look forward to our next collaboration :D

@younesbelkada
Contributor

Thank you very much @chenbin11200 !
