
[PEFT] make the trainer support resume checkpoint from a named adapter #28531 #28547

Closed

Conversation

chenbin11200

What does this PR do?

Fixes #28531
In peft>=0.5.0, when one initializes the PeftModel with an adapter name, like

peft_model = get_peft_model(model=base_model,
                            peft_config=peft_config,
                            adapter_name='my_lora_model_name')

the adapter_config.json and adapter_model.bin files are saved in /my_output_dir/checkpoint-300/my_lora_model_name instead of directly in /my_output_dir/checkpoint-300. That raises a ValueError when trying to resume training from such a checkpoint.
This PR fixes the issue by joining the adapter name into the checkpoint path and loading the adapter from the right subfolder, but only when necessary: if no adapter_name is given, the weight and config files are not saved into a subfolder, and that case is also handled.
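As a rough illustration of the idea only (resolve_adapter_dir is a hypothetical helper, not the code in this PR; the actual change lives in the Trainer's checkpoint-loading logic):

import os

ADAPTER_CONFIG_NAME = "adapter_config.json"

def resolve_adapter_dir(checkpoint_dir, adapter_name=None):
    # With peft >= 0.5.0 and a named adapter, the adapter files are saved in
    # <checkpoint_dir>/<adapter_name>; without a name they sit directly in
    # <checkpoint_dir>.
    if adapter_name is not None:
        candidate = os.path.join(checkpoint_dir, adapter_name)
        if os.path.isfile(os.path.join(candidate, ADAPTER_CONFIG_NAME)):
            return candidate
    return checkpoint_dir

# resolve_adapter_dir("/my_output_dir/checkpoint-300", "my_lora_model_name")
# -> "/my_output_dir/checkpoint-300/my_lora_model_name"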

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum?
    The bug is reported there but not yet discussed. https://github.com/huggingface/transformers/issues/28531
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?
    No unit test, only tested locally for this small change.

Who can review?

@muellerzr and @pacman100

@amyeroberts
Collaborator

cc @younesbelkada

@younesbelkada
Contributor

Makes sense, thank you! Can you also run the styling checks? make fixup

@chenbin11200
Author

Makes sense, thank you! Can you also run the styling checks? make fixup

Hi @younesbelkada, I have run the styling checks. If there is still something missing, please inform me :D

@younesbelkada
Contributor

Thanks a lot for fixing!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@chenbin11200
Author

Thanks a lot for fixing!

My pleasure :D
I guess it has to be approved by @amyeroberts before being merged?

@younesbelkada
Contributor

Yes @chenbin11200 - it will get reviewed as soon as possible and we'll merge the fix into main!

@amyeroberts
Collaborator

Thanks for adding this!

Could you add a test to make sure we can resume from a checkpoint?

@amyeroberts amyeroberts changed the title make the trainer support resume checkpoint from a named adapter #28531 [PEFT] make the trainer support resume checkpoint from a named adapter #28531 Jan 23, 2024
@younesbelkada
Contributor

@chenbin11200 the test would ideally go here: https://github.com/huggingface/transformers/blob/main/tests/peft_integration/test_peft_integration.py ; let me know if you need help designing the test!

@chenbin11200
Author

chenbin11200 commented Jan 24, 2024

@chenbin11200 the test would ideally go here: https://github.com/huggingface/transformers/blob/main/tests/peft_integration/test_peft_integration.py ; let me know if you need help designing the test!

Hi @younesbelkada @amyeroberts,
Thanks for guiding me. Does this "resuming from checkpoint" test (or anything similar) already exist? It would save me some time if I could adapt an existing one :)

@younesbelkada
Contributor

Hi @chenbin11200
After thinking about it a bit, to avoid over-complicating things, you can write a new test similar to this one: https://github.com/huggingface/transformers/blob/main/tests/trainer/test_trainer.py#L874 by adding the require_peft decorator and making sure training runs fine when resuming from a checkpoint with a named adapter.
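A rough sketch of what such a test could look like; the tiny model id, dataset setup, LoraConfig arguments, and test name here are illustrative assumptions, not the test eventually added to this PR:

import tempfile

from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from transformers.testing_utils import require_peft


@require_peft
def test_resume_training_with_named_adapter():
    model_id = "hf-internal-testing/tiny-random-OPTForCausalLM"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    base_model = AutoModelForCausalLM.from_pretrained(model_id)

    # A named adapter makes peft save its files in a subfolder of each checkpoint.
    peft_model = get_peft_model(
        base_model, LoraConfig(task_type="CAUSAL_LM"), adapter_name="my_lora_model_name"
    )

    # Tiny dummy language-modelling dataset: labels are just the input ids.
    dataset = Dataset.from_dict(tokenizer(["hello world"] * 16)).map(
        lambda row: {"labels": row["input_ids"]}
    )

    with tempfile.TemporaryDirectory() as tmp_dir:
        args = TrainingArguments(
            output_dir=tmp_dir,
            max_steps=4,
            save_steps=2,
            per_device_train_batch_size=2,
            report_to="none",
        )
        Trainer(model=peft_model, args=args, train_dataset=dataset).train()
        # Before the fix, resuming raised a ValueError because the adapter weights
        # are stored under checkpoint-*/my_lora_model_name rather than checkpoint-*/.
        Trainer(model=peft_model, args=args, train_dataset=dataset).train(
            resume_from_checkpoint=True
        )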

@chenbin11200
Author

Hi @chenbin11200 After thinking about it a bit, to avoid over-complicating things, you can write a new test similar to this one: https://github.com/huggingface/transformers/blob/main/tests/trainer/test_trainer.py#L874 by adding the require_peft decorator and making sure training runs fine when resuming from a checkpoint with a named adapter.

Thank you @younesbelkada, I will try that.

@chenbin11200 chenbin11200 force-pushed the fix_resume_named_peft branch from c55c79c to 9cf826a Compare March 5, 2024 07:39
@chenbin11200
Author

Hi @younesbelkada,
Sorry to disturb you, but I really need a hand with this unit test CI issue.
I have written a unit test for my changes, and it passes locally. However, it looks like it breaks almost all of the CI checks, and I have no idea how to inspect the errors in detail. It looks weird to me, since the code changes without the test pass them smoothly. Would you please take a look when you have time?
Thanks in advance :)

@younesbelkada
Contributor

No problem at all!
Hmm, that's strange. Can you merge your branch with upstream main?

@chenbin11200
Author

No problem at all! Hmm, that's strange. Can you merge your branch with upstream main?

@younesbelkada I see... After merging the main branch, the failures are gone. Cool, I didn't realize the CI checks run against the upstream main branch.
Thank you for the hint. Then I guess this PR is ready to merge? Please check :D

@amyeroberts
Collaborator

Thanks for iterating and adding tests!

Just a few small comments

Comment on lines +3510 to +3513
compute_metrics = kwargs.pop("compute_metrics", None)
optimizers = kwargs.pop("optimizers", (None, None))
output_dir = kwargs.pop("output_dir", "./regression")
preprocess_logits_for_metrics = kwargs.pop("preprocess_logits_for_metrics", None)
Collaborator

It's best to keep logic for tests as simple and explicit as possible. Here, rather than popping from kwargs, these should be kwargs with default values in get_regression_adapter_trainer

Author

Basically you are right. I imitated the coding style of other test cases in test_trainer.py, which mix some args into kwargs and pop them later. What do you think? Should I keep the coding style consistent with them, or separate the values into method args? That's not a big effort.

Collaborator

I'd rather they were explicitly set, or, tbh because this method is only used in one place, just hardcoded
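For illustration, the suggestion amounts to something like the following sketch (get_regression_adapter_trainer is the helper named in the review comment above; the body is elided and only the four values being discussed are shown):

def get_regression_adapter_trainer(
    compute_metrics=None,
    optimizers=(None, None),
    output_dir="./regression",
    preprocess_logits_for_metrics=None,
    **kwargs,
):
    # Explicit keyword arguments with defaults, rather than popping the same
    # values out of **kwargs inside the function body.
    ...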

@chenbin11200
Author

@amyeroberts Hi, I have resolved most of the comments and left a single point to discuss with you. :)

@huggingface huggingface deleted a comment from github-actions bot Apr 8, 2024
@younesbelkada
Contributor

Hi @chenbin11200
Thanks for all your efforts! What is the status of this PR? Is it ready for review? Can you also rebase with main?

@huggingface huggingface deleted a comment from github-actions bot May 3, 2024
github-actions bot

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Jun 5, 2024
@chenbin11200
Author

Hi @chenbin11200 Thanks for all your efforts! What is the status of this PR? Is it ready for review? Can you also rebase with main?

Hi @younesbelkada, sorry for the delay, it has been a crazy few days... This issue seems to have been fixed in #30505, so this PR is safe to close. Thank you for your support and enthusiasm, and I look forward to our next collaboration :D

@younesbelkada
Contributor

Thank you very much @chenbin11200 !
