
Comparison with PEFT #19

Open
LaVieEnRose365 opened this issue Mar 14, 2024 · 1 comment

@LaVieEnRose365
Hi there! It's really interesting work, but I have the following questions:

  1. The proposed block expansion seems quite similar to the idea of Adapter Tuning. Can you explain the main difference?
  2. The results show that more expansion blocks lead to better performance, adding about 1B parameters in total. Block expansion is claimed to be superior to LoRA; however, LoRA's low-rank design means it adds far fewer parameters. Did you compare the performance of block expansion and LoRA under the same number of additional parameters?

It would be a pleasure if you could reply.
@hills-code
Collaborator

Thanks for your attention!

I think the main difference between our work and PEFT methods is that we scale up the model's parameters. We have seen the power of scaling in models like GPT, Claude, and so on. We ran an experiment in which LoRA tuned as many parameters as we add through block expansion; however, it still did not generalize well in the specific domain. We hypothesize that PEFT methods are limited in their capacity to absorb new knowledge, which is important in (continual) pretraining. PEFT is useful for SFT: as one group recently observed (URIAL), at the SFT stage the model mainly learns style or format. So I think PEFT methods are better suited to learning style or format, while learning more knowledge requires dense parameters, which are acquired in pretraining.
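
For intuition, here is a minimal sketch of what block expansion looks like. The module names (`self_attn.o_proj`, `mlp.down_proj`) assume HF LLaMA-style decoder layers, and this is only an illustration of the idea, not our exact training code:

```python
import copy
import torch.nn as nn

def expand_blocks(layers: nn.ModuleList, num_new: int) -> nn.ModuleList:
    """Interleave `num_new` copied decoder blocks among the originals.

    Each copy has its residual-branch output projections zeroed, so the
    expanded model initially computes the same function as the base model;
    only the new blocks are left trainable for continual pretraining.
    """
    group = len(layers) // num_new  # add one copy after every `group` original blocks
    expanded = nn.ModuleList()
    for i, layer in enumerate(layers):
        layer.requires_grad_(False)          # freeze original blocks
        expanded.append(layer)
        if (i + 1) % group == 0:
            new_layer = copy.deepcopy(layer)
            new_layer.requires_grad_(True)   # only new blocks are trained
            # Zero the outputs that feed the residual stream (names assume
            # HF LLaMA-style layers), so the copy starts as an identity block.
            nn.init.zeros_(new_layer.self_attn.o_proj.weight)
            nn.init.zeros_(new_layer.mlp.down_proj.weight)
            expanded.append(new_layer)
    return expanded
```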

Recently, another interesting work, Yi-9B, also observed this property. It also uses depth expansion and then trains on math and code corpora. The authors mention that without scaling the parameters, continual training only marginally improves performance.

So basically, I think the main difference is that we increase the parameters of the initial model to do continual training, while PEFT is more suitable for the subsequent SFT stage.
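
To give a rough sense of the parameter budgets involved, here is a back-of-the-envelope comparison with LLaMA-7B-like sizes. The numbers are purely illustrative, not our exact configuration:

```python
# Hidden size, MLP intermediate size, and an example LoRA rank.
h, m, r = 4096, 11008, 64

# One added transformer block: 4 attention projections + 3 MLP projections.
block = 4 * h * h + 3 * h * m                            # ~202M parameters
# LoRA of rank r on the same seven projections: r * (d_in + d_out) each.
lora = 4 * r * (h + h) + 2 * r * (h + m) + r * (m + h)   # ~5M parameters

print(f"expanded block: {block/1e6:.0f}M, LoRA r={r}: {lora/1e6:.1f}M")
```

So matching the added parameter count with LoRA requires either a very high rank or applying adapters very widely, which is essentially the matched-budget experiment mentioned above.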

I hope this will be helpful!
