
Why is loading LoRA weights so slow? #8953

Closed
zengjie617789 opened this issue Jul 24, 2024 · 18 comments

@zengjie617789

zengjie617789 commented Jul 24, 2024

I used diffusers to load LoRA weights, but it takes much too long to finish.
diffusers version: 0.29.2

I tested another version, diffusers 0.23.0 without peft installed, and the time was decent.

# pipe is the StableDiffusionXLPipeline created earlier
t1 = time.time()
pipe.load_lora_weights("/data/**/lora_weights/lcm-lora-sdxl/", weight_name="pytorch_lora_weights.safetensors")
print(f"load lcm lora weights cost: {time.time() - t1}")

[screenshots: timing output for the two diffusers versions]

And if I use an older version of diffusers, a lot of code needs to be modified, which is a lot of work.
Any help would be appreciated.

@tolgacangoz
Contributor

Hmm, you seem to be right. I observed similar 3-5x speed differences between 0.27<=diffusers<=0.29.1 with peft and 0.23<=diffusers<=0.26 without peft. As of diffusers==0.27, peft was made compulsory. peft really does seem to add overhead 🤔

@a-r-r-o-w
Member

cc @sayakpaul

@sayakpaul
Member

Cc: @BenjaminBossan

@BenjaminBossan
Member

Could you please provide a full code example so that I can try to reproduce these timings? Also, what version of PEFT do you have installed?

@zengjie617789
Author

zengjie617789 commented Jul 25, 2024

@BenjaminBossan The code is plain code that loads the model and runs inference; I load the LCM LoRA weights as a test, the same way as with other LoRAs. Code snippet below:

# model_path and device are defined earlier in the script
pipe = StableDiffusionXLPipeline.from_pretrained(model_path, torch_dtype=torch.float16, variant="fp16", use_safetensors=True).to(device)
t1 = time.time()
pipe.load_lora_weights("/data/***/models/lora_weights/lcm-lora-sdxl/", weight_name="pytorch_lora_weights.safetensors")
print(f"load lcm lora weights cost: {time.time() - t1}")

diffusers==0.25.1 peft==0.12.0
The gap in load time comes down to whether or not peft is installed.
[screenshots: timing output with and without peft]

@BenjaminBossan
Member

Thanks for providing more context. Since you use a private adapter, I changed the code a little bit to use one from the Hub:

import time

import torch
from diffusers import StableDiffusionXLPipeline

try:
    import peft
    print("peft is installed")
    print(peft.__version__)
    print(peft.__path__)
except ImportError:
    print("peft is not installed")

device = 0
model_path = "stabilityai/stable-diffusion-xl-base-1.0"

pipe = StableDiffusionXLPipeline.from_pretrained(
    model_path, torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to(device)
t1 = time.time()
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl", weight_name="pytorch_lora_weights.safetensors")
print(f"load lcm lora weights time: {time.time()- t1:.2f}s")

With this, I can confirm these numbers (averaged over 5 runs):

  • diffusers v0.23.0, no PEFT: ~0.7s
  • diffusers v0.23.0, PEFT v0.12.0: ~2.0s
  • diffusers v0.29.0, PEFT v0.12.0: ~2.0s

I investigated further why PEFT is slower. What I found is that with PEFT, when the LoRA layers are created (before the actual checkpoint is loaded), PEFT instantiates real LoRA weights. Without PEFT, empty weights are created (on the meta device).

Next, I tried turning off the use of empty weights to check whether that really makes the difference. However, when passing low_cpu_mem_usage=False to load_lora_weights, which to my understanding controls this behavior, I found that the value is not correctly passed down, so that here it is set to True again. I'm not sure whether that is intentional. To get it to work, I had to hard-code low_cpu_mem_usage=False in this line.

Long story short, after forcing this, I get very similar load times of ~1.7s. So in a sense this is a regression, as the PEFT backend in diffusers does not load empty weights. I'm not quite sure whether this is an oversight, a bug, or intended for some reason. @sayakpaul do you think this could be added for PEFT LoRA loading?
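
To illustrate the difference, here is a minimal sketch (plain PyTorch, not the actual diffusers/PEFT code path; the layer size is arbitrary) of why creating weights on the meta device is cheaper than instantiating real weights:

import time

import torch.nn as nn

def timed(fn):
    t0 = time.time()
    fn()
    return time.time() - t0

# Real initialization: memory is allocated and filled with random values.
t_real = timed(lambda: nn.Linear(4096, 4096))
# Meta initialization: only shapes and dtypes are recorded, no memory is touched.
t_meta = timed(lambda: nn.Linear(4096, 4096, device="meta"))
print(f"real init: {t_real:.4f}s, meta init: {t_meta:.4f}s")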

@sayakpaul
Member

We had brought it up a couple of times in the past, but as you can see the difference is not really that significant, so I am not sure about the value added here given the number of lines of code that would need to be added to PEFT.

@sayakpaul
Member

I also don’t understand what was incorrectly passed down for low_cpu_mem_usage in the function you mentioned. Do you mean the value was overridden when you passed low_cpu_mem_usage=False?

@BenjaminBossan
Member

We had brought it up a couple of times in the past, but as you can see the difference is not really that significant, so I am not sure about the value added here given the number of lines of code that would need to be added to PEFT.

Well, it's up to you and the other diffusers maintainers whether you think this is worth fixing or not. On the PEFT side, I think it could be a feature to allow passing an argument to inject_adapter_in_model that tells PEFT to create the weights only on the meta device; then this change would be easy to make on the diffusers side. I'll put it on the backlog and can notify you once the feature is added.
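
To sketch the general pattern (plain PyTorch here, not PEFT's actual internals; the shapes and the checkpoint dict are made up for illustration): the adapter layer would be created on the meta device and only materialized once the real weights are available.

import torch
import torch.nn as nn

# Create the LoRA-style layer without allocating or initializing any weights.
lora_A = nn.Linear(4096, 8, bias=False, device="meta")

# Later, once the checkpoint has been read (e.g. from a safetensors file),
# materialize the layer and copy the real weights into it.
checkpoint = {"weight": torch.randn(8, 4096)}  # stand-in for the loaded tensors
lora_A = lora_A.to_empty(device="cpu")
lora_A.load_state_dict(checkpoint)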

I also don’t understand what was incorrectly passed down for low_cpu_mem_usage in the function you mentioned. Do you mean the value was overridden when you passed low_cpu_mem_usage=False?

The value is not passed on in this line:

https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/loaders.py#L3234

It is set correctly in the kwargs but is not passed to load_lora_into_unet (not sure whether that is intentional). As this is v0.23.0, there is probably no action to take from this finding, even if it is an error.

@sayakpaul
Member

Ah, you are right. But we have moved on from this already, so I think we can ignore it for now.

On second thought, though, I think faster loading support would indeed be nice, especially for larger adapters. Please let us know when this has shipped in PEFT and we can make the necessary changes in diffusers.

@BenjaminBossan
Member

BenjaminBossan commented Jul 26, 2024

I did some quick-and-dirty experiments and created a draft PR based on them. There is indeed a speed-up for Llama3 8B in this test, although loading the base model is still the overall bottleneck (even with a hot cache and an M.2 SSD).

Overall, I therefore think the impact of this feature will not be huge, especially since this is a one-time cost, but implementing it is not trivial. I'll leave it at the draft PR stage for the time being while working on higher-priority items.
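
For anyone who wants to reproduce this kind of measurement, a rough sketch (the model and adapter IDs below are placeholders, and timings depend heavily on hardware and cache state):

import time

from peft import PeftModel
from transformers import AutoModelForCausalLM

base_id = "meta-llama/Meta-Llama-3-8B"   # placeholder: any base model
adapter_id = "some-user/llama3-8b-lora"  # placeholder: any LoRA adapter trained for that base

t0 = time.time()
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
t1 = time.time()
model = PeftModel.from_pretrained(model, adapter_id)
t2 = time.time()
print(f"base model load: {t1 - t0:.2f}s, adapter load: {t2 - t1:.2f}s")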

@sayakpaul
Member

This is what I had expected too. Thanks for quickly checking!

@zengjie617789
Author

Thank you all. I tested the versions as referred to above; the versions are as below:
[screenshots: installed package versions]

But the load time is still much longer:
[screenshot: timing output with peft installed]

If I test it without peft:
[screenshot: timing output without peft]

The NVIDIA card is an A100.
Obviously, the difference with and without peft is huge. And I wonder how to get the result of 0.7s vs 2s.

@BenjaminBossan
Member

I tested the versions as referred to above; the versions are as below:

Just to be clear, the test I mentioned is in a draft PR, so it is not available in PEFT yet. And even if you installed PEFT from my branch, it would still not work, because diffusers would also need to make some changes to opt into that feature. I will get back to that PR when I have a bit of extra time.

And I wonder how to get the result of 0.7s vs 2s.

These were based on another adapter ("latent-consistency/lcm-lora-sdxl"), since you seem to use a private one that I cannot load. Also, a lot depends on the hardware. When you run the snippet I posted above, what times do you get? Please run it a couple of times to warm up the cache and reduce the variance.
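
For example, a small sketch that averages over several warm runs (this assumes pipe is the pipeline created in the snippet above; unload_lora_weights is used to reset state between runs):

import statistics
import time

times = []
for _ in range(5):
    pipe.unload_lora_weights()  # remove the previously loaded adapter
    t0 = time.time()
    pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl", weight_name="pytorch_lora_weights.safetensors")
    times.append(time.time() - t0)

# Skip the first (cold-cache) run when averaging.
print(f"warm load time: {statistics.mean(times[1:]):.2f}s (stdev {statistics.stdev(times[1:]):.2f}s)")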


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Sep 14, 2024
@BenjaminBossan
Member

not stale, huggingface/peft#1961 is being worked on

@BenjaminBossan BenjaminBossan removed the stale Issues that haven't received updates label Sep 16, 2024
@yiyixuxu yiyixuxu added the peft label Sep 20, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Oct 15, 2024
@a-r-r-o-w a-r-r-o-w removed the stale Issues that haven't received updates label Oct 15, 2024
@sayakpaul
Member

With #9510 and the latest versions of peft and transformers (installed from main as of now), this should be significantly improved. So, I am going to close this. Feel free to reopen.
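
For reference, a sketch of how the faster path might be opted into once those versions are installed (the low_cpu_mem_usage argument name on load_lora_weights is an assumption based on the discussion above; check the release notes of the installed versions):

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")
pipe.load_lora_weights(
    "latent-consistency/lcm-lora-sdxl",
    weight_name="pytorch_lora_weights.safetensors",
    low_cpu_mem_usage=True,  # assumed flag name; should enable meta-device init of the LoRA layers
)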
