
Why is loading LoRA weights so slow? #8953

Closed
zengjie617789 opened this issue Jul 24, 2024 · 18 comments

@zengjie617789

zengjie617789 commented Jul 24, 2024

I used diffusers to load LoRA weights, but it takes much too long to finish.
diffusers version: 0.29.2

I tested another version, diffusers 0.23.0 without peft installed, and the time was decent.

# pipe is the StableDiffusionXLPipeline created earlier
t1 = time.time()
pipe.load_lora_weights("/data/**/lora_weights/lcm-lora-sdxl/", weight_name="pytorch_lora_weights.safetensors")
print(f"load lcm lora weights cost: {time.time() - t1}")

[screenshots: timing output for the two diffusers versions]

And if I use an older version of diffusers, a lot of code needs to be modified, which is a lot of work.
Any help would be appreciated.

@tolgacangoz
Contributor

Hmm, you seem to be right. I observed similar 3-5x speed differences between 0.27<=diffusers<=0.29.1 with peft and 0.23<=diffusers<=0.26 without peft. As of diffusers==0.27, peft was made compulsory. peft really does seem to add overhead 🤔

@a-r-r-o-w
Member

cc @sayakpaul

@sayakpaul
Member

Cc: @BenjaminBossan

@BenjaminBossan
Member

Could you please provide a full code example so that I can try to reproduce these timings? Also, what version of PEFT do you have installed?

@zengjie617789
Author

zengjie617789 commented Jul 25, 2024

@BenjaminBossan The code is plain code that loads the model and runs inference; I load the LCM LoRA weights as a test, the same way as with other LoRAs. Code snippet below:

# model_path and device are defined earlier in the script
pipe = StableDiffusionXLPipeline.from_pretrained(model_path, torch_dtype=torch.float16, variant="fp16", use_safetensors=True).to(device)
t1 = time.time()
pipe.load_lora_weights("/data/***/models/lora_weights/lcm-lora-sdxl/", weight_name="pytorch_lora_weights.safetensors")
print(f"load lcm lora weights cost: {time.time() - t1}")

diffusers==0.25.1 peft==0.12.0
The gap in load time comes down to whether or not peft is installed.
[screenshots: timing output with and without peft]

@BenjaminBossan
Member

Thanks for providing more context. Since you use a private adapter, I changed the code a little bit to use one from the Hub:

import time

import torch
from diffusers import StableDiffusionXLPipeline

try:
    import peft
    print("peft is installed")
    print(peft.__version__)
    print(peft.__path__)
except ImportError:
    print("peft is not installed")

device = 0
model_path = "stabilityai/stable-diffusion-xl-base-1.0"

pipe = StableDiffusionXLPipeline.from_pretrained(
    model_path, torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to(device)
t1 = time.time()
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl", weight_name="pytorch_lora_weights.safetensors")
print(f"load lcm lora weights time: {time.time()- t1:.2f}s")

With this, I can confirm these numbers (averaged over 5 runs):

  • diffusers v0.23.0, no PEFT: ~0.7s
  • diffusers v0.23.0, PEFT v0.12.0: ~2.0s
  • diffusers v0.29.0, PEFT v0.12.0: ~2.0s

I investigated further why PEFT is slower. What I found is that with PEFT, when the LoRA layers are created (before the actual checkpoint is loaded), PEFT instantiates real LoRA weights. Without PEFT, empty weights are created (on the meta device).

Next, I tried turning off the use of empty weights to check whether that really makes the difference. However, when passing low_cpu_mem_usage=False to load_lora_weights, which to my understanding controls this behavior, I found that the value is not correctly passed down, so that here it is set to True again. I'm not sure whether that is intentional. To get it to work, I had to hard-code low_cpu_mem_usage=False in this line.

Long story short, after forcing this, I get very similar load times of ~1.7s. So in a sense this is a regression, as the PEFT backend in diffusers does not load empty weights. I'm not quite sure whether this is an oversight, a bug, or intended for some reason. @sayakpaul do you think this could be added for PEFT LoRA loading?
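
To illustrate the difference, here is a minimal sketch (plain PyTorch, not the actual diffusers/PEFT code path; the layer size is arbitrary) of why creating weights on the meta device is cheaper than instantiating real weights:

import time

import torch.nn as nn

def timed(fn):
    t0 = time.time()
    fn()
    return time.time() - t0

# Real initialization: memory is allocated and filled with random values.
t_real = timed(lambda: nn.Linear(4096, 4096))
# Meta initialization: only shapes and dtypes are recorded, no memory is touched.
t_meta = timed(lambda: nn.Linear(4096, 4096, device="meta"))
print(f"real init: {t_real:.4f}s, meta init: {t_meta:.4f}s")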

@sayakpaul
Member

We had brought it up a couple of times in the past, but as you can see the difference is not really that significant, so I am not sure about the value added here given the number of lines of code that would need to be added to PEFT.

@sayakpaul
Member

I also don’t understand what was incorrectly passed down for low_cpu_mem_usage in the function you mentioned. Do you mean the value was overridden when you passed low_cpu_mem_usage=False?

@BenjaminBossan
Member

We had brought it up a couple of times in the past, but as you can see the difference is not really that significant, so I am not sure about the value added here given the number of lines of code that would need to be added to PEFT.

Well, it's up to you and the other diffusers maintainers whether you think this is worth fixing or not. On the PEFT side, I think it could be a feature to allow passing an argument to inject_adapter_in_model that tells PEFT to create the weights only on the meta device; then this change would be easy to make on the diffusers side. I'll put it on the backlog and can notify you once the feature is added.
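
To sketch the general pattern (plain PyTorch here, not PEFT's actual internals; the shapes and the checkpoint dict are made up for illustration): the adapter layer would be created on the meta device and only materialized once the real weights are available.

import torch
import torch.nn as nn

# Create the LoRA-style layer without allocating or initializing any weights.
lora_A = nn.Linear(4096, 8, bias=False, device="meta")

# Later, once the checkpoint has been read (e.g. from a safetensors file),
# materialize the layer and copy the real weights into it.
checkpoint = {"weight": torch.randn(8, 4096)}  # stand-in for the loaded tensors
lora_A = lora_A.to_empty(device="cpu")
lora_A.load_state_dict(checkpoint)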

I also don’t understand what was incorrectly passed down for low_cpu_mem_usage in the function you mentioned. Do you mean the value was overridden when you passed low_cpu_mem_usage=False?

The value is not passed on in this line:

https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/loaders.py#L3234

It is set correctly in the kwargs but is not passed to load_lora_into_unet (not sure whether that is intentional). As this is v0.23.0, there is probably no action to take from this finding, even if it is an error.

@sayakpaul
Member

Ah, you are right. But we have moved on from this already, so I think we can ignore it for now.

On second thought, though, I think faster loading support would indeed be nice, especially for larger adapters. Please let us know when this has shipped in PEFT and we can make the necessary changes in diffusers.

@BenjaminBossan
Member

BenjaminBossan commented Jul 26, 2024

I did some quick-and-dirty experiments and created a draft PR based on them. There is indeed a speed-up for Llama3 8B in this test, although loading the base model is still the overall bottleneck (even with a hot cache and an M.2 SSD).

Overall, I therefore think the impact of this feature will not be huge, especially since this is a one-time cost, but implementing it is not trivial. I'll leave it at the draft PR stage for the time being while working on higher-priority items.
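
For anyone who wants to reproduce this kind of measurement, a rough sketch (the model and adapter IDs below are placeholders, and timings depend heavily on hardware and cache state):

import time

from peft import PeftModel
from transformers import AutoModelForCausalLM

base_id = "meta-llama/Meta-Llama-3-8B"   # placeholder: any base model
adapter_id = "some-user/llama3-8b-lora"  # placeholder: any LoRA adapter trained for that base

t0 = time.time()
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
t1 = time.time()
model = PeftModel.from_pretrained(model, adapter_id)
t2 = time.time()
print(f"base model load: {t1 - t0:.2f}s, adapter load: {t2 - t1:.2f}s")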

@sayakpaul
Member

This is what I had expected too. Thanks for quickly checking!

@zengjie617789
Author

Thank you all. I tested the versions as referred to above; the versions are as below:
[screenshots: installed package versions]

But the load time is still much longer:
[screenshot: timing output with peft installed]

If I test it without peft:
[screenshot: timing output without peft]

The NVIDIA card is an A100.
Obviously, the difference with and without peft is huge. And I wonder how to get the result of 0.7s vs 2s.

@BenjaminBossan
Member

I tested the versions as referred to above; the versions are as below:

Just to be clear, the test I mentioned is in a draft PR, so it is not available in PEFT yet. And even if you installed PEFT from my branch, it would still not work, because diffusers would also need to make some changes to opt into that feature. I will get back to that PR when I have a bit of extra time.

And I wonder how to get the result of 0.7s vs 2s.

These were based on another adapter ("latent-consistency/lcm-lora-sdxl"), since you seem to use a private one that I cannot load. Also, a lot depends on the hardware. When you run the snippet I posted above, what times do you get? Please run it a couple of times to warm up the cache and reduce the variance.
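
For example, a small sketch that averages over several warm runs (this assumes pipe is the pipeline created in the snippet above; unload_lora_weights is used to reset state between runs):

import statistics
import time

times = []
for _ in range(5):
    pipe.unload_lora_weights()  # remove the previously loaded adapter
    t0 = time.time()
    pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl", weight_name="pytorch_lora_weights.safetensors")
    times.append(time.time() - t0)

# Skip the first (cold-cache) run when averaging.
print(f"warm load time: {statistics.mean(times[1:]):.2f}s (stdev {statistics.stdev(times[1:]):.2f}s)")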


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Sep 14, 2024
@BenjaminBossan
Member

not stale, huggingface/peft#1961 is being worked on

@BenjaminBossan BenjaminBossan removed the stale Issues that haven't received updates label Sep 16, 2024
@yiyixuxu yiyixuxu added the peft label Sep 20, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Oct 15, 2024
@a-r-r-o-w a-r-r-o-w removed the stale Issues that haven't received updates label Oct 15, 2024
@sayakpaul
Member

With #9510 and the latest versions of peft and transformers (installed from main as of now), this should be significantly improved. So, I am going to close this. Feel free to reopen.
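
For reference, a sketch of how the faster path might be opted into once those versions are installed (the low_cpu_mem_usage argument name on load_lora_weights is an assumption based on the discussion above; check the release notes of the installed versions):

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")
pipe.load_lora_weights(
    "latent-consistency/lcm-lora-sdxl",
    weight_name="pytorch_lora_weights.safetensors",
    low_cpu_mem_usage=True,  # assumed flag name; should enable meta-device init of the LoRA layers
)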
