
Running with 12GB RAM (not VRAM)? #14

Open · GXcells opened this issue Aug 15, 2024 · 14 comments

GXcells commented Aug 15, 2024

Is there a way to run these models with 12 GB of RAM?
With fp8 models it works, but with GGUF models it always fails.

city96 (Owner) commented Aug 15, 2024

You mean CPU inference? I'll mark this as a duplicate of #10 and track them as one feature request.

city96 closed this as completed Aug 15, 2024
GXcells (Author) commented Aug 15, 2024

No, not CPU: GPU inference, but with low RAM (I have 16 GB of VRAM so that is fine, but the RAM is problematic).

city96 (Owner) commented Aug 15, 2024

Ah right, I see what you mean. I think once the reload bug is fixed the model should take up less memory, although the T5 text encoder will still take up a lot of RAM, so until we can get GGUF quants of that working it probably won't work. See #5.

I assume the regular ComfyUI flux model doesn't work for you either without OOMing or swapping, right?

GXcells (Author) commented Aug 15, 2024

Actually, it seems the model does not load into my VRAM but into my RAM.
Not sure why.

GXcells (Author) commented Aug 15, 2024

It happens during the gguf_sd_loader function. It seems the model is first loaded into RAM and then transferred to VRAM?
Is there a way to do this directly in VRAM?

city96 (Owner) commented Aug 15, 2024

That's expected. We can technically load into VRAM directly, but there's no easy way to check how much free VRAM we have/need, since we don't know the size of the checkpoint until we load it (at least I don't think that's easy to check? GGUF can use mmap, so it might be possible to lazy-load them as well).
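For context, the kind of check I mean would look roughly like this (a hypothetical sketch that uses the on-disk file size as a stand-in for the loaded size, which is not what the node currently does):

```python
import os

import torch

def pick_load_device(ckpt_path, headroom_bytes=1 << 30):
    # Rough guess: treat the on-disk size as a proxy for how much memory the
    # checkpoint will need once loaded (not exact, especially for quantized files).
    file_size = os.path.getsize(ckpt_path)
    # torch.cuda.mem_get_info() returns (free, total) bytes for the current device.
    free_vram, _total = torch.cuda.mem_get_info()
    # Only load straight into VRAM if it fits with roughly 1 GB of headroom.
    return "cuda" if file_size + headroom_bytes < free_vram else "cpu"
```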

city96 reopened this Aug 15, 2024
city96 (Owner) commented Aug 16, 2024

@GXcells Okay so I think I got numpy's mmap to play ball and load it directly onto the GPU, or at least reduce memory pressure by a lot. Could you git pull and retest?
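Roughly the idea, in simplified form (an illustrative sketch, not the actual gguf_sd_loader code):

```python
import numpy as np
import torch

def mmap_tensor_to_gpu(path, offset, shape, dtype=np.float16, device="cuda"):
    # Memory-map the raw tensor bytes instead of reading the whole file into RAM.
    arr = np.memmap(path, dtype=dtype, mode="r", offset=offset, shape=shape)
    # torch.tensor() copies the mapped data; the OS only faults in the pages
    # that actually get read and can drop them from the page cache afterwards,
    # so peak system-RAM usage stays much lower than a full read() would be.
    return torch.tensor(arr, device=device)
```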

Meshwa428 commented

Okay, so I have 12 GB of RAM (not VRAM). Will I be able to run this model? Has anyone tried it?

city96 (Owner) commented Aug 16, 2024

Minimum seems to be 13 GB of system RAM even with the crappy FP8 T5 model :(

Maybe if you close everything else and add some swap, it could manage.

GXcells (Author) commented Aug 16, 2024

> @GXcells Okay so I think I got numpy's mmap to play ball and load it directly onto the GPU, or at least reduce memory pressure by a lot. Could you git pull and retest?

Working like a charm, but now I am stuck with the T4 GPU, which throws a CUDA OOM when running the sd.load_diffusion_model_state_dict function, even with a 4-bit quantized model, in a version of Comfy that runs without a UI.

city96 (Owner) commented Aug 16, 2024

You're not using the CUDA override node with a single GPU, right?

GXcells (Author) commented Aug 16, 2024

I am using the "totoro4" branch from https://github.com/camenduru/ComfyUI so I can use it in a Jupyter notebook.

city96 (Owner) commented Aug 16, 2024

Hmm, not familiar with that. A T4 should have plenty of VRAM for that, unless there's a memory leak or a bug in whatever CUDA/numpy/etc. versions you have? Or maybe that repository changes the way the function works and tries to load the state dict that we return to CUDA? Not too sure; I might look into it, although there are other, more pressing things that need adding/fixing/optimizing for now.
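If you want to sanity-check that, something like this quick diagnostic (a hypothetical helper, not part of either repo) run on the state dict right after loading would show whether the tensors are being pushed to CUDA early:

```python
import collections

import torch

def summarize_state_dict_devices(state_dict):
    # Count how many tensors of the loaded state dict sit on each device; if
    # they already show up on "cuda" right after loading, something upstream
    # is moving them to the GPU earlier than this loader expects.
    counts = collections.Counter(
        t.device.type for t in state_dict.values() if torch.is_tensor(t)
    )
    return dict(counts)
```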

GXcells (Author) commented Aug 16, 2024

> pressing

Yes, a bit weird, because I could run the 4-bit quantized models on an RTX 3050 with 4 GB of VRAM thanks to your node in ComfyUI.
I'll post an issue on the Camenduru GitHub, but I also believe they will probably implement your node soon.
I will stay with fp8 models for the time being on the T4 GPU.
Thanks a lot for your super support :)
