
High cpu memory usage as bf16 model is auto loaded as fp32 #34743

Closed
2 of 4 tasks
Qubitium opened this issue Nov 15, 2024 · 5 comments · Fixed by #35067 · May be fixed by #34919

@Qubitium
Contributor

Qubitium commented Nov 15, 2024

System Info

Ubuntu 24.04
Transformers 4.46.2
Accelerate 1.1.1
Safetensors 0.4.5

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Unexpected ~2x CPU memory usage because a bf16 safetensors checkpoint is loaded as float32 on device=cpu.

Manually passing torch_dtype=torch.bfloat16 avoids the issue, but it should not be necessary since both model.config and the safetensors files already declare bfloat16.

Sample reproducing code:

import torch
from transformers import AutoModelForCausalLM
import psutil

# model is stored as bf16 safetensor
model_file = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_file)

process = psutil.Process()
memory_info = process.memory_info()
print(f"RSS (Resident Set Size): {memory_info.rss / 1024 / 1024:.2f} MB")
print(f"VMS (Virtual Memory Size): {memory_info.vms / 1024 / 1024:.2f} MB")

print(f"model config dtype is {model.config.torch_dtype}")
assert model.config.torch_dtype == torch.bfloat16

p = next(model.parameters())
print(f"model first parameter dtype: {p.dtype}, device: {p.device}")
assert p.device == torch.device("cpu")
assert p.dtype == torch.bfloat16

Code output:

Traceback (most recent call last):
  File "/GPTQModel/test.py", line 20, in <module>
    assert p.dtype == torch.bfloat16
           ^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
RSS (Resident Set Size): 5189.39 MB <----- High memory usage
VMS (Virtual Memory Size): 41335.09 MB
model config dtype is torch.bfloat16
model first parameter dtype: torch.float32, device: cpu. <----- Wrong dtype 

Expected behavior

Modify the above code to pass torch_dtype=torch.bfloat16 to from_pretrained and memory usage is normal/expected:

RSS (Resident Set Size): 603.85 MB <----- Expected memory usage
VMS (Virtual Memory Size): 40607.80 MB
model config dtype is torch.bfloat16
model first parameter dtype: torch.bfloat16, device: cpu

There are two related issues here:

  1. bfloat16 weights are wrongly inflated to float32, causing very high memory usage
  2. safetensors weights should be lazily loaded, so only around 600 MB of weights should end up resident

Manually passing torch_dtype=torch.bfloat16 to from_pretrained works around the issue.
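
For reference, a minimal sketch of that workaround (same checkpoint as above; the explicit torch_dtype is the only change):

import torch
from transformers import AutoModelForCausalLM

# Passing the dtype explicitly keeps the bf16 checkpoint from being up-cast to fp32 on load.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
)
assert next(model.parameters()).dtype == torch.bfloat16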

@Qubitium Qubitium added the bug label Nov 15, 2024
@Qubitium Qubitium changed the title from "2x cpu memory usage as bf16 model is auto loaded as fp32" to "High cpu memory usage as bf16 model is auto loaded as fp32" on Nov 15, 2024
@LysandreJik
Member

Hey @Qubitium, the model was indeed serialized as bf16, but here you're not specifying in which dtype you would like to load it.

We follow torch's default loading mechanism, which is to automatically load it in the default torch.dtype (here, fp32) so as to be compatible with all hardware and setups.

In order to update the dtype in which it should be loaded, please change this line:

- model = AutoModelForCausalLM.from_pretrained(model_file)
+ model = AutoModelForCausalLM.from_pretrained(model_file, torch_dtype=torch.bfloat16)

You can also use 'auto' so as to respect the dtype of the weights themselves:

- model = AutoModelForCausalLM.from_pretrained(model_file)
+ model = AutoModelForCausalLM.from_pretrained(model_file, torch_dtype='auto')

You can read more about this in the from_pretrained documentation, which I am pasting below:

[screenshot of the torch_dtype section of the from_pretrained docstring]
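
For example, a quick check (a sketch using the same checkpoint and psutil measurement as in the report) that "auto" picks up the checkpoint's bf16 dtype and the smaller memory footprint:

import psutil
import torch
from transformers import AutoModelForCausalLM

# With torch_dtype="auto", from_pretrained follows config.json / the checkpoint dtype
# instead of the torch default (fp32).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype="auto",
)
print(next(model.parameters()).dtype)  # expected: torch.bfloat16
print(f"RSS: {psutil.Process().memory_info().rss / 1024**2:.2f} MB")  # roughly the bf16 footprint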

@LysandreJik LysandreJik added the PyTorch and Performance labels Nov 15, 2024
@Qubitium
Contributor Author

Qubitium commented Nov 15, 2024

@LysandreJik It's 2024 and I would like to propose that the default float32 be modified. Please read the below with a light heart.

Reasons:

  • The default float32 looks more like a fail-safe default than a compatibility default in 2024.
  • The default float32 will OOM 99.9% of the world's desktops when loading on CPU, and 99.999% of the world's consumer/enterprise GPUs, when loading a 32B bfloat16 model (Qwen 2.5 Coder 32B as an example). As far as compatibility goes, I would argue the existing default does more harm than good and makes more systems non-compatible: fp32 would require ~120GB of CPU RAM/VRAM. Is 32B really a large model in 2024? It doesn't matter whether safetensors/lazy loading is used; when the first token is generated, all model layers are loaded onto the device.
  • AutoModelForCausalLM has auto in the name, but it's only auto sometimes. When? We don't know.
  • from_pretrained honors and reads model properties from config.json by default, but not the dtype in that same json.
  • The model maker publishes bf16, the user loads the model with AutoModelForCausalLM, and the API returns fp32 by default. Assuming the CPU/GPU device is compatible with config.dtype, why?
  • Fewer people know how to translate 32B parameters into RAM/VRAM usage, and even fewer know what bfloat16 vs float32 means, or how to read a config.json and extract the proper value. Make it easy for users with a better default.
  • There is little auto about dtype=auto: it reads from config.json first, then does auto. What does auto mean in this context if it reads from config?

Overall, accept the config.json dtype as truth unless there is an explicit override, or that default is genuinely incompatible with the GPU/CPU, i.e. when a device does not physically support the model-specified dtype (see the sketch after the quoted docstring below).

torch_dtype (`str` or `torch.dtype`, *optional*):
     Override the default `torch.dtype` and load the model under a specific `dtype`. The different options
     are:

     1. `torch.float16` or `torch.bfloat16` or `torch.float`: load in a specified
      `dtype`, ignoring the model's `config.torch_dtype` if one exists. If not specified
      - the model will get loaded in `torch.float` (fp32).

      2. `"auto"` - A `torch_dtype` entry in the `config.json` file of the model will be
      attempted to be used. If this entry isn't found then next check the `dtype` of the first weight in
      the checkpoint that's of a floating point type and use that as `dtype`. This will load the model
      using the `dtype` it was saved in at the end of the training. It can't be used as an indicator of how
      the model was trained. Since it could be trained in one of half precision dtypes, but saved in fp32.

      3. A string that is a valid `torch.dtype`. E.g. "float32" loads the model in `torch.float32`, "float16" loads in `torch.float16` etc.

      <Tip>

      For some models the `dtype` they were trained in is unknown - you may try to check the model's paper or
      reach out to the authors and ask them to add this information to the model's card and to insert the
      `torch_dtype` entry in `config.json` on the hub.

      </Tip>
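
As a rough illustration of the behavior proposed above (a purely hypothetical helper, not existing transformers code or API), the resolution order would be: explicit override, then the config.json dtype if the device can run it, then fp32 only as a last resort:

import torch

# Hypothetical sketch only; resolve_dtype is not a transformers function.
def resolve_dtype(config_dtype, override=None, device="cpu"):
    if override is not None:
        # 1. An explicit user override always wins.
        return override
    if config_dtype is not None:
        # 2. Trust the dtype the model was published with, unless the device
        #    genuinely cannot run it (e.g. bf16 on an older GPU).
        if (config_dtype == torch.bfloat16
                and device.startswith("cuda")
                and not torch.cuda.is_bf16_supported()):
            return torch.float16  # or float32, whichever fallback is preferred
        return config_dtype
    # 3. Fall back to fp32 only when nothing else is known.
    return torch.float32

print(resolve_dtype(torch.bfloat16))                          # torch.bfloat16
print(resolve_dtype(torch.bfloat16, override=torch.float32))  # torch.float32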

@LysandreJik
Member

Thanks for your feedback @Qubitium! If we were to change the default here we would do it when moving from major version 4 to 5, as it's a very significant change.

Something we can do right now however is to make the auto mechanism clearer as I believe it's exactly what you're looking for.

@stevhliu, would it be possible to make this much more visible in the docs? There are many areas where we could showcase "auto": in the quickstart with model loading, in the quantization docs, among others.

@stevhliu
Member

For sure, I'll open a PR to make dtype="auto" more visible!

In the next version of the docs, the "auto" dtype mechanism will also be easier and clearer to find (preview here).

@ArthurZucker
Collaborator

Completely agree with you @Qubitium on the motivations; we are kind of stuck with this because of how big a change it is.
As Lysandre said, let's init to "auto" in the future. I'll open a PR to keep track of this!
