
High cpu memory usage as bf16 model is auto loaded as fp32 #34743

Closed
2 of 4 tasks
Qubitium opened this issue Nov 15, 2024 · 5 comments · Fixed by #35067 · May be fixed by #34919

@Qubitium
Contributor

Qubitium commented Nov 15, 2024

System Info

Ubuntu 24.04
Transformers 4.46.2
Accelerate 1.1.1
Safetensors 0.4.5

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Unexpected ~2x CPU memory usage because a bf16 safetensors checkpoint is loaded as float32 on device=cpu.

Manually passing torch_dtype=torch.bfloat16 avoids the issue, but it should not be necessary since both model.config and the safetensors files already declare bfloat16.

Sample reproducing code:

import torch
from transformers import AutoModelForCausalLM
import psutil

# model is stored as bf16 safetensor
model_file = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_file)

process = psutil.Process()
memory_info = process.memory_info()
print(f"RSS (Resident Set Size): {memory_info.rss / 1024 / 1024:.2f} MB")
print(f"VMS (Virtual Memory Size): {memory_info.vms / 1024 / 1024:.2f} MB")

print(f"model config dtype is {model.config.torch_dtype}")
assert model.config.torch_dtype == torch.bfloat16

p = next(model.parameters())
print(f"model first parameter dtype: {p.dtype}, device: {p.device}")
assert p.device == torch.device("cpu")
assert p.dtype == torch.bfloat16

Code output:

Traceback (most recent call last):
  File "/GPTQModel/test.py", line 20, in <module>
    assert p.dtype == torch.bfloat16
           ^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
RSS (Resident Set Size): 5189.39 MB <----- High memory usage
VMS (Virtual Memory Size): 41335.09 MB
model config dtype is torch.bfloat16
model first parameter dtype: torch.float32, device: cpu. <----- Wrong dtype 

Expected behavior

Modify the above code to pass torch_dtype=torch.bfloat16 to from_pretrained and memory usage is normal/expected:

RSS (Resident Set Size): 603.85 MB <----- Expected memory usage
VMS (Virtual Memory Size): 40607.80 MB
model config dtype is torch.bfloat16
model first parameter dtype: torch.bfloat16, device: cpu

There are two related issues here:

  1. bfloat16 weights are wrongly inflated to float32, causing very high memory usage
  2. safetensors weights should be lazily loaded, so only around 600 MB of weights should end up resident

Manually passing torch_dtype=torch.bfloat16 to from_pretrained works around the issue.
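
For reference, a minimal sketch of that workaround (same checkpoint as above; the explicit torch_dtype is the only change):

import torch
from transformers import AutoModelForCausalLM

# Passing the dtype explicitly keeps the bf16 checkpoint from being up-cast to fp32 on load.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
)
assert next(model.parameters()).dtype == torch.bfloat16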

@Qubitium Qubitium added the bug label Nov 15, 2024
@Qubitium Qubitium changed the title from "2x cpu memory usage as bf16 model is auto loaded as fp32" to "High cpu memory usage as bf16 model is auto loaded as fp32" on Nov 15, 2024
@LysandreJik
Member

Hey @Qubitium, the model was indeed serialized as bf16, but here you're not specifying in which dtype you would like to load it.

We follow torch's default loading mechanism, which is to automatically load it in the default torch.dtype (here, fp32) so as to be compatible with all hardware and setups.

In order to update the dtype in which it should be loaded, please change this line:

- model = AutoModelForCausalLM.from_pretrained(model_file)
+ model = AutoModelForCausalLM.from_pretrained(model_file, torch_dtype=torch.bfloat16)

You can also use 'auto' so as to respect the dtype of the weights themselves:

- model = AutoModelForCausalLM.from_pretrained(model_file)
+ model = AutoModelForCausalLM.from_pretrained(model_file, torch_dtype='auto')

You can read more about this in the from_pretrained documentation, which I am pasting below:

[screenshot of the torch_dtype section of the from_pretrained docstring]
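
For example, a quick check (a sketch using the same checkpoint and psutil measurement as in the report) that "auto" picks up the checkpoint's bf16 dtype and the smaller memory footprint:

import psutil
import torch
from transformers import AutoModelForCausalLM

# With torch_dtype="auto", from_pretrained follows config.json / the checkpoint dtype
# instead of the torch default (fp32).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype="auto",
)
print(next(model.parameters()).dtype)  # expected: torch.bfloat16
print(f"RSS: {psutil.Process().memory_info().rss / 1024**2:.2f} MB")  # roughly the bf16 footprint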

@LysandreJik LysandreJik added the PyTorch and Performance labels Nov 15, 2024
@Qubitium
Contributor Author

Qubitium commented Nov 15, 2024

@LysandreJik It's 2024 and I would like to propose that the default float32 be modified. Please read the below with a light heart.

Reasons:

  • The default float32 looks more like a fail-safe default than a compatibility default in 2024.
  • The default float32 will OOM 99.9% of the world's desktops when loading on CPU, and 99.999% of the world's consumer/enterprise GPUs, when loading a 32B bfloat16 model (Qwen 2.5 Coder 32B as an example). As far as compatibility goes, I would argue the existing default does more harm than good and makes more systems non-compatible: fp32 would require ~120GB of CPU RAM/VRAM. Is 32B really a large model in 2024? It doesn't matter whether safetensors/lazy loading is used; when the first token is generated, all model layers are loaded onto the device.
  • AutoModelForCausalLM has auto in the name, but it's only auto sometimes. When? We don't know.
  • from_pretrained honors and reads model properties from config.json by default, but not the dtype in that same json.
  • The model maker publishes bf16, the user loads the model with AutoModelForCausalLM, and the API returns fp32 by default. Assuming the CPU/GPU device is compatible with config.dtype, why?
  • Fewer people know how to translate 32B parameters into RAM/VRAM usage, and even fewer know what bfloat16 vs float32 means, or how to read a config.json and extract the proper value. Make it easy for users with a better default.
  • There is little auto about dtype=auto: it reads from config.json first, then does auto. What does auto mean in this context if it reads from config?

Overall, accept the config.json dtype as truth unless there is an explicit override, or that default is genuinely incompatible with the GPU/CPU, i.e. when a device does not physically support the model-specified dtype (see the sketch after the quoted docstring below).

torch_dtype (`str` or `torch.dtype`, *optional*):
     Override the default `torch.dtype` and load the model under a specific `dtype`. The different options
     are:

     1. `torch.float16` or `torch.bfloat16` or `torch.float`: load in a specified
      `dtype`, ignoring the model's `config.torch_dtype` if one exists. If not specified
      - the model will get loaded in `torch.float` (fp32).

      2. `"auto"` - A `torch_dtype` entry in the `config.json` file of the model will be
      attempted to be used. If this entry isn't found then next check the `dtype` of the first weight in
      the checkpoint that's of a floating point type and use that as `dtype`. This will load the model
      using the `dtype` it was saved in at the end of the training. It can't be used as an indicator of how
      the model was trained. Since it could be trained in one of half precision dtypes, but saved in fp32.

      3. A string that is a valid `torch.dtype`. E.g. "float32" loads the model in `torch.float32`, "float16" loads in `torch.float16` etc.

      <Tip>

      For some models the `dtype` they were trained in is unknown - you may try to check the model's paper or
      reach out to the authors and ask them to add this information to the model's card and to insert the
      `torch_dtype` entry in `config.json` on the hub.

      </Tip>
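
As a rough illustration of the behavior proposed above (a purely hypothetical helper, not existing transformers code or API), the resolution order would be: explicit override, then the config.json dtype if the device can run it, then fp32 only as a last resort:

import torch

# Hypothetical sketch only; resolve_dtype is not a transformers function.
def resolve_dtype(config_dtype, override=None, device="cpu"):
    if override is not None:
        # 1. An explicit user override always wins.
        return override
    if config_dtype is not None:
        # 2. Trust the dtype the model was published with, unless the device
        #    genuinely cannot run it (e.g. bf16 on an older GPU).
        if (config_dtype == torch.bfloat16
                and device.startswith("cuda")
                and not torch.cuda.is_bf16_supported()):
            return torch.float16  # or float32, whichever fallback is preferred
        return config_dtype
    # 3. Fall back to fp32 only when nothing else is known.
    return torch.float32

print(resolve_dtype(torch.bfloat16))                          # torch.bfloat16
print(resolve_dtype(torch.bfloat16, override=torch.float32))  # torch.float32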

@LysandreJik
Member

Thanks for your feedback @Qubitium! If we were to change the default here we would do it when moving from major version 4 to 5, as it's a very significant change.

Something we can do right now however is to make the auto mechanism clearer as I believe it's exactly what you're looking for.

@stevhliu, would it be possible to make this much more visible in the docs? There are many areas where we could showcase "auto": in the quickstart with model loading, in the quantization docs, among others.

@stevhliu
Member

For sure, I'll open a PR to make dtype="auto" more visible!

In the next version of the docs, the "auto" dtype mechanism will also be easier and clearer to find (preview here).

@ArthurZucker
Collaborator

Completely agree with you @Qubitium on the motivations; we are kind of stuck with this because of how big a change it is.
As Lysandre said, let's init to "auto" in the future. I'll open a PR to keep track of this!
