Llama 13b Model in full precision loading into system RAM, taking 30 minutes to fully load & runtime error #327

Closed
official-elinas opened this issue Mar 15, 2023 · 8 comments
Labels: bug (Something isn't working), stale


official-elinas commented Mar 15, 2023

Describe the bug

When I try to load llama-13b-hf in full precision with the command below, it takes approximately 30 minutes to load into system RAM, then about 70 seconds to transfer into VRAM. The model is on an NVMe drive, and other models such as OPT load immediately. I am splitting the model between two GPUs, and this was working fine not long ago.

python server.py --listen --model llama-13b --gpu-memory 21 13

Also, I should note, forcing the --bf16 flag does not help.
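
For reference, the webui's --gpu-memory flag corresponds (roughly) to a per-device max_memory map passed to transformers' from_pretrained. A minimal sketch of the equivalent call follows; the model path and memory caps are illustrative assumptions, not the webui's exact code:

import torch
from transformers import AutoModelForCausalLM

# Rough equivalent of `--gpu-memory 21 13` (plus `--bf16`); illustrative only.
model = AutoModelForCausalLM.from_pretrained(
    "models/llama-13b",                    # assumed local model path
    device_map="auto",                     # let accelerate shard layers across devices
    max_memory={0: "21GiB", 1: "13GiB"},   # per-GPU caps, as set by --gpu-memory
    torch_dtype=torch.bfloat16,            # what --bf16 requests
)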

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

Run the command above in a fresh venv with the updated transformers branch (latest requirements.txt and project code pulled).

Screenshot

See below.

Logs

$ python server.py --listen --model llama-13b --gpu-memory 21 13
Loading llama-13b...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 41/41 [01:11<00:00,  1.74s/it]
Loaded the model in 1880.04 seconds.

As you can see, the model loads extremely slowly into system memory, then empties it completely and loads into VRAM. No logging is produced after Loading llama-13b...; it simply hangs until the model has finished loading into system memory.

I ran another test with the --bf16 flag and it loaded much faster, though still slowly. (I also raised the max memory, but I'm not sure that did anything, as usage is still far below the maximum for my GPUs.)

RTX 3090: 13.1 GB
RTX A4000: 12.3 GB

$ python server.py --listen --model llama-13b --bf16 --gpu-memory 23 15
Loading llama-13b...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 41/41 [01:12<00:00,  1.76s/it]
Loaded the model in 750.97 seconds.

Additional Logs

Running the model after loading produces this traceback:

Traceback (most recent call last):
  File "G:\llm-webui\installer_files\env\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "G:\llm-webui\installer_files\env\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "G:\llm-webui\text-generation-webui\modules\callbacks.py", line 64, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "G:\llm-webui\text-generation-webui\modules\text_generation.py", line 191, in generate_with_callback
    shared.model.generate(**kwargs)
  File "G:\llm-webui\installer_files\env\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "G:\llm-webui\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1452, in generate
    return self.sample(
  File "G:\llm-webui\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2504, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
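
The failure occurs in the sampling step itself: torch.multinomial rejects probability tensors containing inf or NaN, which is exactly what corrupted weights yield after softmax. A minimal illustration (not the webui's code) that reproduces the same error:

import torch

logits = torch.tensor([[0.1, float("nan"), 0.3]])  # NaN, as produced by bad weights
probs = torch.softmax(logits, dim=-1)              # the NaN propagates through softmax
torch.multinomial(probs, num_samples=1)            # RuntimeError: probability tensor contains either `inf`, `nan` or element < 0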

System Info

Windows 10
i7-5960X
64 GB RAM
RTX 3090 24GB 
RTX A4000 16GB
Samsung 980 NVMe (models on this drive)

Note: I have edited this issue a few times to add more detail and the methods I tried.

@oobabooga (Owner)

I have no idea what could be causing this. Some people have had better luck using WSL than Windows itself.


JousterL commented Mar 15, 2023

Just chiming in that I'm experiencing the same issue with 13b. It stalls for ~15 minutes after the bitsandbytes CUDA SETUP phase, with no significant disk utilization and only one CPU core running at full load, before finally moving on to the shard loading. I get the same error on attempting to request a response: the `inf`, `nan` or element < 0 RuntimeError.

Win 11
Ryzen 9 7950X
64 GB RAM
RTX 4090 24GB
Model is on a 2TB WD Black spinning rust.


Simon1V commented Mar 15, 2023

I am experiencing the same issue:
Windows 10
i7 9700k
RTX A6000, GTX 1660Super
32GB RAM
Evo 970 NVME drive

@YukiSakuma

I'm experiencing the same issue on Windows 10 with an RTX 3060, but with the LLaMA 7B model. The installer is located on an SSD, yet it takes 18 minutes to get past the CUDA setup.

Starting the web UI...

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: Loading binary A:\oobabooga-windows\oobabooga\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
Loading LLaMA-7B...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 33/33 [00:10<00:00,  3.01it/s]
Loaded the model in 1113.38 seconds.
Loading the extension "gallery"... Ok.
A:\oobabooga-windows\oobabooga\installer_files\env\lib\site-packages\gradio\deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.

I also get the inf/nan error when trying to generate:
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

I followed the bitsandbytes instructions in this guide:
#147 (comment)

If I uncheck do_sample, the error disappears, but the generated response is just ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇
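
That behavior is consistent with corrupted logits: with do_sample off, generation picks the argmax of the logits instead of sampling, so torch.multinomial is never called and nothing raises, but the selected token ids are still meaningless and decode to the unknown-token glyph. A sketch of the difference (illustrative, not the actual generation code):

import torch

logits = torch.tensor([[0.1, float("nan"), 0.3]])
next_token = torch.argmax(logits, dim=-1)  # no error is raised, but the chosen id is garbage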

@YukiSakuma

Update: I redownloaded the weights using HFv2 and now the model loads quickly, as expected. It seems the weights were the problem.


demoergo commented Mar 20, 2023

> Update: I redownloaded the weights using HFv2 and now the model loads quickly, as expected. It seems the weights were the problem.

I had the exact same problem. My weights were converted from the original release using convert_llama_weights_to_hf.py. The weights from Decapoda Research are different and fixed the issue. (Thanks!)
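
For anyone re-converting, the script mentioned above ships with transformers; an invocation along these lines (the paths here are placeholders) produces HF-format weights:

python convert_llama_weights_to_hf.py --input_dir /path/to/original/llama --model_size 13B --output_dir models/llama-13b-hf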


I can also confirm: I re-downloaded the weights (from the same torrent) and now it works fine. Weird.

@github-actions (bot)

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.
