Support for LLaMA models #147
The models are not public yet, unfortunately. You have to request access.
Psst. Somebody leaked them https://twitter.com/Teknium1/status/1631322496388722689
Of course, the weights themselves are closed. But the code from the repository should be enough to add support. And where the end users will download the weights from is their problem.
Done ea5c5eb
Getting a CUDA out-of-memory error. I assume low-memory support isn't included yet?
This isn't going through Hugging Face yet, so it doesn't have access to 8-bit and CPU offloading. The 7B model uses 14963 MiB of VRAM on my machine. Reducing the max_seq_len parameter from 2048 to 512 brings this down to 13843 MiB.
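For anyone wondering where that knob lives: in Meta's reference code the context length is passed into ModelArgs when the checkpoint is loaded, and the webui's modules/LLaMA.py wraps the same loader. A minimal sketch, closely following the upstream example.py (the exact function in modules/LLaMA.py may differ):

```python
import json
from pathlib import Path

import torch
from llama import LLaMA, ModelArgs, Tokenizer, Transformer  # facebookresearch/llama


def load_llama(ckpt_dir: str, tokenizer_path: str,
               max_seq_len: int = 512, max_batch_size: int = 1) -> LLaMA:
    """Load a single-shard LLaMA checkpoint; assumes model parallelism is already initialized."""
    # A smaller max_seq_len shrinks the preallocated KV cache, which is where
    # the ~1 GiB VRAM saving mentioned above comes from.
    ckpt_path = sorted(Path(ckpt_dir).glob("*.pth"))[0]
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    params = json.loads((Path(ckpt_dir) / "params.json").read_text())

    model_args = ModelArgs(max_seq_len=max_seq_len, max_batch_size=max_batch_size, **params)
    tokenizer = Tokenizer(model_path=tokenizer_path)
    model_args.vocab_size = tokenizer.n_words

    torch.set_default_tensor_type(torch.cuda.HalfTensor)  # fp16 weights on the GPU
    model = Transformer(model_args)
    torch.set_default_tensor_type(torch.FloatTensor)
    model.load_state_dict(checkpoint, strict=False)
    return LLaMA(model, tokenizer)
```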
I get a bunch of dependency errors when launching despite setting up LLaMA beforehand (definitely my own fault and probably because of a messed-up conda environment)
etc. Any chance you could include these in the default webui requirements, assuming they aren't too heavy?
@musicurgy did you try
Yeah, after a bit of a struggle I ended up getting it working by just copying all the dependencies into the webui folder. So far the model is really interesting. Thanks for supporting it.
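For what it's worth, the dependencies in question are most likely the ones from the upstream facebookresearch/llama requirements.txt, roughly torch, fairscale, fire, and sentencepiece (an assumption, since the actual error list isn't shown above), so `pip install fairscale fire sentencepiece` inside the webui's conda environment should cover them.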
Awesome stuff. I'm able to load LLaMA-7b but trying to load LLaMA-13b crashes with the error:
Anyone reading this: you can get past the issue above by changing the world_size variable in setup_model_parallel() inside modules/LLaMA.py (see the sketch below). My issue now is that I'm running out of VRAM. I'm running dual 3090s and should be able to load the model if it's split among the cards...
Is there a parameter I need to pass to oobabooga to tell it to split the model among my two 3090 GPUs?
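The change being referred to is presumably along these lines: the reference setup_model_parallel() reads LOCAL_RANK and WORLD_SIZE from environment variables that torchrun would normally set, so when the webui launches a single plain Python process they come back as -1 and initialization fails. A hedged sketch of the workaround, assuming the function in modules/LLaMA.py mirrors the upstream example:

```python
import os
from typing import Tuple

import torch
from fairscale.nn.model_parallel.initialize import initialize_model_parallel


def setup_model_parallel() -> Tuple[int, int]:
    # Fall back to a single-process setup when the torchrun env vars are absent.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    torch.distributed.init_process_group("nccl", rank=local_rank, world_size=world_size)
    initialize_model_parallel(world_size)
    torch.cuda.set_device(local_rank)
    torch.manual_seed(1)  # the seed must match across ranks
    return local_rank, world_size
```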
Try
Sorry, super dumb question, but do I pass this to start-webui.sh? Like
Ah, that should work, but if not, edit the file and add this at the end of
Thanks friend! I was able to get it with
For LLaMA, the correct way is to change the global variables inside LLaMA.py like @generic-username0718 did, but I am not very familiar with the parameters yet.
I was starting to question my sanity... I think I was accidentally loading opt-13b instead... Sorry if I got people's hopes up. I'm still trying to split the model. Edit: Looks like they've already asked this here: meta-llama/llama#88
Bad news for the guys hoping to run 13B:
LLaMA-7B can be run on CPU instead of GPU using this fork of the LLaMA repo: https://github.com/markasoftware/llama-cpu. To quote the author: "On a Ryzen 7900X, the 7B model is able to infer several words per second, quite a lot better than you'd expect!"
I sure did. Also, those os.environ settings don't seem to work.
I also get the same error with 13B.
Anyone else getting really poor results on 7B? I've tried many prompts and parameter variations and it generally ends up as mostly nonsense with lots of repetition. It might just be the model, but I saw some 7B output examples posted online that seemed way better than anything I was getting.
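One thing worth checking before blaming the model: LLaMA is a raw base model, so greedy or near-greedy decoding tends to loop. A hedged illustration of decoding settings that usually tame the repetition; the parameter names follow the Hugging Face generate() API (which the webui's preset sliders mirror), and the model path is hypothetical:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/llama-7b-hf")  # hypothetical local path
model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b-hf")

inputs = tokenizer("Building a website can be done in 10 simple steps:", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,           # greedy decoding is what usually produces the loops
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.15,  # discourages verbatim repetition
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```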
Is it possible to reduce computation precision on CPU? Down to 8-bit?
Someone made a fork of the LLaMA repo that apparently runs in 8-bit: Zero idea if it works or anything.
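On the CPU question specifically: PyTorch's dynamic quantization can drop the Linear layers to int8 at load time, which is the usual way to cut CPU memory use and bandwidth roughly in half versus fp16/fp32. A rough sketch, not specific to any of the forks mentioned here:

```python
import torch
import torch.nn as nn


def quantize_linears_to_int8(model: nn.Module) -> nn.Module:
    # Dynamic quantization stores nn.Linear weights as int8 and dequantizes
    # activations on the fly; it only applies to CPU inference.
    return torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)


# Usage (hypothetical): model = quantize_linears_to_int8(load_llama_on_cpu(...))
```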
I'm getting the following error when trying to run the 7B model on my RTX 3090, can someone help?
@hopto-dot Go here and run the pip command for the CUDA 11.7 build on your OS: https://pytorch.org/get-started/locally/
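For reference, the selector on that page currently produces something along the lines of `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117` for the CUDA 11.7 build; check the page itself, since the exact command changes with PyTorch releases.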
I'm getting the below error. I'm fairly certain I have the latest weights.
===================== CUDA SETUP: Loading binary C:\Users\PC\miniconda3\envs\textgen\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
@gadaeus
This helped, thank you
Does anyone have a guess as to what may be causing this error?
See #445 (comment)
Does anybody know where I should start to go about fixing the below errors? Running on an M1 Max with 64 GB. Attempted with the HFv2 model weights for LLaMA-7B.
It looks like you don't have enough memory to load the model. Try adding
(adjust the value based on your GPU size)
@oobabooga there is a wrapper for llama.cpp now
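The comment doesn't say which wrapper, but llama-cpp-python is one such binding, and its interface looks roughly like this (treat the package choice and model path as assumptions; llama.cpp-converted ggml weights are required):

```python
# pip install llama-cpp-python   (one of the available llama.cpp bindings)
from llama_cpp import Llama

# Path to a llama.cpp-converted 7B checkpoint; hypothetical location.
llm = Llama(model_path="./models/llama-7b-ggml.bin", n_ctx=512)

output = llm(
    "Question: What is the capital of France? Answer:",
    max_tokens=32,
    stop=["Question:"],  # stop generating when the model starts a new question
)
print(output["choices"][0]["text"])
```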
I've been following this tutorial trying to set up LLaMA 7B and eventually 13B, and I'm running into an issue I can't find referenced anywhere online regarding the process, pasted below. I'm running Windows 10, and I've already attempted the fix detailed HERE. I've tried running the command in Anaconda Prompt (miniconda3) both normally and as an administrator, and I get the same error. I even tried running it in normal CMD, which didn't lead anywhere. Let me know if I'm posting this in the wrong place, formatting incorrectly, or committing some other faux pas; I can't figure out how to do the clean-looking 'snippet clipboard content' formatting that people are using here.
(base) D:\AI Image Things\text-generation-webui>
I recommend using the new one-click installer for native Windows installation: https://github.com/oobabooga/text-generation-webui#one-click-installers
I'll give that a shot! Thanks for the work you put in on all this - the accessibility of this wave of AI tech is seriously appreciated.
@guruace HOW? I don't understand what I am doing wrong. I get the following error.
When running
@tillhanke this is an out-of-memory error. It means that not enough RAM and/or VRAM was found in the system to load the model.
So, LLaMA works, I guess this issue can be closed 👏🏼
File "/home/yinan/text-generation-webui/modules/text_generation.py", line 31, in encode
@Modplay you haven't downloaded or set the model yet
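In other words, the traceback comes from encode() being called while no model has been loaded. A minimal illustration of the failure mode and the obvious guard, assuming the webui's usual modules.shared globals (shared.model and shared.tokenizer stay None until a model is selected):

```python
from modules import shared  # text-generation-webui keeps the loaded model/tokenizer here


def encode(prompt: str, max_length: int):
    # If no model has been selected in the UI (or via --model), the tokenizer
    # is still None and any call on it raises a NoneType error.
    if shared.tokenizer is None:
        raise RuntimeError("No model is loaded - pick one in the Model tab or pass --model.")
    return shared.tokenizer.encode(prompt, return_tensors="pt",
                                   truncation=True, max_length=max_length)
```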
Meta just released their LLaMA model family:
https://github.com/facebookresearch/llama
Can we get support for that?
They claim that the 13B model is better than the 175B GPT-3 model.
This is how to use the model in the web UI:
LLaMA TUTORIAL
-- oobabooga, March 9th, 2023