Add gpt-fast loader (pure PyTorch) #5180

Closed
oobabooga wants to merge 6 commits from the gpt-fast branch

Conversation

oobabooga (Owner) commented Jan 5, 2024

Reference: https://github.com/pytorch-labs/gpt-fast

The advantage of this backend is that it requires no wheels or compiled extensions. It is 100% PyTorch, so it should in principle work on any GPU backend (including AMD, Intel Arc, and Apple Metal).

Feedback is welcome.

Installation

It is necessary to upgrade to the latest PyTorch nightly and clone the gpt-fast repository:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121 --upgrade
git clone https://github.com/pytorch-labs/gpt-fast repositories/gpt-fast

The first command varies depending on your hardware; the appropriate version can be found here: https://pytorch.org/get-started/locally/
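
To confirm that the nightly build is the one actually being used (a generic check, not specific to this PR), you can run:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"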

Model conversion

python scripts/convert_hf_checkpoint.py --checkpoint_dir models/llama-2-7b-hf
python quantize.py --checkpoint_path models/llama-2-7b-hf/model.pth --mode int4
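
gpt-fast also ships an int8 weight-only mode; as an alternative (not part of the instructions above) for builds where the int4 kernel is unavailable, the same script can be run with:

python quantize.py --checkpoint_path models/llama-2-7b-hf/model.pth --mode int8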

Load a model

python server.py --model llama-2-7b-hf --loader "gpt-fast"

TODO

It is possible to use torch.compile() with this backend, which improves performance by a factor of around 2. I haven't added the option yet.
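
As a rough sketch of what that option could look like (purely illustrative; gpt-fast's own generate.py compiles its decode_one_token and prefill functions this way, and the maybe_compile wrapper below is a hypothetical name, not existing code):

import torch

def maybe_compile(decode_one_token, prefill, use_compile: bool):
    # Hypothetical helper: wraps gpt-fast's decoding functions in torch.compile
    # when a (not yet implemented) --compile option is enabled.
    if use_compile:
        # "reduce-overhead" enables CUDA graphs, which is where most of the
        # ~2x decoding speedup comes from.
        decode_one_token = torch.compile(decode_one_token, mode="reduce-overhead", fullgraph=True)
        # Compiling prefill with dynamic shapes avoids recompiling for every prompt length.
        prefill = torch.compile(prefill, dynamic=True)
    return decode_one_token, prefill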

lufixSch commented Jan 7, 2024

I tried running the quantization on my AMD RX 7900 XTX as described and it failed with the following output:

Loading model ...
Quantizing model weights for int4 weight-only affine per-channel groupwise quantization
linear: layers.0.attention.wqkv, in=4096, out=12288
Traceback (most recent call last):
  File "/data/linux_data/AI/LLM/WebUI/repositories/gpt-fast/quantize.py", line 605, in <module>
    quantize(args.checkpoint_path, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label)
  File "/data/linux_data/AI/LLM/WebUI/repositories/gpt-fast/quantize.py", line 552, in quantize
    quantized_state_dict = quant_handler.create_quantized_state_dict()
  File "/data/linux_data/AI/LLM/WebUI/.venvs/llm1100/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/linux_data/AI/LLM/WebUI/repositories/gpt-fast/quantize.py", line 416, in create_quantized_state_dict
    weight_int4pack, scales_and_zeros = prepare_int4_weight_and_scales_and_zeros(
  File "/data/linux_data/AI/LLM/WebUI/repositories/gpt-fast/quantize.py", line 351, in prepare_int4_weight_and_scales_and_zeros
    weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(weight_int32, inner_k_tiles)
  File "/data/linux_data/AI/LLM/WebUI/.venvs/llm1100/lib/python3.10/site-packages/torch/_ops.py", line 825, in __call__
    return self_._op(*args, **(kwargs or {}))
RuntimeError: _convert_weight_to_int4pack_cuda is not available for build.

Edit: Seems like I'm not the only one on ROCm with this error: pytorch-labs/gpt-fast#12 (comment)
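
A minimal probe (illustrative only, not part of gpt-fast or this PR; the function name and tensor shape are made up) that reproduces the failing call in isolation, so a given PyTorch build can be checked for the int4 packing kernel before quantizing a whole model:

import torch

def int4_pack_available() -> bool:
    # Same aten op that quantize.py calls; builds without the int4 CUDA/ROCm
    # kernel raise the RuntimeError shown in the traceback above.
    try:
        weight = torch.randint(0, 16, (4096, 4096), dtype=torch.int32, device="cuda")
        torch.ops.aten._convert_weight_to_int4pack(weight, 8)  # inner_k_tiles=8
        return True
    except (AssertionError, RuntimeError):
        return False

print(int4_pack_available())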

oobabooga (Owner, Author)

Thanks for the feedback @lufixSch. That's a bummer; it seems like every time a pure PyTorch loader appears, it doesn't really work on anything but NVIDIA.

rraulison

(/home/pai/text-generation-webui/installer_files/env) pai@localhost:~/text-generation-webui> python server.py --model Llama-2-7b-chat-hf --loader "gpt-fast"
10:24:51-722859 INFO     Starting Text generation web UI                                                                                                                          
10:24:51-728501 INFO     Loading Llama-2-7b-chat-hf                                                                                                                               
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.84it/s]
10:26:23-967663 INFO     LOADER: Transformers                                                                                                                                     
10:26:23-969554 INFO     TRUNCATION LENGTH: 4096                                                                                                                                  
10:26:23-970050 INFO     INSTRUCTION TEMPLATE: Custom (obtained from model metadata)                                                                                              
10:26:23-970568 INFO     Loaded the model in 92.24 seconds.                                                                                                                       
10:26:23-971045 INFO     Loading the extension "gallery"                                                                                                                          
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
10:26:41-825438 INFO     Deleted logs/chat/Assistant/20240107-10-01-41.json.                                                                                                      
Output generated in 284.83 seconds (1.23 tokens/s, 350 tokens, context 76, seed 1022857021)

Is it being loaded as a Transformers model in the web UI? It seems to work well on my 5600G AMD APU.

oobabooga (Owner, Author)

No, it should appear as LOADER: gpt-fast. You can force that with --loader "gpt-fast".

valiray commented Jan 26, 2024

By the looks of it, this is now NVIDIA-only for some reason:

Z:\ai\text-generation-webui-gpt-fast>python repositories/quantize.py --checkpoint_path models/llama-2-7b-hf/model.pth --mode int4
Loading model ...
Quantizing model weights for int4 weight-only affine per-channel groupwise quantization
linear: layers.0.attention.wqkv, in=4096, out=12288
Traceback (most recent call last):
  File "Z:\ai\text-generation-webui-gpt-fast\repositories\quantize.py", line 605, in <module>
    quantize(args.checkpoint_path, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label)
  File "Z:\ai\text-generation-webui-gpt-fast\repositories\quantize.py", line 552, in quantize
    quantized_state_dict = quant_handler.create_quantized_state_dict()
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Z:\ai\text-generation-webui-gpt-fast\installer_files\env\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "Z:\ai\text-generation-webui-gpt-fast\repositories\quantize.py", line 417, in create_quantized_state_dict
    weight.to(torch.bfloat16).to('cuda'), self.groupsize, self.inner_k_tiles
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Z:\ai\text-generation-webui-gpt-fast\installer_files\env\Lib\site-packages\torch\cuda\__init__.py", line 307, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

oobabooga (Owner, Author)

Yes, it does not seem to be universal after all, despite being pure interpreted PyTorch. In that case, I don't see the appeal of including and maintaining this loader.

oobabooga closed this on Jan 26, 2024
oobabooga deleted the gpt-fast branch on January 27, 2024