Add gpt-fast loader (pure PyTorch) #5180

Closed
oobabooga wants to merge 6 commits from the gpt-fast branch

Conversation

oobabooga (Owner) commented Jan 5, 2024

Reference: https://github.com/pytorch-labs/gpt-fast

The advantage of this backend is that it requires no wheels or compiled extensions. It is 100% PyTorch, so it should in principle work on any GPU backend (including AMD, Intel Arc, and Apple Metal).

Feedback is welcome.

Installation

It is necessary to upgrade to the latest PyTorch nightly and clone the gpt-fast repository:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121 --upgrade
git clone https://github.com/pytorch-labs/gpt-fast repositories/gpt-fast

The first command varies depending on your hardware; the appropriate version can be found here: https://pytorch.org/get-started/locally/
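
To confirm that the nightly build is the one actually being used (a generic check, not specific to this PR), you can run:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"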

Model conversion

python scripts/convert_hf_checkpoint.py --checkpoint_dir models/llama-2-7b-hf
python quantize.py --checkpoint_path models/llama-2-7b-hf/model.pth --mode int4
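
gpt-fast also ships an int8 weight-only mode; as an alternative (not part of the instructions above) for builds where the int4 kernel is unavailable, the same script can be run with:

python quantize.py --checkpoint_path models/llama-2-7b-hf/model.pth --mode int8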

Load a model

python server.py --model llama-2-7b-hf --loader "gpt-fast"

TODO

It is possible to use torch.compile() with this backend, which improves performance by a factor of around 2. I haven't added the option yet.
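
As a rough sketch of what that option could look like (purely illustrative; gpt-fast's own generate.py compiles its decode_one_token and prefill functions this way, and the maybe_compile wrapper below is a hypothetical name, not existing code):

import torch

def maybe_compile(decode_one_token, prefill, use_compile: bool):
    # Hypothetical helper: wraps gpt-fast's decoding functions in torch.compile
    # when a (not yet implemented) --compile option is enabled.
    if use_compile:
        # "reduce-overhead" enables CUDA graphs, which is where most of the
        # ~2x decoding speedup comes from.
        decode_one_token = torch.compile(decode_one_token, mode="reduce-overhead", fullgraph=True)
        # Compiling prefill with dynamic shapes avoids recompiling for every prompt length.
        prefill = torch.compile(prefill, dynamic=True)
    return decode_one_token, prefill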

lufixSch commented Jan 7, 2024

I tried running the quantization on my AMD RX 7900 XTX as described and it failed with the following output:

Loading model ...
Quantizing model weights for int4 weight-only affine per-channel groupwise quantization
linear: layers.0.attention.wqkv, in=4096, out=12288
Traceback (most recent call last):
  File "/data/linux_data/AI/LLM/WebUI/repositories/gpt-fast/quantize.py", line 605, in <module>
    quantize(args.checkpoint_path, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label)
  File "/data/linux_data/AI/LLM/WebUI/repositories/gpt-fast/quantize.py", line 552, in quantize
    quantized_state_dict = quant_handler.create_quantized_state_dict()
  File "/data/linux_data/AI/LLM/WebUI/.venvs/llm1100/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/linux_data/AI/LLM/WebUI/repositories/gpt-fast/quantize.py", line 416, in create_quantized_state_dict
    weight_int4pack, scales_and_zeros = prepare_int4_weight_and_scales_and_zeros(
  File "/data/linux_data/AI/LLM/WebUI/repositories/gpt-fast/quantize.py", line 351, in prepare_int4_weight_and_scales_and_zeros
    weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(weight_int32, inner_k_tiles)
  File "/data/linux_data/AI/LLM/WebUI/.venvs/llm1100/lib/python3.10/site-packages/torch/_ops.py", line 825, in __call__
    return self_._op(*args, **(kwargs or {}))
RuntimeError: _convert_weight_to_int4pack_cuda is not available for build.

Edit: Seems like I'm not the only one on ROCm with this error: pytorch-labs/gpt-fast#12 (comment)
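
A minimal probe (illustrative only, not part of gpt-fast or this PR; the function name and tensor shape are made up) that reproduces the failing call in isolation, so a given PyTorch build can be checked for the int4 packing kernel before quantizing a whole model:

import torch

def int4_pack_available() -> bool:
    # Same aten op that quantize.py calls; builds without the int4 CUDA/ROCm
    # kernel raise the RuntimeError shown in the traceback above.
    try:
        weight = torch.randint(0, 16, (4096, 4096), dtype=torch.int32, device="cuda")
        torch.ops.aten._convert_weight_to_int4pack(weight, 8)  # inner_k_tiles=8
        return True
    except (AssertionError, RuntimeError):
        return False

print(int4_pack_available())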

oobabooga (Owner, Author)

Thanks for the feedback @lufixSch. That's a bummer; it seems like every time a pure PyTorch loader appears, it doesn't really work on anything but NVIDIA.

rraulison

(/home/pai/text-generation-webui/installer_files/env) pai@localhost:~/text-generation-webui> python server.py --model Llama-2-7b-chat-hf --loader "gpt-fast"
10:24:51-722859 INFO     Starting Text generation web UI                                                                                                                          
10:24:51-728501 INFO     Loading Llama-2-7b-chat-hf                                                                                                                               
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.84it/s]
10:26:23-967663 INFO     LOADER: Transformers                                                                                                                                     
10:26:23-969554 INFO     TRUNCATION LENGTH: 4096                                                                                                                                  
10:26:23-970050 INFO     INSTRUCTION TEMPLATE: Custom (obtained from model metadata)                                                                                              
10:26:23-970568 INFO     Loaded the model in 92.24 seconds.                                                                                                                       
10:26:23-971045 INFO     Loading the extension "gallery"                                                                                                                          
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
10:26:41-825438 INFO     Deleted logs/chat/Assistant/20240107-10-01-41.json.                                                                                                      
Output generated in 284.83 seconds (1.23 tokens/s, 350 tokens, context 76, seed 1022857021)

Is it being loaded as a Transformers model in the web UI? It seems to work well on my 5600G AMD APU.

oobabooga (Owner, Author)

No, it should appear as LOADER: gpt-fast. You can force that with --loader "gpt-fast".

valiray commented Jan 26, 2024

By the looks of it, this is now NVIDIA-only for some reason:

Z:\ai\text-generation-webui-gpt-fast>python repositories/quantize.py --checkpoint_path models/llama-2-7b-hf/model.pth --mode int4
Loading model ...
Quantizing model weights for int4 weight-only affine per-channel groupwise quantization
linear: layers.0.attention.wqkv, in=4096, out=12288
Traceback (most recent call last):
  File "Z:\ai\text-generation-webui-gpt-fast\repositories\quantize.py", line 605, in <module>
    quantize(args.checkpoint_path, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label)
  File "Z:\ai\text-generation-webui-gpt-fast\repositories\quantize.py", line 552, in quantize
    quantized_state_dict = quant_handler.create_quantized_state_dict()
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Z:\ai\text-generation-webui-gpt-fast\installer_files\env\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "Z:\ai\text-generation-webui-gpt-fast\repositories\quantize.py", line 417, in create_quantized_state_dict
    weight.to(torch.bfloat16).to('cuda'), self.groupsize, self.inner_k_tiles
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Z:\ai\text-generation-webui-gpt-fast\installer_files\env\Lib\site-packages\torch\cuda\__init__.py", line 307, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

oobabooga (Owner, Author)

Yes, it does not seem to be universal after all, despite being pure interpreted PyTorch. In that case, I don't see the appeal of including and maintaining this loader.

oobabooga closed this on Jan 26, 2024
oobabooga deleted the gpt-fast branch on January 27, 2024