Add gpt-fast loader (pure PyTorch) #5180
Conversation
I tried running the quantization on my AMD RX 7900 XTX as described and it failed with the following output:
Loading model ...
Quantizing model weights for int4 weight-only affine per-channel groupwise quantization
linear: layers.0.attention.wqkv, in=4096, out=12288
Traceback (most recent call last):
File "/data/linux_data/AI/LLM/WebUI/repositories/gpt-fast/quantize.py", line 605, in <module>
quantize(args.checkpoint_path, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label)
File "/data/linux_data/AI/LLM/WebUI/repositories/gpt-fast/quantize.py", line 552, in quantize
quantized_state_dict = quant_handler.create_quantized_state_dict()
File "/data/linux_data/AI/LLM/WebUI/.venvs/llm1100/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data/linux_data/AI/LLM/WebUI/repositories/gpt-fast/quantize.py", line 416, in create_quantized_state_dict
weight_int4pack, scales_and_zeros = prepare_int4_weight_and_scales_and_zeros(
File "/data/linux_data/AI/LLM/WebUI/repositories/gpt-fast/quantize.py", line 351, in prepare_int4_weight_and_scales_and_zeros
weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(weight_int32, inner_k_tiles)
File "/data/linux_data/AI/LLM/WebUI/.venvs/llm1100/lib/python3.10/site-packages/torch/_ops.py", line 825, in __call__
return self_._op(*args, **(kwargs or {}))
RuntimeError: _convert_weight_to_int4pack_cuda is not available for build.
Edit: Seems like I'm not the only one on ROCm with this error: pytorch-labs/gpt-fast#12 (comment)
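For context, the failing kernel only ships in CUDA builds of PyTorch at the time of this discussion. A minimal sketch (not part of the original report) for checking which build flavor an environment is running:

```python
import torch

# CUDA wheels report a CUDA version and leave hip unset; ROCm wheels do the opposite.
# The _convert_weight_to_int4pack kernel from the traceback above is only provided
# by the CUDA build, which is why the call raises RuntimeError on ROCm.
print("torch:", torch.__version__)
print("cuda build:", torch.version.cuda)
print("hip build:", torch.version.hip)
```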
Thanks for the feedback @lufixSch. That's a bummer, it seems like every time some pure PyTorch loader appears, it doesn't really work on anything but NVIDIA.
Is it loaded like a Transformers model in the web UI? It seems to work well on my 5600G AMD APU.
No, it should show up as its own loader.
By the looks of it, this is now only for NVIDIA for some reason.
Yes, it does not seem to be universal after all, despite being pure interpreted PyTorch. In that case, I don't see the appeal of including and maintaining this loader.
Reference: https://github.com/pytorch-labs/gpt-fast
The advantage of this backend is that it requires no wheels or compiled extensions. It is 100% PyTorch, so it should in principle work on any GPU (including AMD, Intel Arc, and Metal).
Feedback is welcome.
Installation
It is necessary to upgrade to the latest PyTorch nightly and clone the gpt-fast repository. The PyTorch install command depends on your hardware; the right one can be found here: https://pytorch.org/get-started/locally/
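The exact commands are not preserved in this page. As a sketch, assuming the CUDA 12.1 nightly index and that gpt-fast is cloned into the web UI's repositories folder (the location used in the traceback above):

```bash
# Install a PyTorch nightly build. The index URL below is the CUDA 12.1 variant;
# pick the command matching your hardware from pytorch.org/get-started/locally/.
pip install --upgrade --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121

# Clone gpt-fast into the web UI's repositories folder.
git clone https://github.com/pytorch-labs/gpt-fast repositories/gpt-fast
```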
Model conversion
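The conversion commands themselves are not shown here. As a rough sketch based on the upstream gpt-fast README (script names and flags may have changed, and the model repo below is only an example):

```bash
cd repositories/gpt-fast

# Convert a downloaded Hugging Face checkpoint into gpt-fast's native format.
python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-chat-hf

# Optionally quantize the converted weights. int8 is shown here; --mode int4
# relies on the CUDA-only kernel discussed in the comments above.
python quantize.py --checkpoint_path checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth --mode int8
```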
Load a model
TODO
It is possible to use torch.compile() with this backend, which improves performance by a factor of around 2. I haven't added the option yet.
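As a rough illustration of the idea (upstream gpt-fast compiles its token-decoding step; the helper below is a hypothetical sketch, not the actual web UI integration):

```python
import torch

def compile_model(model: torch.nn.Module) -> torch.nn.Module:
    # Hypothetical helper: compile the model's forward pass once it is loaded.
    # mode="reduce-overhead" enables CUDA graphs, which is where most of the
    # roughly 2x decoding speedup reported for gpt-fast comes from.
    model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
    return model
```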