Converting gguf fp16 & bf16 to hf is not supported. #31762
Comments
I found that the PPL issue is related to Llama3 or llama.cpp. It doesn't happen with TinyLlama. I'll create another issue to discuss if needed.
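(Not part of the thread: a minimal sketch of the kind of perplexity check referred to above, computing PPL of a converted HF model over a short text sample; the `model` and `tokenizer` objects are assumed to be the dequantized checkpoint and its tokenizer.)

```python
import torch

def perplexity(model, tokenizer, text: str) -> float:
    # Illustrative PPL check: exp(mean negative log-likelihood) over one sample.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()
```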
It's easy to support GGUF FP16. Since BF16 is not supported by NumPy, my current workaround is to convert BF16 to FP16 using PyTorch, but it's not ideal to rely on PyTorch at this step. Reference: main...PenutChen:transformers:main

```python
def load_dequant_gguf_tensor(shape, ggml_type, data):
    if ggml_type == GGML_TYPES["F32"]:
        values = data
    elif ggml_type == GGML_TYPES["F16"]:
        values = data
    elif ggml_type == GGML_TYPES["BF16"]:
        import torch

        data_uint8 = data.view(np.uint8)
        tensor_uint8 = torch.from_numpy(data_uint8)
        values = tensor_uint8.view(torch.bfloat16).float().numpy()
```

Note that BF16 support requires modifying some code in gguf-py. Since the latest version of gguf-py from the llama.cpp repo doesn't work with the current HF integration (#31725), I modified the version from PyPI as follows:

```python
class GGMLQuantizationType(IntEnum):
    F32 = 0
    F16 = 1
    BF16 = 30
    # ...

GGML_QUANT_SIZES = {
    GGMLQuantizationType.F32: (1, 4),
    GGMLQuantizationType.F16: (1, 2),
    GGMLQuantizationType.BF16: (1, 2),
    # ...
}
```
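(Not from the thread: a possible NumPy-only alternative to the PyTorch round-trip above. Since BF16 is FP32 with the low 16 mantissa bits truncated, each 16-bit pattern can be widened into the high half of a uint32 and reinterpreted as float32. A minimal sketch, assuming the raw buffer is little-endian BF16:)

```python
import numpy as np

def bf16_buffer_to_fp32(data: np.ndarray) -> np.ndarray:
    # View the raw buffer as uint16 (one element per BF16 value), shift each
    # pattern into the upper 16 bits of a uint32, and reinterpret as float32.
    # Assumes a little-endian buffer.
    as_uint16 = data.view(np.uint16)
    as_uint32 = as_uint16.astype(np.uint32) << 16
    return as_uint32.view(np.float32)
```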
Hey @SunMarc, would you have some bandwidth to take a look at this? :)
Hey @PenutChen, thanks for your research! I think we should just support FP16 first, since supporting BF16 would require a new gguf release and the transformers GGUF integration isn't compatible with it yet. LMK what you think! If you have some time, would you like to open a PR? Otherwise, I will do it!
Hi @PenutChen,
Hi @Lin-xs, this might be related to the incorrect reversed permutation implementation when dequantizing models that use GQA. This should be fixed in the latest version of Transformers by #31788.
It works, thanks!
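(For context on the reverse permutation mentioned two comments above, and not part of the thread: llama.cpp's HF-to-GGUF conversion interleaves the rotary halves of the Q/K projection weights head by head, and undoing it for a GQA model has to use the number of KV heads for the K projection. A rough sketch of such a reverse permutation; see #31788 for the actual fix.)

```python
import numpy as np

def reverse_permute(weights: np.ndarray, n_head: int, n_head_kv: int | None = None) -> np.ndarray:
    # Undo the head-wise (rotary-half) permutation applied during conversion.
    # For GQA checkpoints the K projection must be reversed with the number of
    # KV heads rather than the number of query heads.
    if n_head_kv is not None and n_head != n_head_kv:
        n_head = n_head_kv
    dim = weights.shape[0] // n_head // 2
    w = weights.reshape(n_head, dim, 2, *weights.shape[1:])
    return w.swapaxes(2, 1).reshape(weights.shape)
```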
Let's keep this open for bf16. After we fix the compatibility issue with the new gguf version, we can add bf16. cc @PenutChen
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Who can help?
@SunMarc
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
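(The original reproduction steps aren't preserved in this extraction. A minimal sketch of what triggering the error looks like with the transformers GGUF integration; the repository id and GGUF filename below are placeholders, and any F16- or BF16-exported GGUF file hits the unsupported-type path.)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id and filename: any GGUF checkpoint exported as F16/BF16
# by llama.cpp's convert script reproduces the unsupported-type error.
model_id = "some-org/some-llama-gguf"
gguf_file = "model.f16.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)
```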
Expected behavior
Apart from the quantized types, only F32 is implemented; FP16 and BF16 are not yet supported.
fp16 error log:
bf16 error log:
I tried to add F16 to GGML_TYPES. I'm not sure if this is correct, but after converting to HF, the PPL is over 1000.
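(The exact change tried above isn't preserved. A sketch of what registering F16 might look like, reusing the names from the snippets quoted earlier in the thread; illustrative only, not the author's diff.)

```python
GGML_TYPES = {
    "F32": 0,
    "F16": 1,  # newly registered; quantized types omitted
    # ...
}

def load_dequant_gguf_tensor(shape, ggml_type, data):
    # F32 and F16 data is already in a NumPy-native dtype, so it can be
    # passed through without dequantization.
    if ggml_type in (GGML_TYPES["F32"], GGML_TYPES["F16"]):
        values = data
    else:
        raise NotImplementedError(f"GGML type {ggml_type} is not handled in this sketch")
    return values
```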