RWKV - Inference NF4 quantization broken, also Int8 quantization weirdness. #23848
Comments
Not sure quantization actually works for RWKV, which has quite a few custom layers. cc @younesbelkada
Hmm, I was able to do a 4bit finetuning with QLoRA last week, at the very least targeting key, value, and receptance in the attention and feed-forward blocks; it just seems like inference time is broken. I confirmed my tuned checkpoints worked fine for inference at full precision, and actually the forward call worked fine in 8bit in Eleuther's lm-evaluation-harness too now that I think of it; not sure about 4bit. It just seems to break when calling generate.
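A minimal sketch of the kind of QLoRA setup described above (a reconstruction, not the author's script; the model id, hyperparameters, and exact target module names are assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load RWKV in 4-bit NF4 for QLoRA fine-tuning (assumed configuration).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "RWKV/rwkv-raven-14b", quantization_config=bnb_config, device_map={"": 0}
)

# Adapters on the key/value/receptance projections, which exist in both the
# attention and feed-forward blocks of the RWKV implementation.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["key", "value", "receptance"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```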
Hi @iantbutler01
Could you elaborate more on the error?
In regards to int8: I've been testing on the development branch, which includes the code you've merged there, and it very much just produces broken output.
The call to generate raises an error. Adding a logits processor that just prints out the scores shows that they are NaN from the first token generated. If I then set do_sample=False, it only generates end-of-text, whereas the full-precision model generates correctly.
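A rough sketch of a logits processor that just prints the scores at each generation step, which is the kind of check described above (my illustration, not the author's code):

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class PrintScoresProcessor(LogitsProcessor):
    """Prints the raw scores of each generation step and whether any are NaN."""

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        print(scores, "any NaN:", torch.isnan(scores).any().item())
        return scores  # pass the scores through unchanged

# Hypothetical usage, mirroring the setup described in this thread:
# model.generate(inputs["input_ids"], logits_processor=LogitsProcessorList([PrintScoresProcessor()]))
```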
In regards to 4bit: rescaling during inference is broken for NF4 quantization with RWKV. If you try to run inference it fails with RuntimeError: result type Float can't be cast to the desired output type Byte.
And then if I turn rescaling off, it looks like there's a projection issue somewhere instead (RuntimeError: mat1 and mat2 shapes cannot be multiplied).
But yeah, I have this all reproducible in the script I've linked in the issue.
I see, thanks for sharing more details with me.

1- int8 RWKV seems to not work for you. From the snippet I am seeing, you are calling:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_id = "RWKV/rwkv-4-1b5-pile"
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map={"":0}).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

generation_config = GenerationConfig(max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)

question = "Hello my name is"
inputs = tokenizer(question, return_tensors="pt").to(0)

output_int8 = model.generate((inputs["input_ids"]), generation_config=generation_config)
print(tokenizer.decode(output_int8[0], skip_special_tokens=True))
```

and the model directly predicts the EOS token. The fix is to replace the model loading call above (see the sketch below).

2- RWKV + 4bit seems to be not supported for now. I will dig into that and let you know as soon as I have a fix.
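The replacement snippet itself did not survive extraction. Assuming the culprit is the redundant `.cuda()` call on a model that `device_map={"":0}` has already placed on the GPU (moving an 8-bit model again can corrupt its quantized weights), the working load would presumably look like:

```python
# Hedged reconstruction, not the verbatim fix from the original comment:
# load in 8-bit and let device_map handle placement; do not call .cuda() afterwards.
model = AutoModelForCausalLM.from_pretrained(
    "RWKV/rwkv-4-1b5-pile", load_in_8bit=True, device_map={"": 0}
)
```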
Okay, so 8bit is working fine now, thank you very much for the workaround! 4bit loaded with this configuration (the one from my repro script) is still failing unfortunately :(
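The configuration referred to above was dropped during extraction; from the follow-up below ("this is because you are using nested quantization") it was presumably a 4-bit NF4 setup with double quantization enabled, roughly like this sketch (parameter values are assumptions):

```python
import torch
from transformers import BitsAndBytesConfig

# Assumed reconstruction of the failing setup: NF4 4-bit with nested
# (double) quantization enabled, which is the combination reported to break.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```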
I see, this is because you are using nested quantization.
Yes, sorry about that. I had always intended this to be with double quant (that was in my original repro code), but I should have been more explicit when communicating it to you 👍 I tried it without double quantization and it does work.
No problem and thanks for double checking, will get back once I fix the issue with nested quantization!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I think it should not be closed @younesbelkada
Correct, it is known that RWKV double-quant 4bit inference does not work yet. Not sure if I can propose a fix anytime soon because of the rescale layers operation here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/rwkv/modeling_rwkv.py#L722
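For context, the rescaling referred to divides certain projection weights in place at inference time, roughly weight.div_(2 ** (layer_id // rescale_every)). On a quantized model the stored weights are integer-typed, so the float result of that in-place division cannot be written back, which matches the reported error. A minimal, self-contained illustration of the dtype clash (my own sketch, not the library code):

```python
import torch

# Stand-in for packed quantized weights: a uint8 tensor.
packed_weight = torch.randint(0, 255, (4, 4), dtype=torch.uint8)
try:
    # In-place true division produces a Float result, which cannot be stored
    # back into a Byte (uint8) tensor.
    packed_weight.div_(2)
except RuntimeError as err:
    print(err)  # result type Float can't be cast to the desired output type Byte
```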
System Info
transformers version: 4.30.0.dev0
I'm using the RWKV/rwkv-raven-14b model.

Rescaling is broken for NF4 quantization with RWKV:
RuntimeError: result type Float can't be cast to the desired output type Byte
Looks like torch cannot do the conversion in _div
And then if I turn rescaling off, it looks like there's a projection issue somewhere:
RuntimeError: mat1 and mat2 shapes cannot be multiplied (43x5120 and 1x13107200)
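The issue doesn't spell out how rescaling was turned off; one plausible way, assuming the rescale_every option that RwkvConfig exposes (0 skips the inference-time rescale), would be the following sketch. The repro script may disable it differently.

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Assumed approach: override rescale_every in the config before loading.
config = AutoConfig.from_pretrained("RWKV/rwkv-raven-14b")
config.rescale_every = 0  # 0 disables the layer rescaling at inference time
model = AutoModelForCausalLM.from_pretrained(
    "RWKV/rwkv-raven-14b", config=config, load_in_4bit=True, device_map={"": 0}
)
```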
Additionally, with Int8 quantization enabled, RWKV just outputs the endoftext token. I added a logits processor to output the scores and they're all NaN.
Who can help?
@sgugger
Reproduction
I have a repo with everything set up in generate.py to be able to quickly repro here:
https://github.com/iantbutler01/rwkv-raven-qlora-4bit-instruct/blob/main/generate.py
pip install -U git+https://github.com/huggingface/transformers.git
pip install -U git+https://github.com/huggingface/peft.git
pip install -U git+https://github.com/huggingface/accelerate.git
pip install --upgrade bitsandbytes
And then run python generate.py in a Python 3.10+ environment. Uncomment the 8bit or 4bit bnb config in the script as needed (a condensed sketch of the repro follows below).
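A condensed sketch of the kind of repro generate.py performs (my approximation, not the script itself; the exact configs live in the linked repo): load the Raven 14B checkpoint with one of the bnb configs and call generate, which is where the failures above show up.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "RWKV/rwkv-raven-14b"

# Toggle between the 4-bit and 8-bit variants, mirroring the commented-out configs in generate.py.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map={"": 0}
)

inputs = tokenizer("Hello my name is", return_tensors="pt").to(0)
output = model.generate(inputs["input_ids"], max_new_tokens=20)  # the NF4 path fails around here
print(tokenizer.decode(output[0], skip_special_tokens=True))
```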
Expected behavior
I would expect NF4-based quantization to work at all, and the Int8-quantized logits not to be NaN.