[FEATURE] Replace QBits with IPEX for CPU inference #450
Feel free to remove the QBits code if Intel has stopped QBits development and is now concentrating on IPEX. On another note, are SYCL and IPEX also competing projects at Intel?
They are compatible; SYCL is mostly used on Intel XPU, which is our next step. I am currently focusing on the CPU platform.
@jiqing-feng Heads-up: we are doing quite a bit of cleanup on the codebase right now for the 1.1 major release, so
Sure, thanks for the reminder. Can you give me an approximate time when it will be done?
@jiqing-feng We expect the changes/refactor to be completed by the end of Friday [Oct 25th].
@jiqing-feng The refactor of base.py is complete for the v1.1.0 release.
Thanks!
No problem. Feel free to make any changes you see fit.
Hi @Qubitium. I have almost finished the integration but hit a small precision issue. I checked the differences between AutoGPTQ and GPTQModel with the following script:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
device = "cuda"

old_model = AutoGPTQForCausalLM.from_quantized(model_id).to(device)
new_model = GPTQModel.from_quantized(model_id).to(device)

# Compare the packed zero points loaded by each library.
print(old_model.model.model.layers[0].self_attn.q_proj.qzeros)
print(new_model.model.model.layers[0].self_attn.q_proj.qzeros)

inputs = tokenizer("I am happy because", return_tensors="pt").to(device)
old_output = old_model.generate(**inputs, max_new_tokens=32)
new_output = new_model.generate(**inputs, max_new_tokens=32)

print("old model output")
print(tokenizer.decode(old_output[0]))
print("new model output")
print(tokenizer.decode(new_output[0]))
```

The outputs are:
The
@jiqing-feng Please check this commit: c80855e. The qzeros diff should be caused by the change where @qwopqwop200 fixed it. If you saved the model to
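For anyone hitting the same mismatch, here is a minimal sketch of what a v2 → v1 qzeros conversion could look like for 4-bit checkpoints; the offset constant, packing layout, and function name are assumptions rather than the project's actual conversion code:

```python
import torch

def convert_qzeros_v2_to_v1(qzeros: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Sketch: shift packed zero points from the v2 to the v1 convention.

    Assumptions (not the exact GPTQModel implementation): v1 stores each
    zero point offset by -1 relative to v2, zero points are packed
    32 // bits per int32 element, and no 4-bit field underflows when 1 is
    subtracted (reasonable for symmetric quantization, where zeros sit
    mid-range).
    """
    if bits != 4:
        raise NotImplementedError("sketch only covers 4-bit packing")
    # Subtracting 0x11111111 removes 1 from every packed 4-bit field at once.
    offset = torch.tensor(0x11111111, dtype=torch.int32, device=qzeros.device)
    return qzeros - offset
```

Under the same assumptions, a v1 → v2 conversion would add the constant instead of subtracting it.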
It might be a risk because both v1 and v2 models exist on the HF model hub; do you know how to detect whether a model is v1 or v2? The CPU path currently only supports v1, so we need to convert models from v2 to v1.
@jiqing-feng The compat issue is only with To prevent v1 +
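One plausible way to tell the two formats apart is to read the quantization config shipped with the checkpoint. A hedged sketch, assuming a quantize_config.json sits next to the weights and newer exports record a checkpoint_format key (both the file layout and the key name are assumptions here):

```python
import json
from pathlib import Path

def detect_checkpoint_format(model_dir: str) -> str:
    """Sketch: guess whether a local GPTQ checkpoint uses the v1 or v2 layout.

    Assumption: the quantize config records the packing format under a
    "checkpoint_format" key; older exports that predate v2 never wrote it,
    so a missing key is treated as v1 ("gptq").
    """
    cfg = json.loads((Path(model_dir) / "quantize_config.json").read_text())
    return cfg.get("checkpoint_format", "gptq")
```

A checkpoint reporting a v2 format would then be converted back to v1 (for example with the qzeros shift sketched above) before being handed to the CPU kernels.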
@jiqing-feng Here is the correct PR link for the commit in question: AutoGPTQ/AutoGPTQ#640
@jiqing-feng Thanks for the PR. Note added. On a side note, have you tested IPEX on a Xeon 6th gen (Granite Rapids) device? I see that Xeon 5th gen (Emerald Rapids) has a 50% AI improvement due to
I am currently developing this feature for 4th Gen Xeon. Yes, 5th and 6th Gen Xeon should show improvements; I will collect the data and share it with you once I finish the benchmark tests.
Hi @Qubitium. As QBits is no longer being developed, we are considering replacing QBits with IPEX in our open-source projects. AutoAWQ has already finished the conversion (see here); it could bring better usability and performance.
BTW, setup.py can only support CUDA right now, so I will add some parameters to enable a CPU setup. Please let me know if you have any concerns. Thanks!
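As a rough illustration of the kind of parameter meant here, a setup.py could gate the CUDA extension build behind an environment flag so a CPU-only install skips compiling it entirely; the BUILD_CUDA_EXT variable, extension name, and source paths below are all hypothetical, not the project's actual build script:

```python
# setup.py (sketch, not the project's actual build script)
import os
from setuptools import setup, find_packages

# Hypothetical opt-out flag: `BUILD_CUDA_EXT=0 pip install -v .` on a CPU-only box.
BUILD_CUDA_EXT = os.environ.get("BUILD_CUDA_EXT", "1") == "1"

ext_modules = []
cmdclass = {}
if BUILD_CUDA_EXT:
    from torch.utils.cpp_extension import BuildExtension, CUDAExtension

    ext_modules.append(
        CUDAExtension(
            name="gptqmodel_cuda_kernels",              # hypothetical extension name
            sources=["gptqmodel_ext/cuda_kernels.cu"],  # hypothetical source path
        )
    )
    cmdclass["build_ext"] = BuildExtension

setup(
    name="gptqmodel",
    packages=find_packages(),
    ext_modules=ext_modules,   # empty list -> pure-Python, CPU/IPEX-only install
    cmdclass=cmdclass,
)
```

With this kind of gate, a CPU/IPEX user avoids needing a CUDA toolchain at install time, while the default behavior for GPU users stays unchanged.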