
[FEATURE] Replace QBits to IPEX for cpu inference #450

Closed
jiqing-feng opened this issue Oct 23, 2024 · 16 comments
Labels
bug Something isn't working

Comments

@jiqing-feng
Contributor

Hi @Qubitium. As QBits is no longer being developed, we are considering replacing QBits with IPEX in our open-source integrations. AutoAWQ has already finished the conversion (see here); it brings better usability and performance.

BTW, setup.py currently only supports CUDA, so I will add some parameters to enable a CPU setup. Please let me know if you have any concerns. Thanks!
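
For reference, a minimal sketch of the basic IPEX CPU inference entry point (just an illustration; the model id is a placeholder, and the real integration replaces the QBits kernel calls rather than this high-level path):

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# ipex.optimize rewrites the eager model with CPU-friendly fused ops
# (AVX-512/AMX paths where the hardware supports them).
model = ipex.optimize(model.eval(), dtype=torch.bfloat16)

inputs = tokenizer("I am happy because", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0]))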

jiqing-feng added the bug (Something isn't working) label on Oct 23, 2024
@Qubitium
Collaborator

Feel free to remove the QBits code if Intel has stopped QBits development and is now concentrating on IPEX.

On another note, are SYCL and IPEX also competing projects at Intel?

@jiqing-feng
Contributor Author

> Feel free to remove the QBits code if Intel has stopped QBits development and is now concentrating on IPEX.
>
> On another note, are SYCL and IPEX also competing projects at Intel?

They are compatible; SYCL is mostly used on Intel XPU, which is our next step. I am currently focusing on the CPU platform.

Qubitium changed the title from "Replace qbits to IPEX so can enable CPU path" to "[FEATURE] Replace QBits to IPEX for cpu inference" on Oct 24, 2024
@Qubitium
Collaborator

Qubitium commented Oct 24, 2024

@jiqing-feng Heads-up warning: we are doing quite a bit of cleanup on the codebase right now for the 1.1 major release, so base.py will be a little unstable until then, but the kernel-level API should be stable.

@jiqing-feng
Contributor Author

> @jiqing-feng Heads-up warning: we are doing quite a bit of cleanup on the codebase right now for the 1.1 major release, so base.py will be a little unstable until then, but the kernel-level API should be stable.

Sure, thanks for the reminder. Can you give me an approximate time when it will be done?

@Qubitium
Collaborator

@jiqing-feng We expect the changes/refactor to be completed by end of Friday [Oct 25th].

@Qubitium
Collaborator

@jiqing-feng The refactor of base.py is complete for the v1.1.0 release.

@jiqing-feng
Contributor Author

jiqing-feng commented Nov 1, 2024

> @jiqing-feng The refactor of base.py is complete for the v1.1.0 release.

Thanks!
Just FYI, the current GPTQModel does not support setup without CUDA, so I need to change some basic files like setup.py and gptqmodel/utils/model.py, etc., to support a CPU setup. Please let me know if you have any concerns.
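
For example, a CPU-only install could be gated behind an environment variable so the CUDA extension build is skipped entirely (a rough sketch only; the flag name, extension name, and source paths here are placeholders, not the final PR code):

import os
from setuptools import setup

# Placeholder switch: BUILD_CUDA_EXT=0 skips compiling the CUDA kernels so the
# package can be installed on CPU-only machines that use the IPEX path instead.
BUILD_CUDA_EXT = os.environ.get("BUILD_CUDA_EXT", "1") == "1"

ext_modules = []
cmdclass = {}
if BUILD_CUDA_EXT:
    from torch.utils.cpp_extension import BuildExtension, CUDAExtension

    ext_modules.append(
        CUDAExtension(
            name="gptqmodel_cuda_kernels",                   # placeholder extension name
            sources=["gptqmodel_ext/cuda/cuda_kernels.cu"],  # placeholder source path
        )
    )
    cmdclass["build_ext"] = BuildExtension

setup(
    name="gptqmodel",
    ext_modules=ext_modules,
    cmdclass=cmdclass,
)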

@Qubitium
Collaborator

Qubitium commented Nov 1, 2024

> Just FYI, the current GPTQModel does not support setup without CUDA, so I need to change some basic files like setup.py and gptqmodel/utils/model.py, etc., to support a CPU setup. Please let me know if you have any concerns.

No problem. Feel free to make the changes you see fit.

@jiqing-feng
Contributor Author

Hi @Qubitium. I have almost finished the integration but ran into a small precision issue.

I checked the differences between auto-gptq and gptqmodel with the following script:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
device = "cuda"

# Load the same quantized checkpoint with both libraries.
old_model = AutoGPTQForCausalLM.from_quantized(model_id).to(device)
new_model = GPTQModel.from_quantized(model_id).to(device)

# Compare the packed zero-point tensors of the first attention projection.
print(old_model.model.model.layers[0].self_attn.q_proj.qzeros)
print(new_model.model.model.layers[0].self_attn.q_proj.qzeros)

inputs = tokenizer("I am happy because", return_tensors="pt").to(device)

old_output = old_model.generate(**inputs, max_new_tokens=32)
new_output = new_model.generate(**inputs, max_new_tokens=32)

print("old model output")
print(tokenizer.decode(old_output[0]))
print("new model output")
print(tokenizer.decode(new_output[0]))

The outputs are:

tensor([[2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
         2004318071],
        [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
         2004318071],
        [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
         2004318071],
        ...,
        [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
         2004318071],
        [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
         2004318071],
        [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
         2004318071]], device='cuda:0', dtype=torch.int32)
tensor([[-2004318072, -2004318072, -2004318072,  ..., -2004318072,
         -2004318072, -2004318072],
        [-2004318072, -2004318072, -2004318072,  ..., -2004318072,
         -2004318072, -2004318072],
        [-2004318072, -2004318072, -2004318072,  ..., -2004318072,
         -2004318072, -2004318072],
        ...,
        [-2004318072, -2004318072, -2004318072,  ..., -2004318072,
         -2004318072, -2004318072],
        [-2004318072, -2004318072, -2004318072,  ..., -2004318072,
         -2004318072, -2004318072],
        [-2004318072, -2004318072, -2004318072,  ..., -2004318072,
         -2004318072, -2004318072]], device='cuda:0', dtype=torch.int32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
old model output
<s> I am happy because I have found a new way to enjoy my favorite foods without feeling guilty or bloated.
I am excited to share my new discovery with you!
new model output
<s> I am happy because I have found a new way to enjoy my favorite foods without feeling guilty or bloated.
I am excited to share my new discovery with you!

The qzeros differ between the two models, but the generation results are the same. Is this part of the refactor? It would be great if you could share more details about it. Thanks!

@Qubitium
Collaborator

Qubitium commented Nov 5, 2024

@jiqing-feng Please check this commit. c80855e

The diff should be caused by this change, where @qwopqwop200 fixed sym=False in the original auto-gptq but also changed the v1/original format so that it can be correctly migrated to the internal v2 format. Right now, after the merged commit, the internal code operates in v2 mode, where both sym=True and sym=False work correctly. sym=False is only possible with gptq_v2; when saving to gptq (aka gptq_v1), we execute a v2-to-v1 conversion stage on model save.

If you saved the model in gptq format with sym=True, it will be up-converted to the v2 format on model load/inference, so the tensor output you see is in the v2 format. A version check is made so that broken v1 checkpoints (v1 with sym=False quantized prior to the merge) will assert on load.
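
To illustrate the dump above (my own sketch of the idea, not the project's actual conversion code): as I understand it, for 4-bit the v1 format packs zero_point - 1 into each nibble of qzeros while v2 packs the real zero point, so a symmetric checkpoint shows nibbles of 7 (0x77777777) in v1 and nibbles of 8 (0x88888888, printed as -2004318072 in signed int32) after up-conversion to v2. Assuming no nibble overflows, the up-conversion is just +1 per packed value:

import torch

def v1_to_v2_qzeros_4bit(qzeros_v1: torch.Tensor) -> torch.Tensor:
    # Each int32 packs eight 4-bit zero points. Adding 0x11111111 adds 1 to
    # every nibble (valid only while no nibble overflows past 0xF).
    return (qzeros_v1.to(torch.int64) + 0x11111111).to(torch.int32)

v1 = torch.full((2, 2), 0x77777777, dtype=torch.int32)
v2 = v1_to_v2_qzeros_4bit(v1)
print(v1[0, 0].item())   # 2004318071  (0x77777777)
print(v2[0, 0].item())   # -2004318072 (0x88888888 as signed int32)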

@jiqing-feng
Contributor Author

jiqing-feng commented Nov 5, 2024

> @jiqing-feng Please check this commit. c80855e
>
> The diff should be caused by this change, where @qwopqwop200 fixed sym=False in the original auto-gptq but also changed the v1/original format so that it can be correctly migrated to the internal v2 format. Right now, after the merged commit, the internal code operates in v2 mode, where both sym=True and sym=False work correctly. sym=False is only possible with gptq_v2; when saving to gptq (aka gptq_v1), we execute a v2-to-v1 conversion stage on model save.
>
> If you saved the model in gptq format with sym=True, it will be up-converted to the v2 format on model load/inference, so the tensor output you see is in the v2 format. A version check is made so that broken v1 checkpoints (v1 with sym=False quantized prior to the merge) will assert on load.

It might be a risk because both v1 and v2 models exist on the HF model hub. Do you know how to detect whether a model is v1 or v2? The CPU path currently only supports v1, so we need to convert models from v2 to v1.

@Qubitium
Collaborator

Qubitium commented Nov 5, 2024

@jiqing-feng The compat issue is only with sym=False, which was never used in the wild since the output quality was badly broken before the merge anyway. For older HF models with sym=True there is no conversion error to v2; at least that was the result of the long PR review. Check here for the discussion: #9

To prevent the v1 + sym=False combo quantized before the merge, we record a meta.quantizer version (gptqmodel added this) in the config, which is checked against the gptqmodel version (post merge) so that loading asserts. Again, I have not seen a working public HF model with old auto-gptq gptq v1 and sym=False, since they are broken to begin with.
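
For reference, a minimal sketch of how a loader could tell the two formats apart and reject the broken pre-merge combo (the field names here, checkpoint_format and meta.quantizer, are assumptions based on this discussion rather than a verified config schema):

import json

def classify_checkpoint(quantize_config_path: str) -> str:
    # Read the quantization config shipped alongside the checkpoint weights.
    with open(quantize_config_path) as f:
        cfg = json.load(f)

    fmt = cfg.get("checkpoint_format", "gptq")             # assumed key: "gptq" (v1) or "gptq_v2"
    sym = cfg.get("sym", True)
    quantizer = cfg.get("meta", {}).get("quantizer", "")   # e.g. "gptqmodel:1.1.0" (assumed value shape)

    if fmt == "gptq" and not sym and not quantizer.startswith("gptqmodel"):
        # v1 + sym=False produced by pre-fix tooling is known to be broken; refuse to load it.
        raise ValueError("Unsupported checkpoint: gptq v1 with sym=False from pre-fix quantizers")
    return fmt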

@Qubitium
Collaborator

Qubitium commented Nov 5, 2024

@jiqing-feng Here is the correct PR link for the commit in question: AutoGPTQ/AutoGPTQ#640
Note that the PR in question is actually a rebase of the PR at AutoGPTQ/AutoGPTQ#559. PR 559 is the original PR with the format/kernel changes.

@jiqing-feng
Contributor Author

Hi @Qubitium, thanks for the clarification. The PR is ready now; please see #527.

@Qubitium
Collaborator

Qubitium commented Nov 5, 2024

@jiqing-feng Thanks for the PR. Note added.

On a side note, have you tested IPEX on a Xeon 6th gen (Granite Rapids) device? I see that Xeon 5th gen (Emerald Rapids) has a 50% AI improvement over 4th gen due to AMX instructions, and on paper Xeon 6th gen has another 100% gain over 5th gen due to even better AMX hardware. How much difference in performance have you seen between Intel Xeon 4th, 5th, and 6th gen when it comes to int8 or float AI performance? Thanks. If 6th gen is really as good as the Intel tech papers say, we might want to get one to test out.

@jiqing-feng
Contributor Author

> @jiqing-feng Thanks for the PR. Note added.
>
> On a side note, have you tested IPEX on a Xeon 6th gen (Granite Rapids) device? I see that Xeon 5th gen (Emerald Rapids) has a 50% AI improvement over 4th gen due to AMX instructions, and on paper Xeon 6th gen has another 100% gain over 5th gen due to even better AMX hardware. How much difference in performance have you seen between Intel Xeon 4th, 5th, and 6th gen when it comes to int8 or float AI performance? Thanks. If 6th gen is really as good as the Intel tech papers say, we might want to get one to test out.

I am currently developing this feature on 4th Gen Xeon. Yes, the 5th and 6th Gen Xeon should show further improvement; I will collect the data and share it with you once I have finished the benchmark tests.

Qubitium closed this as completed on Nov 5, 2024