
[FEATURE] Replace QBits to IPEX for cpu inference #450

Closed
jiqing-feng opened this issue Oct 23, 2024 · 16 comments
Labels
bug Something isn't working

Comments

@jiqing-feng
Contributor

Hi @Qubitium. As QBits is no longer being developed, we are considering replacing QBits with IPEX in our open-source integrations. AutoAWQ has already finished the conversion (see here); it brings better usability and performance.

BTW, setup.py currently only supports CUDA, so I will add some parameters to enable a CPU setup. Please let me know if you have any concerns. Thanks!
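
For reference, a minimal sketch of the basic IPEX CPU inference entry point (just an illustration; the model id is a placeholder, and the real integration replaces the QBits kernel calls rather than this high-level path):

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# ipex.optimize rewrites the eager model with CPU-friendly fused ops
# (AVX-512/AMX paths where the hardware supports them).
model = ipex.optimize(model.eval(), dtype=torch.bfloat16)

inputs = tokenizer("I am happy because", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0]))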

jiqing-feng added the bug (Something isn't working) label on Oct 23, 2024
@Qubitium
Collaborator

Feel free to remove the QBits code if Intel has stopped QBits development and is now concentrating on IPEX.

On another note, are SYCL and IPEX also competing projects at Intel?

@jiqing-feng
Contributor Author

> Feel free to remove the QBits code if Intel has stopped QBits development and is now concentrating on IPEX.
>
> On another note, are SYCL and IPEX also competing projects at Intel?

They are compatible; SYCL is mostly used on Intel XPU, which is our next step. I am currently focusing on the CPU platform.

Qubitium changed the title from "Replace qbits to IPEX so can enable CPU path" to "[FEATURE] Replace QBits to IPEX for cpu inference" on Oct 24, 2024
@Qubitium
Collaborator

Qubitium commented Oct 24, 2024

@jiqing-feng Heads-up warning: we are doing quite a bit of cleanup on the codebase right now for the 1.1 major release, so base.py will be a little unstable until then, but the kernel-level API should be stable.

@jiqing-feng
Contributor Author

> @jiqing-feng Heads-up warning: we are doing quite a bit of cleanup on the codebase right now for the 1.1 major release, so base.py will be a little unstable until then, but the kernel-level API should be stable.

Sure, thanks for the reminder. Can you give me an approximate time when it will be done?

@Qubitium
Collaborator

@jiqing-feng We expect the changes/refactor to be completed by end of Friday [Oct 25th].

@Qubitium
Collaborator

@jiqing-feng The refactor of base.py is complete for the v1.1.0 release.

@jiqing-feng
Contributor Author

jiqing-feng commented Nov 1, 2024

> @jiqing-feng The refactor of base.py is complete for the v1.1.0 release.

Thanks!
Just FYI, the current GPTQModel does not support setup without CUDA, so I need to change some basic files like setup.py and gptqmodel/utils/model.py, etc., to support a CPU setup. Please let me know if you have any concerns.
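
For example, a CPU-only install could be gated behind an environment variable so the CUDA extension build is skipped entirely (a rough sketch only; the flag name, extension name, and source paths here are placeholders, not the final PR code):

import os
from setuptools import setup

# Placeholder switch: BUILD_CUDA_EXT=0 skips compiling the CUDA kernels so the
# package can be installed on CPU-only machines that use the IPEX path instead.
BUILD_CUDA_EXT = os.environ.get("BUILD_CUDA_EXT", "1") == "1"

ext_modules = []
cmdclass = {}
if BUILD_CUDA_EXT:
    from torch.utils.cpp_extension import BuildExtension, CUDAExtension

    ext_modules.append(
        CUDAExtension(
            name="gptqmodel_cuda_kernels",                   # placeholder extension name
            sources=["gptqmodel_ext/cuda/cuda_kernels.cu"],  # placeholder source path
        )
    )
    cmdclass["build_ext"] = BuildExtension

setup(
    name="gptqmodel",
    ext_modules=ext_modules,
    cmdclass=cmdclass,
)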

@Qubitium
Collaborator

Qubitium commented Nov 1, 2024

> Just FYI, the current GPTQModel does not support setup without CUDA, so I need to change some basic files like setup.py and gptqmodel/utils/model.py, etc., to support a CPU setup. Please let me know if you have any concerns.

No problem. Feel free to make the changes you see fit.

@jiqing-feng
Contributor Author

Hi @Qubitium. I have almost finished the integration but ran into a small precision issue.

I checked the differences between auto-gptq and gptqmodel with the following script:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
device = "cuda"

# Load the same quantized checkpoint with both libraries.
old_model = AutoGPTQForCausalLM.from_quantized(model_id).to(device)
new_model = GPTQModel.from_quantized(model_id).to(device)

# Compare the packed zero-point tensors of the first attention projection.
print(old_model.model.model.layers[0].self_attn.q_proj.qzeros)
print(new_model.model.model.layers[0].self_attn.q_proj.qzeros)

inputs = tokenizer("I am happy because", return_tensors="pt").to(device)

old_output = old_model.generate(**inputs, max_new_tokens=32)
new_output = new_model.generate(**inputs, max_new_tokens=32)

print("old model output")
print(tokenizer.decode(old_output[0]))
print("new model output")
print(tokenizer.decode(new_output[0]))

The outputs are:

tensor([[2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
         2004318071],
        [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
         2004318071],
        [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
         2004318071],
        ...,
        [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
         2004318071],
        [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
         2004318071],
        [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
         2004318071]], device='cuda:0', dtype=torch.int32)
tensor([[-2004318072, -2004318072, -2004318072,  ..., -2004318072,
         -2004318072, -2004318072],
        [-2004318072, -2004318072, -2004318072,  ..., -2004318072,
         -2004318072, -2004318072],
        [-2004318072, -2004318072, -2004318072,  ..., -2004318072,
         -2004318072, -2004318072],
        ...,
        [-2004318072, -2004318072, -2004318072,  ..., -2004318072,
         -2004318072, -2004318072],
        [-2004318072, -2004318072, -2004318072,  ..., -2004318072,
         -2004318072, -2004318072],
        [-2004318072, -2004318072, -2004318072,  ..., -2004318072,
         -2004318072, -2004318072]], device='cuda:0', dtype=torch.int32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
old model output
<s> I am happy because I have found a new way to enjoy my favorite foods without feeling guilty or bloated.
I am excited to share my new discovery with you!
new model output
<s> I am happy because I have found a new way to enjoy my favorite foods without feeling guilty or bloated.
I am excited to share my new discovery with you!

The qzeros differ between the two models, but the generation results are the same. Is this part of the refactor? It would be great if you could share more details about it. Thanks!

@Qubitium
Collaborator

Qubitium commented Nov 5, 2024

@jiqing-feng Please check this commit. c80855e

The diff should be caused by this change, where @qwopqwop200 fixed sym=False in the original auto-gptq but also changed the v1/original format so that it can be correctly migrated to the internal v2 format. Right now, after the merged commit, the internal code operates in v2 mode, where both sym=True and sym=False work correctly. sym=False is only possible with gptq_v2; when saving to gptq (aka gptq_v1), we execute a v2-to-v1 conversion stage on model save.

If you saved the model in gptq format with sym=True, it will be up-converted to the v2 format on model load/inference, so the tensor output you see is in the v2 format. A version check is made so that broken v1 checkpoints (v1 with sym=False quantized prior to the merge) will assert on load.
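
To illustrate the dump above (my own sketch of the idea, not the project's actual conversion code): as I understand it, for 4-bit the v1 format packs zero_point - 1 into each nibble of qzeros while v2 packs the real zero point, so a symmetric checkpoint shows nibbles of 7 (0x77777777) in v1 and nibbles of 8 (0x88888888, printed as -2004318072 in signed int32) after up-conversion to v2. Assuming no nibble overflows, the up-conversion is just +1 per packed value:

import torch

def v1_to_v2_qzeros_4bit(qzeros_v1: torch.Tensor) -> torch.Tensor:
    # Each int32 packs eight 4-bit zero points. Adding 0x11111111 adds 1 to
    # every nibble (valid only while no nibble overflows past 0xF).
    return (qzeros_v1.to(torch.int64) + 0x11111111).to(torch.int32)

v1 = torch.full((2, 2), 0x77777777, dtype=torch.int32)
v2 = v1_to_v2_qzeros_4bit(v1)
print(v1[0, 0].item())   # 2004318071  (0x77777777)
print(v2[0, 0].item())   # -2004318072 (0x88888888 as signed int32)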

@jiqing-feng
Contributor Author

jiqing-feng commented Nov 5, 2024

> @jiqing-feng Please check this commit. c80855e
>
> The diff should be caused by this change, where @qwopqwop200 fixed sym=False in the original auto-gptq but also changed the v1/original format so that it can be correctly migrated to the internal v2 format. Right now, after the merged commit, the internal code operates in v2 mode, where both sym=True and sym=False work correctly. sym=False is only possible with gptq_v2; when saving to gptq (aka gptq_v1), we execute a v2-to-v1 conversion stage on model save.
>
> If you saved the model in gptq format with sym=True, it will be up-converted to the v2 format on model load/inference, so the tensor output you see is in the v2 format. A version check is made so that broken v1 checkpoints (v1 with sym=False quantized prior to the merge) will assert on load.

It might be a risk because both v1 and v2 models exist on the HF model hub. Do you know how to detect whether a model is v1 or v2? The CPU path currently only supports v1, so we need to convert models from v2 to v1.

@Qubitium
Collaborator

Qubitium commented Nov 5, 2024

@jiqing-feng The compat issue is only with sym=False, which was never used in the wild since the output quality was badly broken before the merge anyway. For older HF models with sym=True there is no conversion error to v2; at least that was the result of the long PR review. Check here for the discussion: #9

To prevent the v1 + sym=False combo quantized before the merge, we record a meta.quantizer version (gptqmodel added this) in the config, which is checked against the gptqmodel version (post merge) so that loading asserts. Again, I have not seen a working public HF model with old auto-gptq gptq v1 and sym=False, since they are broken to begin with.
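
For reference, a minimal sketch of how a loader could tell the two formats apart and reject the broken pre-merge combo (the field names here, checkpoint_format and meta.quantizer, are assumptions based on this discussion rather than a verified config schema):

import json

def classify_checkpoint(quantize_config_path: str) -> str:
    # Read the quantization config shipped alongside the checkpoint weights.
    with open(quantize_config_path) as f:
        cfg = json.load(f)

    fmt = cfg.get("checkpoint_format", "gptq")             # assumed key: "gptq" (v1) or "gptq_v2"
    sym = cfg.get("sym", True)
    quantizer = cfg.get("meta", {}).get("quantizer", "")   # e.g. "gptqmodel:1.1.0" (assumed value shape)

    if fmt == "gptq" and not sym and not quantizer.startswith("gptqmodel"):
        # v1 + sym=False produced by pre-fix tooling is known to be broken; refuse to load it.
        raise ValueError("Unsupported checkpoint: gptq v1 with sym=False from pre-fix quantizers")
    return fmt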

@Qubitium
Collaborator

Qubitium commented Nov 5, 2024

@jiqing-feng Here is the correct PR link for the commit in question: AutoGPTQ/AutoGPTQ#640
Note that the PR in question is actually a rebase of the PR at AutoGPTQ/AutoGPTQ#559. PR 559 is the original PR with the format/kernel changes.

@jiqing-feng
Contributor Author

Hi @Qubitium, thanks for the clarification. The PR is ready now; please see #527.

@Qubitium
Collaborator

Qubitium commented Nov 5, 2024

@jiqing-feng Thanks for the PR. Note added.

On a side note, have you tested IPEX on a Xeon 6th gen (Granite Rapids) device? I see that Xeon 5th gen (Emerald Rapids) has a 50% AI improvement over 4th gen due to AMX instructions, and on paper Xeon 6th gen has another 100% gain over 5th gen due to even better AMX hardware. How much difference in performance have you seen between Intel Xeon 4th, 5th, and 6th gen when it comes to int8 or float AI performance? Thanks. If 6th gen is really as good as the Intel tech papers say, we might want to get one to test out.

@jiqing-feng
Contributor Author

> @jiqing-feng Thanks for the PR. Note added.
>
> On a side note, have you tested IPEX on a Xeon 6th gen (Granite Rapids) device? I see that Xeon 5th gen (Emerald Rapids) has a 50% AI improvement over 4th gen due to AMX instructions, and on paper Xeon 6th gen has another 100% gain over 5th gen due to even better AMX hardware. How much difference in performance have you seen between Intel Xeon 4th, 5th, and 6th gen when it comes to int8 or float AI performance? Thanks. If 6th gen is really as good as the Intel tech papers say, we might want to get one to test out.

I am currently developing this feature on 4th Gen Xeon. Yes, the 5th and 6th Gen Xeon should show further improvement; I will collect the data and share it with you once I have finished the benchmark tests.

Qubitium closed this as completed on Nov 5, 2024