

iq2_xxs: tune quantization #5320

Merged 1 commit into master from ik/iq2xxs_tune on Feb 5, 2024

Conversation

@ikawrakow (Contributor) commented on Feb 4, 2024

We get slightly better PPL, and we cut quantization time nearly in half.

The trick is to first quantize without forcing points onto the E8 lattice. We can then use a narrower search range around the block scale obtained that way, which is what gives the significant reduction in quantization time.

The code becomes simpler too, so it is a win-win.
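
To make the two-stage idea above concrete, here is a minimal, self-contained sketch of a narrowed block-scale search. It is not the actual iq2_xxs quantization code: the E8-lattice step is replaced by a plain round-to-nearest grid, and the initial-scale heuristic as well as the constants `NSTEPS` and `STEP` are illustrative assumptions only.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Stand-in for the constrained quantization step: squared error of the block
// when every value is snapped to the nearest grid point at scale d.
// (The real iq2_xxs code snaps groups of weights onto the E8 lattice instead.)
static float block_error(const std::vector<float>& x, float d) {
    if (d <= 0.0f) return 1e30f;
    float err = 0.0f;
    for (float v : x) {
        float q = std::round(v / d);   // nearest grid point
        float r = v - d * q;
        err += r * r;
    }
    return err;
}

int main() {
    std::vector<float> block = {0.91f, -0.42f, 0.33f, -1.20f, 0.05f, 0.77f, -0.63f, 1.02f};

    // Stage 1: cheap initial block scale without any lattice constraint
    // (here simply max|x| divided by the largest grid level; illustrative only).
    float amax = 0.0f;
    for (float v : block) amax = std::max(amax, std::fabs(v));
    float d0 = amax / 3.0f;

    // Stage 2: search only a narrow window of candidate scales around d0,
    // evaluating the constrained (grid/lattice) error for each candidate.
    // A narrow window means far fewer candidates, hence faster quantization.
    const int   NSTEPS = 4;      // half-width of the search window
    const float STEP   = 0.05f;  // relative spacing between candidates
    float best_d = d0, best_err = block_error(block, d0);
    for (int is = -NSTEPS; is <= NSTEPS; ++is) {
        float d = d0 * (1.0f + STEP * is);
        float e = block_error(block, d);
        if (e < best_err) { best_err = e; best_d = d; }
    }
    std::printf("initial scale %.4f -> tuned scale %.4f (err %.4f)\n", d0, best_d, best_err);
    return 0;
}
```

The point is only the structure: a cheap unconstrained pass fixes the neighbourhood of the block scale, so the expensive constrained search only has to sweep a handful of candidates instead of a wide range.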

Here is a PPL comparison between this PR and PR #4773 (master) for a context of 4096:

| Model | File size (GiB) | PPL master | PPL PR |
|---|---|---|---|
| Mistral-7B | 1.855 | 6.446 | 6.448 |
| LLaMA-v2-7B | 1.728 | 7.067 | 7.048 |
| LLaMA-v2-13B | 3.295 | 5.728 | 5.672 |
| LLaMA-v2-70B | 17.03 | 4.079 | 4.057 |
| Mixtral-8x7B | 11.44 | 4.948 | 4.904 |

@sorasoras commented:

Could this be applied to other IQ quants?

@ikawrakow (Contributor, Author) replied:

> Could this be applied to other IQ quants?

Yes, but with much less gain. That is, one either gets an increase in PPL if one reduces the scale search range as aggressively as here, or one keeps about the same PPL but with a much smaller speedup.

@ikawrakow merged commit 6fdfa2e into master on Feb 5, 2024 (54 of 56 checks passed).
@ikawrakow deleted the ik/iq2xxs_tune branch on February 5, 2024 at 08:46.

@Nexesenex (Contributor) commented:

Any noticeable speed-up you can offer us at close-to-equal perplexity is interesting for the CPU-poor, @ikawrakow!

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024