
IQ1_M: 1.75 bpw quantization #6302

Merged: 24 commits merged into master from ik/iq1_m_new on Mar 26, 2024
Conversation

ikawrakow
Contributor

While waiting for the 1.58 bit era...

Compared to IQ1_S:

  • Same codebook with 2048 entries, so 11 bits per group of 8 weights - 11/8 bpw
  • Blocks of 16 weights instead of the blocks of 32 used by IQ1_S. Scales are 3-bit, so 3/16 bpw
  • A separate shift for each group of 8 weights instead of one shift per 32 weights. This costs 1/8 bpw

Along with the fp16 super-block scale this ends up being exactly 1.75 bpw.
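
As a quick back-of-the-envelope check of the bit budget above (a sketch, assuming the usual ggml super-block size of QK_K = 256 weights, i.e. 16 blocks of 16):

```python
# Bits per weight for IQ1_M, following the breakdown above.
bits_codebook = 11 / 8    # 11-bit codebook index per group of 8 weights
bits_scales   = 3 / 16    # 3-bit scale per block of 16 weights
bits_shifts   = 1 / 8     # 1 shift bit per group of 8 weights
bits_sb_scale = 16 / 256  # one fp16 scale per super-block of 256 weights
print(bits_codebook + bits_scales + bits_shifts + bits_sb_scale)  # -> 1.75
```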

The table below shows a perplexity (PPL) comparison between IQ1_S and IQ1_M (this PR). The context is 2048 tokens for LLaMA-v1 and 4096 tokens for all other models. The last column shows the rms_norm_epsilon used to generate the PR results.

| Model | PPL (IQ1_S) | PPL (IQ1_M) | rms_norm_epsilon |
|---|---|---|---|
| LLaMA-v1-7B | 12.83 | 10.13 | 5e-5 |
| LLaMA-v1-13B | 8.338 | 7.236 | 4e-5 |
| LLaMA-v1-30B | 6.722 | 6.053 | 2.5e-5 |
| LLaMA-v2-7B | 11.86 | 9.335 | 1.875e-5 |
| LLaMA-v2-13B | 7.741 | 6.842 | 2e-5 |
| LLaMA-v2-70B | 5.211 | 4.829 | 3e-5 |
| Mistral-7B | 10.42 | 8.162 | default |
| Mixtral8x7B | 6.168 | 5.574 | default |

@Nexesenex Looking forward to your improved 2.0 / sub-2.0 bpw quantization mixes.

Kawrakow added 18 commits March 25, 2024 19:15
@Nexesenex
Contributor

Nexesenex commented Mar 25, 2024

@ikawrakow Thank you so much, man!

I was almost done with my IQ1_S strategy. Mixtral caused me trouble (requantizing it endlessly is heavy), but I found my mistake and now it works as intended, with sizeable improvements in perplexity and often in ARC benchmarks.

Tonight or tomorrow I will PR an IQ1_XS LLAMA_FTYPE, which offers quality almost comparable to your current IQ1_S LLAMA_FTYPE with a slight reduction in size, to act as a new "demo of the smallest quant", before being refactored on top of the IQ1_M GGML type in a later PR.

As for the revamped IQ1_S LLAMA_FTYPE, it's almost ready as well and will follow shortly in another PR, before likewise being refactored on top of the IQ1_M GGML type in a later PR.

Then I'll (and/or you, and/or anyone, lol) work on a derived IQ1_M LLAMA_FTYPE to make the best sub-2 bpw quant possible.

ggml.h Outdated
GGML_TYPE_I16 = 26,
GGML_TYPE_I32 = 27,
GGML_TYPE_I64 = 28,
GGML_TYPE_F64 = 29,
Owner

Need to also update the enum in gguf-py/gguf/constants.py:

```python
class GGMLQuantizationType(IntEnum):
    F32 = 0
    F16 = 1
    Q4_0 = 2
    Q4_1 = 3
    Q5_0 = 6
    Q5_1 = 7
    Q8_0 = 8
    Q8_1 = 9
    Q2_K = 10
    Q3_K = 11
    Q4_K = 12
    Q5_K = 13
    Q6_K = 14
    Q8_K = 15
    IQ2_XXS = 16
    IQ2_XS = 17
    IQ3_XXS = 18
    IQ1_S = 19
    IQ4_NL = 20
    IQ3_S = 21
    IQ2_S = 22
    IQ4_XS = 23
    I8 = 24
    I16 = 25
    I32 = 26
    I64 = 27
    F64 = 28
```

Also, move GGML_TYPE_IQ1_M to the end of the enum to keep backwards compatibility with any GGUF files that might have already started using the integer or 64-bit types.
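
For reference, a minimal sketch of what the suggested gguf-py addition could look like, assuming IQ1_M is simply appended after the existing members (so it would take the value 29; earlier members are elided here):

```python
from enum import IntEnum

class GGMLQuantizationType(IntEnum):
    # ... existing members F32 = 0 through I64 = 27, as listed above ...
    F64 = 28
    IQ1_M = 29  # appended at the end so values already written to GGUF files keep their meaning
```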

ikawrakow merged commit 55c1b2a into master on Mar 26, 2024, with 51 of 57 checks passed.
ikawrakow deleted the ik/iq1_m_new branch on Mar 26, 2024 at 14:21.
@Nexesenex
Contributor

@ikawrakow, the IQ1_M quant is about twice as slow to quantize as IQ1_S (on an i7-6700K with AVX and AVX2 enabled). Is there anything that can be done about that?

@ikawrakow
Contributor Author

@ikawrakow, the IQ1_M quant is about twice as slow to quantize as IQ1_S (on an i7-6700K with AVX and AVX2 enabled). Is there anything that can be done about that?

Sorry, I did not see a way to make it more efficient. It is doing 4X the work, so being 2X slower is not too bad. Both IQ1_S and IQ1_M use the exact solution of the mixed-integer optimization problem that minimizes the difference between the fp16 weights and the ternary quantization used by these quants. I have found that heuristics that are faster but not guaranteed to find the best solution tend to produce significantly worse quantization. The solution method in IQ1_S is very effective, being O(BS^2), where BS is the block size (32 weights). But in IQ1_M we have a separate shift for each group of 8, so the only solution technique I see is O(BS^3) (but now BS = 16, so 4X the work).
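
To illustrate the kind of search involved, here is a minimal sketch (in Python, not the actual ggml code) of an exact weighted ternary fit for a single group, ignoring the shift for simplicity: for any scale the optimal assignment is monotone in the weight values, so sorting and trying every contiguous split enumerates all candidate assignments, which is the O(BS^2) family of methods mentioned above.

```python
import numpy as np

# Sketch of an exact fit: minimize sum_i w_i * (x_i - d*q_i)^2 with q_i in {-1, 0, +1}.
def exact_ternary_fit(x, w):
    order = np.argsort(x)
    xs, ws = x[order], w[order]
    n = len(xs)
    best_score, best_d, best_q = -1.0, 0.0, np.zeros(n)
    for i in range(n + 1):             # xs[:i]  -> -1
        for j in range(i, n + 1):      # xs[i:j] ->  0, xs[j:] -> +1
            q = np.concatenate([-np.ones(i), np.zeros(j - i), np.ones(n - j)])
            sumqx = np.sum(ws * q * xs)    # sum_i w_i * q_i * x_i
            sumq2 = np.sum(ws * q * q)     # sum_i w_i * q_i^2
            if sumq2 > 0 and sumqx * sumqx / sumq2 > best_score:
                # d = sumqx / sumq2 is the least-squares scale for this split
                best_score, best_d, best_q = sumqx * sumqx / sumq2, sumqx / sumq2, q
    return best_d, best_q[np.argsort(order)]   # undo the sort
```

As noted above, the separate shift for each group of 8 in IQ1_M adds another dimension to this search, which is where the O(BS^3) cost (with BS = 16) comes from.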

@ikawrakow
Contributor Author

@Nexesenex

I do have another version of IQ1_M that uses 1.8125 bpw. Quantization is much faster (basically the same speed as IQ1_S), and the PPL-vs-size tradeoff is better (see the graph below, which shows results for LLaMA-v2-70B).

[graph iq1_70: PPL vs size for LLaMA-v2-70B]

The reason I'm reluctant to make a PR is that it uses an even larger codebook (4096 entries vs 2048 in the IQ1_M on master). CUDA on my GPU (RTX 4080) handles the associated large lookup table quite well: performance decreases only by ~2%, from 198 t/s to 190 t/s for a 7B model. But on my Ryzen 5950X CPU, the AVX2 implementation drops from 24 t/s to 15 t/s. I have not even bothered implementing it for Apple Silicon, but based on experience with other quants, I expect a complete disaster there.
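
For context, a schematic sketch (hypothetical, not the actual ggml kernels) of why the codebook size matters for decoding speed: dequantization is essentially a table lookup, so a 4096-entry grid doubles the table that has to stay resident in cache or constant memory compared to the 2048-entry one.

```python
import numpy as np

# Placeholder codebook; the real IQ1 grid is a fixed table in ggml.
rng = np.random.default_rng(0)
codebook = rng.integers(-1, 2, size=(2048, 8))  # 2048 patterns of 8 ternary values

def dequantize_group(index: int, scale: float, shift: float) -> np.ndarray:
    # Schematic reconstruction of 8 weights from one 11-bit codebook index,
    # a block scale and a group shift (the exact ggml formula differs in details).
    return scale * (codebook[index] + shift)
```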

@Nexesenex
Contributor

Nexesenex commented Mar 29, 2024

@ikawrakow I understand that speed on all platforms has its relative importance in the final choices, as size does, but it's a pity to leave such jewels on the shelf!

Could you eventually share the quant as a "CUDA-optimized quant" for those interested in using it?

Ultimately, even if the "one quant for all architectures" approach is pertinent for the sake of compatibility, the differences between architectures should also be accounted for, so that we don't rely only on the lowest common denominator but also offer the best option for each case, and thus SOTA quants for the important common cases, CUDA being one of them.

In my opinion, if llama.cpp doesn't integrate this approach, others eventually will.

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* iq1_m: basics

* iq1_m: basics-2

* iq1_m: CUDA dequantize works

Very 1st shot I get PPL = 9.76 for LLaMA-v2-7B.

* iq1_m: separate shifts for each group of 8 in a block

We get
PPL(LLaMA-v2-7B ) = 9.2810
PPL(LLaMA-v2-13B) = 6.8105

Not bad, but slightly higher than
  sqrt(PPL(IQ1_S) * PPL(IQ2_XXS))
which is the expected outcome given that IQ1_M is
halfway between IQ1_S and IQ2_XXS in terms of bpw.
From this, we would expect
 PPL = 9.14 for LLaMA-v2-7B
 PPL = 6.63 for LLaMA-v2-13B

* iq1_m: go to 3-bit scales

There is slight increase in PPL, but the 0.0625 bpw reduction
in size is totally worth it.

We now have
PPL(LLaMA-v2-7B ) = 9.4469 at 1.96 bpw
PPL(LLaMA-v2-13B) = 6.8717 at 1.93 bpw
PPL(LLaMA-v2-70B) = 4.8568 at 1.85 bpw

* iq1_m: scalar dot product

* iq1_m: AVX2 dot product

* iq1_m: very slightly faster AVX2 dot product

* iq1_m: ARM_NEON dot product

Works, but very slow (10.5 t/s)

* iq1_m: Metal - dequantize works, dot product does not

* iq1_m: Metal now works

About the same performance as iq1_s.

* iq1_m: minor

* iq1_m: checking pure iq1_m quantization

It is pretty bad: PPL(LLaMA-v2-7B) = 34 if we quantize output.weight
with Q4_K.

* iiq1_m: slightly faster ARM_NEON dot product

10.5 t/s -> 11.65 t/s

* iq1_m: faster ARM_NEON dot product

11.65 t/s -> 14.9 t/s

* iq1_m: another minor ARM_NEON dot product improvement

14.9 -> 15.0 t/s

* iq1_m: small PPL improvement via super-block scale adjustment

After quantizing block scales redo the super-block scale fit.

PPL(LLaMA-v2-7B ) = 9.3346
PPL(LLaMA-v2-13B) = 6.8419
PPL(LLaMA-v2-70B) = 4.8294
PPL(Mistral-7B  ) = 8.1624

* iq1_m: adapt to CUDA refactoring

* iq1_m: remove unused variable

We have progressed to warnings being errors.

* iq1_m: add to backend-ops tests

* iq1_m: fix Windows ARM

* iq1_m: use common definition of iq1m_scale_t

* cuda: assert -> NO_DEVICE_CODE

* iq1_M: PR comments

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024
tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024
mofosyne added the labels "Tensor Encoding Scheme" (https://github.com/ggerganov/llama.cpp/wiki/Tensor-Encoding-Schemes) and "Review Complexity: High" (generally requires in-depth knowledge of LLMs or GPUs) on May 25, 2024.
mishig25 pushed a commit to huggingface/huggingface.js that referenced this pull request Jun 3, 2024
Bring `GGMLQuantizationType` up to date; adds `I8`, `I16`, `I32`, `I64`,
`F64`, `IQ1_M` and `BF16`.

Added in:
* ggerganov/llama.cpp#6045
* ggerganov/llama.cpp#6062
* ggerganov/llama.cpp#6302
* ggerganov/llama.cpp#6412