update readme #220

Merged: 3 commits, Aug 12, 2024

Changes from 1 commit
21 changes: 14 additions & 7 deletions README.md
@@ -13,8 +13,8 @@ AutoRound
AutoRound is an advanced quantization algorithm for low-bit LLM inference. It's tailored for a wide range
of models. Our method adopts sign gradient descent to fine-tune rounding values and minmax values of weights in just 200
steps,
-which competes impressively against recent methods without introducing any additional inference overhead. The below
-image presents an overview of AutoRound. Check out our updated paper on [arxiv](https://arxiv.org/pdf/2309.05516v4)
+which competes impressively against recent methods without introducing any additional inference overhead while keeping tuning costs low. The below
+image presents an overview of AutoRound. Check out our paper on [arxiv](https://arxiv.org/pdf/2309.05516v4) for more details, and visit [low_bit_open_llm_leaderboard](https://huggingface.co/spaces/Intel/low_bit_open_llm_leaderboard) for more accuracy data across various models.

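A rough, self-contained sketch of this idea (illustrative only, not the library's implementation): a per-weight rounding perturbation `v` and min/max scale factors are tuned with sign gradient descent for roughly 200 steps to minimize one layer's quantization error. All tensor names, shapes, and hyperparameters below are made up for the example.

```python
import torch

def ste_round(x):
    # Straight-through estimator: round in the forward pass, identity gradient in backward.
    return x + (x.round() - x).detach()

def fake_quant(W, v, alpha, beta, bits=4):
    # Asymmetric fake quantization; v perturbs the rounding decision, alpha/beta rescale the range.
    qmax = 2 ** bits - 1
    wmax, wmin = W.max() * alpha, W.min() * beta
    scale = (wmax - wmin) / qmax
    q = torch.clamp(ste_round((W - wmin) / scale + v), 0, qmax)
    return q * scale + wmin

torch.manual_seed(0)
W = torch.randn(64, 64)                          # one layer's weights
X = torch.randn(128, 64)                         # small calibration batch
v = torch.zeros_like(W, requires_grad=True)      # rounding perturbation, kept in [-0.5, 0.5]
alpha = torch.ones((), requires_grad=True)       # max-range scale
beta = torch.ones((), requires_grad=True)        # min-range scale
step = 1.0 / 200                                 # fixed step size for the sign updates

for _ in range(200):                             # ~200 tuning steps, as described above
    loss = ((X @ fake_quant(W, v, alpha, beta).T - X @ W.T) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        for p in (v, alpha, beta):               # sign gradient descent update
            p -= step * p.grad.sign()
            p.grad = None
        v.clamp_(-0.5, 0.5)

print(f"final reconstruction loss: {loss.item():.6f}")
```
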
<div align="center">

@@ -24,12 +24,12 @@ image presents an overview of AutoRound. Check out our updated paper on [arxiv](

## What's New

-* [2024/08] Enabled the export and inference of the quantized model to the AutoRound format on Intel Gaudi2 devices, please refer to [Intel/Qwen2-7B-int4-inc](https://huggingface.co/Intel/Qwen2-7B-int4-inc).
+* [2024/08] The AutoRound format now supports Intel Gaudi2 devices. For an example, please refer to [Intel/Qwen2-7B-int4-inc](https://huggingface.co/Intel/Qwen2-7B-int4-inc).
* [2024/08] AutoRound includes several experimental features, e.g., activation quantization, mx_fp data type, and fast tuning of norm/bias parameters.
* [2024/07] Important change: the default value of nsamples has been changed from 512 to 128 to reduce memory usage, which may cause a slight accuracy drop in some scenarios (see the sketch after this list for restoring the previous value).
* [2024/06] AutoRound format supports mixed bit-widths and group sizes for inference, resolving the significant performance drop issue with the asymmetric kernel.
* [2024/05] AutoRound supports lm-head quantization, saving 0.7G for LLaMA3-8B at W4G128.
-* [2024/05] AutoRound performs well
-  in [low_bit_open_llm_leaderboard](https://huggingface.co/spaces/Intel/low_bit_open_llm_leaderboard)
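
As a hedged illustration of the nsamples note above (constructor arguments are assumed from the README's other examples and this changelog entry, not verified against a specific release), the previous calibration size can be restored by passing it explicitly:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"   # small model, just for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# nsamples=512 restores the pre-2024/07 calibration size; the new default is 128.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, nsamples=512)
autoround.quantize()
```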


## Prerequisites

@@ -138,7 +138,7 @@ autoround.save_quantized(output_dir) ##save_quantized(output_dir,format="auto_g

Please run the quantization code first

-### Setup env for AutoRound format
+### AutoRound format

**cuda**: `git clone https://github.com/intel/auto-round.git && cd auto-round && pip install -vvv --no-build-isolation -e .`
@@ -152,7 +152,7 @@ Please run the quantization code first
**hpu**: a docker image with the Gaudi Software Stack is recommended. More details can be found
in the [Gaudi Guide](https://docs.habana.ai/en/latest/).

-### Gaudi2/ CPU/ GPU
+#### Gaudi2/ CPU/ GPU

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# ... lines collapsed in the diff view (@@ -171,6 +171,13 @@) ...
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```

### AutoGPTQ/AutoAWQ format

1. Please save the quantized model by modifying the code as follows: `autoround.save_quantized(output_dir, format="auto_gptq")` or `autoround.save_quantized(output_dir, format="auto_awq")`.

2. Refer to their repositories to run inference with the model; a combined sketch of both steps follows below.
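
Below is an illustrative sketch of both steps combined (not taken from the README): it exports to the AutoGPTQ format and then reloads the checkpoint through transformers' GPTQ integration rather than through the AutoGPTQ repository directly. It assumes `autoround` is the AutoRound object from the quantization example above, that optimum/auto-gptq are installed, and that the output path is hypothetical.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

output_dir = "./tmp_autoround_gptq"                          # hypothetical location
autoround.save_quantized(output_dir, format="auto_gptq")     # or format="auto_awq"

# Reload with transformers; GPTQ checkpoints are handled via the optimum/auto-gptq integration.
model = AutoModelForCausalLM.from_pretrained(output_dir, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(output_dir)
inputs = tokenizer("There is a girl who likes adventure,", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```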


## Support List

| Model | Supported |