Add LLMs quantization model list and recipes (#1504)
Signed-off-by: chensuyue <[email protected]>
chensuyue authored Dec 29, 2023
1 parent 7634409 commit f19cc9d
Showing 7 changed files with 48 additions and 8 deletions.
7 changes: 4 additions & 3 deletions README.md
@@ -5,12 +5,12 @@ Intel® Neural Compressor
<h3> An open-source Python library supporting popular model compression techniques on all mainstream deep learning frameworks (TensorFlow, PyTorch, ONNX Runtime, and MXNet)</h3>

[![python](https://img.shields.io/badge/python-3.8%2B-blue)](https://github.com/intel/neural-compressor)
[![version](https://img.shields.io/badge/release-2.4-green)](https://github.com/intel/neural-compressor/releases)
[![version](https://img.shields.io/badge/release-2.4.1-green)](https://github.com/intel/neural-compressor/releases)
[![license](https://img.shields.io/badge/license-Apache%202-blue)](https://github.com/intel/neural-compressor/blob/master/LICENSE)
[![coverage](https://img.shields.io/badge/coverage-85%25-green)](https://github.com/intel/neural-compressor)
[![Downloads](https://static.pepy.tech/personalized-badge/neural-compressor?period=total&units=international_system&left_color=grey&right_color=green&left_text=downloads)](https://pepy.tech/project/neural-compressor)

[Architecture](./docs/source/design.md#architecture)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Workflow](./docs/source/design.md#workflow)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/validated_model_list.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](./examples/README.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentations](https://intel.github.io/neural-compressor)
[Architecture](./docs/source/design.md#architecture)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Workflow](./docs/source/design.md#workflow)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[LLMs Recipes](./docs/source/llm_recipes.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/validated_model_list.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentations](https://intel.github.io/neural-compressor)

---
<div align="left">
@@ -72,8 +72,9 @@ q_model = fit(
<tr>
<td colspan="2" align="center"><a href="./docs/source/design.md#architecture">Architecture</a></td>
<td colspan="2" align="center"><a href="./docs/source/design.md#workflow">Workflow</a></td>
<td colspan="1" align="center"><a href="https://intel.github.io/neural-compressor/latest/docs/source/api-doc/apis.html">APIs</a></td>
<td colspan="1" align="center"><a href="./docs/source/llm_recipes.md">LLMs Recipes</a></td>
<td colspan="2" align="center"><a href="examples/README.md">Examples</a></td>
<td colspan="2" align="center"><a href="https://intel.github.io/neural-compressor/latest/docs/source/api-doc/apis.html">APIs</a></td>
</tr>
</tbody>
<thead>
2 changes: 1 addition & 1 deletion conda_meta/basic/meta.yaml
@@ -1,4 +1,4 @@
{% set version = "2.4" %}
{% set version = "2.4.1" %}
{% set buildnumber = 0 %}
package:
name: neural-compressor
2 changes: 1 addition & 1 deletion conda_meta/neural_insights/meta.yaml
@@ -1,4 +1,4 @@
{% set version = "2.4" %}
{% set version = "2.4.1" %}
{% set buildnumber = 0 %}
package:
name: neural-insights
2 changes: 1 addition & 1 deletion conda_meta/neural_solution/meta.yaml
@@ -1,4 +1,4 @@
{% set version = "2.4" %}
{% set version = "2.4.1" %}
{% set buildnumber = 0 %}
package:
name: neural-solution
27 changes: 27 additions & 0 deletions docs/source/llm_recipes.md
@@ -0,0 +1,27 @@
LLM Quantization Models and Recipes
---

Intel® Neural Compressor supports advanced quantization technologies for large language models (LLMs), including SmoothQuant (SQ) and Weight-Only Quantization (WOQ),
and has verified a list of LLMs on the 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with [PyTorch](https://pytorch.org/),
[Intel® Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch) and [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers).
This document publishes the specific recipes we achieved for popular LLMs and helps users quickly obtain an optimized LLM with less than 1% accuracy loss.

> Notes:
> - The quantization algorithms are provided by [Intel® Neural Compressor](https://github.com/intel/neural-compressor), and the evaluation functions are provided by [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers).
> - The model list is continuously updated; expect to find more LLMs in the future.

## IPEX key models
| Models                    | SQ INT8 | WOQ INT8 | WOQ INT4 |
|:-------------------------:|:-------:|:--------:|:--------:|
| EleutherAI/gpt-j-6b       |    ✔    |    ✔     |    ✔     |
| facebook/opt-1.3b         |    ✔    |    ✔     |    ✔     |
| facebook/opt-30b          |    ✔    |    ✔     |    ✔     |
| meta-llama/Llama-2-7b-hf  |    ✔    |    ✔     |    ✔     |
| meta-llama/Llama-2-13b-hf |    ✔    |    ✔     |    ✔     |
| meta-llama/Llama-2-70b-hf |    ✔    |    ✔     |    ✔     |
| tiiuae/falcon-40b         |    ✔    |    ✔     |    ✔     |

**Detailed recipes can be found [HERE](https://github.com/intel/intel-extension-for-transformers/examples/huggingface/pytorch/text-generation/quantization/llm_quantization_recipes.md).**
> Notes:
> - This model list comes from [IPEX](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/llm.html).
> - WOQ INT4 recipes will be published soon.
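
As a rough illustration of how an SQ recipe from the table maps onto the Intel® Neural Compressor API, here is a minimal sketch (not part of this commit). It uses `facebook/opt-125m` as a small stand-in model with a toy calibration loader; the per-model alpha values come from the validated recipes, e.g. `--alpha 1.0` for gpt-j-6b in the benchmark script changed below.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor import PostTrainingQuantConfig, quantization

# facebook/opt-125m is a small stand-in for the LLMs listed above.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# Toy calibration data; the published recipes calibrate on a real dataset.
input_ids = tokenizer(["A short calibration sample for SmoothQuant."] * 8,
                      return_tensors="pt", padding=True)["input_ids"]
calib_dataloader = torch.utils.data.DataLoader(input_ids, batch_size=1)

# SmoothQuant migrates activation outliers into the weights (strength set
# by alpha) before static INT8 post-training quantization.
conf = PostTrainingQuantConfig(
    backend="ipex",  # execute through Intel® Extension for PyTorch
    recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5}},
)
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
q_model.save("./saved_results")
```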
@@ -79,6 +79,10 @@ function run_benchmark {
model_name_or_path="facebook/opt-125m"
approach="weight_only"
extra_cmd=$extra_cmd" --woq_algo GPTQ"
elif [ "${topology}" = "opt_125m_woq_gptq_debug_int4" ]; then
model_name_or_path="facebook/opt-125m"
approach="weight_only"
extra_cmd=$extra_cmd" --woq_algo GPTQ --woq_bits 4 --woq_scheme asym --woq_group_size 128 --gptq_use_max_length --gptq_debug"
elif [ "${topology}" = "opt_125m_woq_teq" ]; then
model_name_or_path="facebook/opt-125m"
approach="weight_only"
@@ -98,13 +102,21 @@ function run_benchmark {
elif [ "${topology}" = "gpt_j_ipex_sq" ]; then
model_name_or_path="EleutherAI/gpt-j-6b"
extra_cmd=$extra_cmd" --ipex --sq --alpha 1.0"
elif [ "${topology}" = "gpt_j_woq_rtn" ]; then
elif [ "${topology}" = "gpt_j_woq_rtn_int4" ]; then
model_name_or_path="EleutherAI/gpt-j-6b"
approach="weight_only"
extra_cmd=$extra_cmd" --woq_algo RTN --woq_bits 4 --woq_group_size 128 --woq_scheme asym --woq_enable_mse_search"
elif [ "${topology}" = "gpt_j_woq_gptq_debug_int4" ]; then
model_name_or_path="EleutherAI/gpt-j-6b"
approach="weight_only"
extra_cmd=$extra_cmd" --woq_algo GPTQ --woq_bits 4 --woq_group_size 128 --woq_scheme asym --gptq_use_max_length --gptq_debug"
elif [ "${topology}" = "falcon_7b_sq" ]; then
model_name_or_path="tiiuae/falcon-7b-instruct"
extra_cmd=$extra_cmd" --sq --alpha 0.5"
elif [ "${topology}" = "falcon_7b_woq_gptq_debug_int4" ]; then
model_name_or_path="tiiuae/falcon-7b-instruct"
approach="weight_only"
extra_cmd=$extra_cmd" --woq_algo GPTQ --woq_bits 4 --woq_group_size 128 --woq_scheme asym --gptq_use_max_length --gptq_debug"
fi

python -u run_clm_no_trainer.py \
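
The `--woq_algo GPTQ --woq_bits 4 --woq_scheme asym --woq_group_size 128 --gptq_use_max_length` flags added above translate, roughly, into a weight-only `PostTrainingQuantConfig` inside `run_clm_no_trainer.py`. A hedged sketch of that mapping, with field names assumed from the Intel® Neural Compressor 2.x weight-only API:

```python
from neural_compressor import PostTrainingQuantConfig

# Roughly what the *_woq_gptq_debug_int4 topologies request: 4-bit
# asymmetric weight-only quantization via GPTQ with group size 128.
conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # apply to all quantizable op types
            "weight": {
                "bits": 4,            # --woq_bits 4
                "group_size": 128,    # --woq_group_size 128
                "scheme": "asym",     # --woq_scheme asym
                "algorithm": "GPTQ",  # --woq_algo GPTQ
            },
        },
    },
    op_name_dict={"lm_head": {"weight": {"dtype": "fp32"}}},  # keep the LM head in FP32
    recipes={"gptq_args": {"use_max_length": True}},          # --gptq_use_max_length
)
# As in the SQ sketch above, pass conf together with a calibration dataloader
# to neural_compressor.quantization.fit(model, conf, calib_dataloader=...).
```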
2 changes: 1 addition & 1 deletion neural_compressor/version.py
@@ -15,4 +15,4 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Intel® Neural Compressor: An open-source Python library supporting popular model compression techniques."""
__version__ = "2.4"
__version__ = "2.4.1"
