refine docs, add accuracy data, add recipe and eval scripts #226

Merged
merged 13 commits on Aug 27, 2024
2 changes: 2 additions & 0 deletions README.md
@@ -188,6 +188,8 @@ Please note that an asterisk (*) indicates third-party quantized models, which m

| Model | Supported |
|--------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| microsoft/Phi-3-vision-128k-instruct | [recipe](./examples/multimodal-modeling/Phi-3-vision/run_autoround.sh) |
| Qwen/Qwen-VL | [accuracy](./examples/multimodal-modeling/Qwen-VL/README.md), [recipe](./examples/multimodal-modeling/Qwen-VL/run_autoround.sh) |
| meta-llama/Meta-Llama-3.1-70B-Instruct | [recipe](https://huggingface.co/Intel/Meta-Llama-3.1-70B-Instruct-int4-inc) |
| meta-llama/Meta-Llama-3.1-8B-Instruct | [model-kaitchup-autogptq-int4*](https://huggingface.co/kaitchup/Meta-Llama-3.1-8B-Instruct-autoround-gptq-4bit-asym), [model-kaitchup-autogptq-sym-int4*](https://huggingface.co/kaitchup/Meta-Llama-3.1-8B-Instruct-autoround-gptq-4bit-sym), [recipe](https://huggingface.co/Intel/Meta-Llama-3.1-8B-Instruct-int4-inc) |
| meta-llama/Meta-Llama-3.1-8B | [model-kaitchup-autogptq-sym-int4*](https://huggingface.co/kaitchup/Meta-Llama-3.1-8B-autoround-gptq-4bit-sym) |
12 changes: 4 additions & 8 deletions examples/multimodal-modeling/Llava/README.md
@@ -6,6 +6,8 @@ This document presents step-by-step instructions for auto-round.

In this example, we introduce a straightforward way to run quantization on popular multimodal models such as LLaVA.

Please note that LLaVA quantization is currently an **experimental feature**, and exported models do not yet support inference on various devices.

## Install
If you are not using Linux, do NOT proceed; see the instructions for [macOS](https://github.com/haotian-liu/LLaVA/blob/main/docs/macOS.md) and [Windows](https://github.com/haotian-liu/LLaVA/blob/main/docs/Windows.md).

@@ -62,11 +64,11 @@ Include the flag `--adam`. Note that AdamW is less effective than sign gradient

- **Running on Intel Gaudi2**
```bash
bash run_autoround_on_gaudi.sh
bash run_autoround.sh
```

## 4. Results
We use the [COCO 2017](https://cocodataset.org/) and [LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) datasets for quantization calibration and the TextVQA dataset for evaluation. When the vision components are not involved in quantization, the accuracy loss stays within 1%. The results for LLaVA-7b are as follows:
We use the [COCO 2017](https://cocodataset.org/) and [LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) datasets for quantization calibration and the TextVQA dataset for evaluation. When the vision components are not involved in quantization, the accuracy loss stays within 1%. The results for the fake-quantized (simulated quantization) LLaVA-7b are as follows:
| Model | Config | Precision | Hyperparameter | Accuracy% | Relative drop |
| :----: | :----: | :----: | :----: | :----: | :----: |
| liuhaotian/llava-v1.5-7b | - | FP16 | - | 58.21 | - |
@@ -96,9 +98,3 @@ If you find SignRound useful for your research, please cite our paper:
```








71 changes: 68 additions & 3 deletions examples/multimodal-modeling/Phi-3-vision/README.md
@@ -16,6 +16,8 @@ COCO: [train2017](http://images.cocodataset.org/zips/train2017.zip), and unzip t


## 2. Run Examples
PyTorch 1.8 or a higher version is required.

Enter the examples folder and install lm-eval to run the evaluation:
```bash
pip install -r requirements.txt
@@ -47,13 +49,75 @@ Include the flag `--adam`. Note that AdamW is less effective than sign gradient

- **Running on Intel Gaudi2**
```bash
bash run_autoround_on_gaudi.sh
bash run_autoround.sh
```


## 3. Environment
## 3. Run Inference

```python
from PIL import Image
import requests
import io
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
from auto_round.auto_quantizer import AutoHfQuantizer
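# note: the AutoHfQuantizer import above registers AutoRound's quantization backend
# with Transformers, so that from_pretrained below can load the quantized checkpoint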
quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path, device_map="auto", trust_remote_code=True, torch_dtype="auto", _attn_implementation='flash_attention_2') # use _attn_implementation='eager' to disable flash attention

processor = AutoProcessor.from_pretrained(quantized_model_path, trust_remote_code=True)

messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
    {"role": "assistant", "content": "The chart displays the percentage of respondents who agree with various statements about their preparedness for meetings. It shows five categories: 'Having clear and pre-defined goals for meetings', 'Knowing where to find the information I need for a meeting', 'Understanding my exact role and responsibilities when I'm invited', 'Having tools to manage admin tasks like note-taking or summarization', and 'Having more focus time to sufficiently prepare for meetings'. Each category has an associated bar indicating the level of agreement, measured on a scale from 0% to 100%."},
    {"role": "user", "content": "Provide insightful questions to spark discussion."},
]

url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
# image = Image.open(requests.get(url, stream=True).raw)
image = Image.open(io.BytesIO(requests.get(url, stream=True).content))

prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

generation_args = {
    "max_new_tokens": 50,
    "temperature": 0.0,
    "do_sample": False,
}

generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

# remove input tokens
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(response)
# 1. How does the level of agreement on each statement reflect the overall preparedness of respondents for meetings?
# 2. What are the most and least agreed-upon statements, and why might that be the case?
# 3.
```
<!--

## 4. Results
We use the [COCO 2017](https://cocodataset.org/) and [LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) datasets for quantization calibration and lm-eval tasks for evaluation. Please follow the [recipe](./run_autoround.sh) and [evaluation script](./run_eval.sh) to reproduce the results. The results for Phi-3-vision-128k-instruct are as follows:
| Metric | bf16 | INT4 |
|----------------|--------|--------|
| avg | 0.6014 | 0.5940 |
| mmlu | 0.6369 | 0.6310 |
| lambada_openai | 0.6487 | 0.6406 |
| hellaswag | 0.5585 | 0.5483 |
| winogrande | 0.7395 | 0.7451 |
| piqa | 0.7954 | 0.7889 |
| truthfulqa_mc1 | 0.3084 | 0.2987 |
| openbookqa | 0.3580 | 0.3600 |
| boolq | 0.8532 | 0.8557 |
| arc_easy | 0.8371 | 0.8346 |
| arc_challenge | 0.5572 | 0.5469 |
| cmmlu | 0.4074 | 0.3950 |
| ceval | 0.4027 | 0.4012 |
| gsm8k | 0.7157 | 0.6755 |
-->

PyTorch 1.8 or higher version is needed


## Reference
@@ -72,3 +136,4 @@ If you find SignRound useful for your research, please cite our paper:




17 changes: 12 additions & 5 deletions examples/multimodal-modeling/Phi-3-vision/eval_042/evaluation.py
@@ -576,6 +576,10 @@ def evaluate(
parser.add_argument(
    "--eval_bs", default=1,
)
parser.add_argument(
    "--device", default="cuda:0",
    help="PyTorch device (e.g. cpu/cuda:0/hpu) for evaluation."
)
parser.add_argument(
    "--trust_remote_code", action='store_true',
    help="Whether to enable trust_remote_code"
@@ -600,17 +604,20 @@
model_args += f",autogptq=True,gptq_use_triton=True"
if args.trust_remote_code:
model_args += f",trust_remote_code=True"
model_args += ",dtype=bfloat16"
test_tasks = args.tasks
if isinstance(test_tasks, str):
test_tasks = test_tasks.split(',')
model_name = args.model_name.rstrip('/')
from lm_eval.utils import make_table
result = simple_evaluate(model="hf",
model_args=model_args,
tasks=test_tasks,
batch_size=args.eval_bs)
with torch.cuda.amp.autocast():
result = simple_evaluate(model="hf",
model_args=model_args,
tasks=test_tasks,
device=args.device,
batch_size=args.eval_bs)
print(make_table(result))

print("cost time: ", time.time() - s)


1 change: 1 addition & 0 deletions examples/multimodal-modeling/Phi-3-vision/main.py
@@ -464,3 +464,4 @@ def create_data_loader(dataset, batch_size=1, data_collator=None):
from lm_eval.utils import make_table

print(make_table(res))

3 changes: 3 additions & 0 deletions examples/multimodal-modeling/Phi-3-vision/run_autoround.sh
@@ -6,6 +6,9 @@ CUDA_VISIBLE_DEVICES=$device \
python3 main.py \
--model_name=$model_name \
--deployment_device 'auto_round' \
--nsamples 512 \
--model_dtype bf16 \
--image_folder /PATH/TO/coco/images/train2017 \
--question_file /PATH/TO/llava_v1_5_mix665k.json \
--output_dir "./tmp_autoround"
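For orientation, a minimal sketch of the core auto-round Python API that shell recipes like the one above ultimately drive is shown below. It is a text-only illustration under stated assumptions: it does not reproduce the multimodal calibration (the `--image_folder`/`--question_file` data that `main.py` prepares), and the model name is only a placeholder.

```python
# Minimal sketch of the core auto-round API (text-only calibration data).
# The multimodal recipe above goes through main.py instead, which builds the
# COCO / LLaVA-Instruct calibration set; this only illustrates the knobs that
# run_autoround.sh exposes (bits, group size, nsamples).
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"  # placeholder text-only model for illustration
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(model, tokenizer, bits=4, group_size=128, nsamples=512)
autoround.quantize()
autoround.save_quantized("./tmp_autoround")
```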

This file was deleted.

55 changes: 9 additions & 46 deletions examples/multimodal-modeling/Phi-3-vision/run_eval.sh
@@ -1,48 +1,11 @@
export https_proxy=http://proxy.ims.intel.com:911
export http_proxy=http://proxy.ims.intel.com:911
export HF_HOME=/home/weiweiz1/.cache/
#!/bin/bash
set -x
device=0

# Mistral-7B-Instruct-v0.2
# device=3
# Baichuan2-7B-Chat Phi-3-mini-4k-instruct
# Llama-2-7b-chat-hf
# lambada_openai,hellaswag,piqa,winogrande,truthfulqa_mc1,openbookqa,boolq,arc_easy,arc_challenge,mmlu,
# ceval-valid,cmmlu
# dir=/data5/zww/test_faster/
# dir=/models
# for model in Phi-3-mini-4k-instruct Meta-Llama-3-8B-Instruct
# do
# echo ${model}/default
# CUDA_VISIBLE_DEVICES=$device \
# python3 eval_042/evaluation.py --model_name ${dir}${model}_default/$model-autoround-w4g128-gpu \
# --trust_remote_code \
# --eval_bs 16 --tasks gsm8k,ceval-valid,cmmlu \
# 2>&1| tee -a /data4/zww/test_faster/rounding_${model}_rtn.txt
# echo ${model}/rtn
# done&

device=2
dir=/data4/zww/tmp/
# dir=/data5/models/
for model in Phi-3-vision-128k-instruct
do
echo ${model}
CUDA_VISIBLE_DEVICES=$device \
python3 eval_042/evaluation.py --model_name ${dir}/$model-autoround-w4g128-round \
--trust_remote_code \
--eval_bs 16 --tasks lambada_openai \
2>&1| tee -a /data4/zww/test_faster/rounding_${model}.txt
echo ${model}
done
# dir=/data5/zww/test_faster/
# for model in Phi-3-mini-4k-instruct Mistral-7B-Instruct-v0.2
# do
# echo ${model}/rtn
# CUDA_VISIBLE_DEVICES=$device \
# python3 eval_042/evaluation.py --model_name ${dir}${model}_rtn/$model-autoround-w4g128-gpu \
# --trust_remote_code \
# --eval_bs 16 --tasks lambada_openai,hellaswag,piqa,winogrande,truthfulqa_mc1,openbookqa,boolq,arc_easy,arc_challenge,mmlu,gsm8k \
# 2>&1| tee -a /data4/zww/test_faster/rounding_${model}_rtn.txt
# echo ${model}/rtn
# done
model_path='./tmp_autoround'
model=Phi-3-vision-128k-instruct

CUDA_VISIBLE_DEVICES=$device python3 eval_042/evaluation.py \
--model_name ${model_path}/${model} \
--trust_remote_code \
--eval_bs 16
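For a Python-level view of what `eval_042/evaluation.py` ends up doing when driven by `run_eval.sh`, a rough sketch is below (assuming lm-eval 0.4.x is installed and the quantized model was saved under `./tmp_autoround`); the actual script adds extra handling, such as autocast and argument parsing, that is omitted here.

```python
# Rough sketch of the evaluation flow behind run_eval.sh (assumes lm-eval 0.4.x).
from auto_round.auto_quantizer import AutoHfQuantizer  # noqa: F401, registers the AutoRound loader
from lm_eval import simple_evaluate
from lm_eval.utils import make_table

model_path = "./tmp_autoround/Phi-3-vision-128k-instruct"  # assumed output location
model_args = f"pretrained={model_path},trust_remote_code=True,dtype=bfloat16"

result = simple_evaluate(
    model="hf",
    model_args=model_args,
    tasks=["lambada_openai", "mmlu"],  # any lm-eval task names
    device="cuda:0",
    batch_size=16,
)
print(make_table(result))
```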
66 changes: 59 additions & 7 deletions examples/multimodal-modeling/Qwen-VL/README.md
@@ -100,17 +100,68 @@ Include the flag `--adam`. Note that AdamW is less effective than sign gradient

- **Running on Intel Gaudi2**
```bash
bash run_autoround_on_gaudi.sh
bash run_autoround.sh
```

## 3. Run Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
from transformers import set_seed
set_seed(1234)
from auto_round.auto_quantizer import AutoHfQuantizer
quantized_model_path = "./tmp_autoround"
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path, trust_remote_code=True)
# use bf16
model = AutoModelForCausalLM.from_pretrained(quantized_model_path, device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained(quantized_model_path, device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained(quantized_model_path, device_map="cpu", trust_remote_code=True).eval()
# use cuda device
# model = AutoModelForCausalLM.from_pretrained(quantized_model_path, device_map="cuda", trust_remote_code=True).eval()
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},
    {'text': 'Generate the caption in English with grounding:'},
])
inputs = tokenizer(query, return_tensors='pt')
inputs = inputs.to(model.device)
with torch.cuda.amp.autocast():
    pred = model.generate(**inputs)
response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False)
print(response)
# <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>Generate the caption in English with grounding:<ref> Woman</ref><box>(451,379),(731,806)</box> and<ref> her dog</ref><box>(219,424),(576,896)</box> playing on the beach<|endoftext|>
image = tokenizer.draw_bbox_on_latest_picture(response)
if image:
image.save('2.jpg')
else:
print("no box")

```


## 4. Results
We use the [COCO 2017](https://cocodataset.org/) and [LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) datasets for quantization calibration and the TextVQA dataset for evaluation. The accuracy loss stays within 1% whether or not the visual component is quantized. The results for Qwen-VL are as follows:
| Model | Config | Precision | Hyperparameter | Accuracy% | Relative drop |
| :----: | :----: | :----: | :----: | :----: | :----: |
| Qwen/Qwen-VL | - | FP16 | - | 63.94 | - |
| Qwen/Qwen-VL | W4G128 | FP16 | with vision | 63.68 | -0.41% |
| Qwen/Qwen-VL | W4G128 | FP16 | w/o vision | 63.73 | -0.33% |
We use the [COCO 2017](https://cocodataset.org/) and [LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) datasets for quantization calibration, and lm-eval tasks together with TextVQA and ScienceVQA for evaluation. Please follow the [recipe](./run_autoround.sh) and [evaluation script](./run_eval.sh) to reproduce the results. The results for Qwen-VL are as follows:
| Metric | bf16 | INT4 |
|----------------|--------|--------|
| avg | 0.5628 | 0.5589 |
| paper-avg | 0.5603 | 0.5611 |
| mmlu | 0.4828 | 0.4639 |
| lambada_openai | 0.6782 | 0.6664 |
| hellaswag | 0.5593 | 0.5487 |
| winogrande | 0.6827 | 0.6875 |
| piqa | 0.7786 | 0.7748 |
| truthfulqa_mc1 | 0.2876 | 0.2901 |
| openbookqa | 0.2880 | 0.2940 |
| boolq | 0.7012 | 0.7318 |
| arc_easy | 0.7201 | 0.7327 |
| arc_challenge | 0.4249 | 0.4206 |
| cmmlu | 0.4798 | 0.4618 |
| ceval | 0.4814 | 0.4569 |
| textVQA | 0.6402 | 0.6379 |
| scienceVQA | 0.6748 | 0.6574 |
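The `avg` row above is the plain mean of the individual task scores (the `paper-avg` row presumably averages a different subset of tasks and is not reproduced here). A small sketch for recomputing the average and the overall relative drop from the numbers in the table:

```python
# Recompute the "avg" row and the overall relative drop from the table above.
bf16 = {
    "mmlu": 0.4828, "lambada_openai": 0.6782, "hellaswag": 0.5593,
    "winogrande": 0.6827, "piqa": 0.7786, "truthfulqa_mc1": 0.2876,
    "openbookqa": 0.2880, "boolq": 0.7012, "arc_easy": 0.7201,
    "arc_challenge": 0.4249, "cmmlu": 0.4798, "ceval": 0.4814,
    "textVQA": 0.6402, "scienceVQA": 0.6748,
}
int4 = {
    "mmlu": 0.4639, "lambada_openai": 0.6664, "hellaswag": 0.5487,
    "winogrande": 0.6875, "piqa": 0.7748, "truthfulqa_mc1": 0.2901,
    "openbookqa": 0.2940, "boolq": 0.7318, "arc_easy": 0.7327,
    "arc_challenge": 0.4206, "cmmlu": 0.4618, "ceval": 0.4569,
    "textVQA": 0.6379, "scienceVQA": 0.6574,
}

avg_bf16 = sum(bf16.values()) / len(bf16)  # ~0.5628, matching the table
avg_int4 = sum(int4.values()) / len(int4)  # ~0.5589, matching the table
print(f"avg bf16={avg_bf16:.4f}  int4={avg_int4:.4f}  "
      f"relative drop={(avg_int4 - avg_bf16) / avg_bf16:.2%}")
```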



## 5. Environment
@@ -136,3 +187,4 @@ If you find SignRound useful for your research, please cite our paper:




Empty file.