RuntimeError: PyTorch is not linked with support for xpu devices #9768

Open
openvino-book opened this issue Dec 23, 2023 · 18 comments

@openvino-book
RuntimeError: PyTorch is not linked with support for xpu devices

I installed the BigDL GPU version on Windows 11 following https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html

When executing the code below (the model is chatglm3-6b):

import torch
import time
import argparse
import numpy as np

from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer

# you could tune the prompt based on your own model,
# here the prompt tuning refers to https://github.com/THUDM/ChatGLM3/blob/main/PROMPT.md
CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for ChatGLM3 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="d:/chatglm3-6b",
                        help='The huggingface repo id for the ChatGLM3 model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="AI是什么?",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

    # Load the model in 4-bit,
    # which converts the relevant layers in the model into INT4 format
    model = AutoModel.from_pretrained(model_path,
                                      load_in_4bit=True,
                                      trust_remote_code=True)
    
    model.save_low_bit("bigdl_chatglm3-6b-q4_0.bin")
    #run the optimized model on Intel GPU
    model = model.to('xpu')

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)
    
    # Generate predicted tokens
    with torch.inference_mode():
        prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt=args.prompt)
        input_ids = tokenizer.encode(prompt, return_tensors="pt")
        st = time.time()
        # if your selected model is capable of utilizing previous key/value attentions
        # to enhance decoding speed, but has `"use_cache": false` in its model config,
        # it is important to set `use_cache=True` explicitly in the `generate` function
        # to obtain optimal performance with BigDL-LLM INT4 optimizations
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)
        end = time.time()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f'Inference time: {end-st} s')
        print('-'*20, 'Prompt', '-'*20)
        print(prompt)
        print('-'*20, 'Output', '-'*20)
        print(output_str)

the following error occurs:
RuntimeError: PyTorch is not linked with support for xpu devices

Does BigDL support running ChatGLM3-6B on Arc GPU right now?

@jason-dai
Contributor

RuntimeError: PyTorch is not linked with support for xpu devices

It seems the installed PyTorch does not support XPU. Can you share the specific PyTorch version installed, and check whether it works with the Arc GPU (even without using BigDL)?

Does BigDL support running ChatGLM3-6B on Arc GPU right now?

Yes, it supports ChatGLM3-6B on Arc GPU
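
A quick way to check is something like the following (a minimal sketch on my side, not from the install guide; the comments describe what the XPU builds usually look like):

import torch
import intel_extension_for_pytorch as ipex

print(torch.__version__)          # the XPU build of PyTorch usually carries an extra suffix, e.g. "a0+cxx11.abi"
print(ipex.__version__)           # the GPU build of IPEX usually ends with "+xpu"
print(torch.xpu.is_available())   # True means the Arc GPU is visible to PyTorch/IPEX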

@MeouSker77
Contributor

MeouSker77 commented Dec 25, 2023

Add import intel_extension_for_pytorch as ipex before .to('xpu'). Although we don't use ipex directly, it is still needed to run on the GPU.

@MeouSker77
Contributor

and also add input_ids = input_ids.to('xpu')
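
Putting the two suggestions together, the minimal change to the original script is roughly (a sketch; the rest of the script stays the same):

import intel_extension_for_pytorch as ipex  # required for XPU support even though it is not called directly

model = model.to('xpu')
input_ids = input_ids.to('xpu')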

@openvino-book
Author

and also add input_ids = input_ids.to('xpu')

Thank you, @MeouSker77, it works and solves the RuntimeError: PyTorch is not linked with support for xpu devices.

The modified code is below:

import time
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import intel_extension_for_pytorch as ipex

CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"

# Specify the local path of chatglm3-6b
model_path = "d:/chatglm3-6b"

# Load the ChatGLM3-6B model with INT4 quantization
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  trust_remote_code=True)
# run the optimized model on Intel GPU
model = model.to('xpu')

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)
# Build the ChatGLM3-format prompt
prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt="What is Intel?")

# Encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids = input_ids.to('xpu')
st = time.time()
# Run inference and generate tokens
output = model.generate(input_ids,max_new_tokens=32)
end = time.time()
# Decode and print the generated tokens
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Inference time: {end-st} s')
print('-'*20, 'Prompt', '-'*20)
print(prompt)
print('-'*20, 'Output', '-'*20)
print(output_str)

However, another runtime error occurs:
RuntimeError: The number of work-items in each dimension of a work-group cannot exceed {512, 512, 512} for this device -54 (PI_ERROR_INVALID_WORK_GROUP_SIZE)

(llm_gpu) D:>python chatglm3_infer_gpu.py
C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torchvision\io\image.py:13: UserWarning: Failed to load image Python extension: 'Could not find module 'C:\Users\OV\anaconda3\envs\llm_gpu\Lib\site-packages\torchvision\image.pyd' (or one of its dependencies). Try using the full path with constructor syntax.'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 7/7 [00:04<00:00, 1.57it/s]
2023-12-26 09:55:56,907 - INFO - Converting the current model to sym_int4 format......
Traceback (most recent call last):
File "D:\chatglm3_infer_gpu.py", line 29, in
output = model.generate(input_ids,max_new_tokens=32)
File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\transformers\generation\utils.py", line 1538, in generate
return self.greedy_search(
File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\transformers\generation\utils.py", line 2362, in greedy_search
outputs = self(
File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\OV/.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chatglm.py", line 937, in forward
transformer_outputs = self.transformer(
File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 152, in chatglm2_model_forward
hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\OV/.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chatglm.py", line 640, in forward
layer_ret = layer(
File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\OV/.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chatglm.py", line 542, in forward
layernorm_output = self.input_layernorm(hidden_states)
File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 83, in chatglm_rms_norm_forward
result = linear_q4_0.fused_rms_norm(hidden_states,
RuntimeError: The number of work-items in each dimension of a work-group cannot exceed {512, 512, 512} for this device -54 (PI_ERROR_INVALID_WORK_GROUP_SIZE)

Could you tell me how to solve this so that ChatGLM3-6B can run on the A770? Thank you very much in advance!

@MeouSker77
Contributor

Can you try this code to get the device name of 'xpu:0'?

name = torch.xpu.get_device_name(0)
print(name)

I'm afraid the default xpu device is not A770
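
If it helps, a small sketch to list every XPU device that IPEX exposes (assuming intel_extension_for_pytorch has already been imported):

import torch
import intel_extension_for_pytorch as ipex

# Each GPU visible to the XPU runtime gets an index, e.g. the iGPU and the Arc card
for i in range(torch.xpu.device_count()):
    print(i, torch.xpu.get_device_name(i))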

@openvino-book
Author

Can you try this code to get the device name of 'xpu:0'?

name = torch.xpu.get_device_name(0)
print(name)

I'm afraid the default xpu device is not A770

When I run the code, I get an AttributeError:
module 'torch' has no attribute 'xpu'

@jason-dai
Contributor

Add import intel_extension_for_pytorch as ipex?

@openvino-book
Author

print(name)

(screenshot of the print(name) output)

@openvino-book
Author

How do I set the device to Iris Xe or Arc A770?

@MeouSker77
Contributor

How do I set the device to Iris Xe or Arc A770?

Change all to('xpu') to to('xpu:1') to use the A770.

@openvino-book
Author

After changing to 'xpu:1', I ran the code below:

import time
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import intel_extension_for_pytorch as ipex

CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"

# Specify the local path of chatglm3-6b
model_path = "d:/chatglm3-6b"

# Load the ChatGLM3-6B model with INT4 quantization
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  trust_remote_code=True)
# run the optimized model on Intel GPU
model = model.to('xpu:1')

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)
# Build the ChatGLM3-format prompt
prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt="What is Intel?")

# Encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids = input_ids.to('xpu:1')
st = time.time()
# Run inference and generate tokens
output = model.generate(input_ids,max_new_tokens=32)
end = time.time()
# Decode and print the generated tokens
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Inference time: {end-st} s')
print('-'*20, 'Prompt', '-'*20)
print(prompt)
print('-'*20, 'Output', '-'*20)
print(output_str)

However, a new error occurred:
RuntimeError: could not create a primitive

(bigdl) D:>python chatglm3_infer_gpu.py
C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torchvision\io\image.py:13: UserWarning: Failed to load image Python extension: 'Could not find module 'C:\Users\OV\anaconda3\envs\bigdl\Lib\site-packages\torchvision\image.pyd' (or one of its dependencies). Try using the full path with constructor syntax.'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
Loading checkpoint shards: 100%|███████████████████████████████| 7/7 [00:04<00:00, 1.53it/s]
2023-12-27 10:32:52,956 - INFO - Converting the current model to sym_int4 format......
onednn_verbose,info,oneDNN v3.3.0 (commit 887fb044ccd6308ed1780a3863c2c6f5772c94b3)
onednn_verbose,info,cpu,runtime:threadpool,nthr:10
onednn_verbose,info,cpu,isa:Intel AVX2 with Intel DL Boost
onednn_verbose,info,gpu,runtime:DPC++
onednn_verbose,info,gpu,engine,0,backend:Level Zero,name:Intel(R) Iris(R) Xe Graphics,driver_version:1.3.26957,binary_kernels:enabled
onednn_verbose,info,gpu,engine,1,backend:Level Zero,name:Intel(R) Arc(TM) A770M Graphics,driver_version:1.3.26957,binary_kernels:enabled
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,common,error,level_zero,errcode 1879048196
Traceback (most recent call last):
File "D:\chatglm3_infer_gpu.py", line 29, in
output = model.generate(input_ids,max_new_tokens=32)
File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\transformers\generation\utils.py", line 1538, in generate
return self.greedy_search(
File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\transformers\generation\utils.py", line 2362, in greedy_search
outputs = self(
File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\OV/.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chatglm.py", line 937, in forward
transformer_outputs = self.transformer(
File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 152, in chatglm2_model_forward
hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\OV/.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chatglm.py", line 640, in forward
layer_ret = layer(
File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\OV/.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chatglm.py", line 544, in forward
attention_output, kv_cache = self.self_attention(
File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 353, in chatglm2_attention_forward_8eb45c
context_layer = self.core_attention(query_layer, key_layer, value_layer, attention_mask)
File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 369, in core_attn_forward_8eb45c
context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer,
RuntimeError: could not create a primitive

@MeouSker77
Contributor

Sorry, on our Windows A770 machines the A770 is always the default xpu device, so we cannot reproduce this error.

You can change 'xpu:1' back to 'xpu' and add optimize_model=False in from_pretrained to run it on the iGPU.

Or you can change 'xpu:1' back to 'xpu' and run set ONEAPI_DEVICE_SELECTOR=level_zero:1 before running this example; setting ONEAPI_DEVICE_SELECTOR=level_zero:1 should make the A770 the default device.
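
As an alternative to setting the variable in the shell, a sketch of doing it from Python (my own assumption, not from the BigDL docs) is to set it before intel_extension_for_pytorch is imported, since the Level Zero runtime reads it when it first enumerates devices:

import os
# Must be set before importing intel_extension_for_pytorch / touching any xpu device
os.environ["ONEAPI_DEVICE_SELECTOR"] = "level_zero:1"

import torch
import intel_extension_for_pytorch as ipex

print(torch.xpu.get_device_name(0))  # expected to show the Arc A770M if the selector took effect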

@JinBridger
Contributor

Sorry, on our Windows A770 machines the A770 is always the default xpu device, so we cannot reproduce this error.

You can change 'xpu:1' back to 'xpu' and add optimize_model=False in from_pretrained to run it on the iGPU.

Or you can change 'xpu:1' back to 'xpu' and run set ONEAPI_DEVICE_SELECTOR=level_zero:1 before running this example; setting ONEAPI_DEVICE_SELECTOR=level_zero:1 should make the A770 the default device.

Maybe we should test on a laptop, because the A770M is a laptop GPU. I'll see if I can reproduce this error on a laptop.

@openvino-book
Author

Sorry, on our Windows A770 machines the A770 is always the default xpu device, so we cannot reproduce this error.
You can change 'xpu:1' back to 'xpu' and add optimize_model=False in from_pretrained to run it on the iGPU.
Or you can change 'xpu:1' back to 'xpu' and run set ONEAPI_DEVICE_SELECTOR=level_zero:1 before running this example; setting ONEAPI_DEVICE_SELECTOR=level_zero:1 should make the A770 the default device.

Maybe we should test on a laptop, because the A770M is a laptop GPU. I'll see if I can reproduce this error on a laptop.

My machine is a NUC12 Serpent Canyon (蝰蛇峡谷): i7-12700H + Arc A770M.

I changed 'xpu:1' back to 'xpu' and set ONEAPI_DEVICE_SELECTOR=level_zero:1 -- it works!!! Thank you very much!!!

Running the code:

import time
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import intel_extension_for_pytorch as ipex
import torch

CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"

# Specify the local path of chatglm3-6b
model_path = "d:/chatglm3-6b"

# Load the ChatGLM3-6B model with INT4 quantization
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  trust_remote_code=True)
# run the optimized model on Intel GPU
model = model.to('xpu')

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)
# Build the ChatGLM3-format prompt
prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt="What is Intel?")

# Encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids = input_ids.to('xpu')
st = time.time()
# Run inference and generate tokens
output = model.generate(input_ids,max_new_tokens=32)
end = time.time()
# Decode and print the generated tokens
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Inference time: {end-st} s')
print('-'*20, 'Prompt', '-'*20)
print(prompt)
print('-'*20, 'Output', '-'*20)
print(output_str)


@openvino-book
Author

@JinBridger Could I ask one more question? I want to run chatglm3-6b on the A770 with Streamlit.
model = model.to("xpu") can be added in get_model(),
but how do I add input_ids = input_ids.to('xpu')?

The complete code is attached below:

import streamlit as st
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import intel_extension_for_pytorch as ipex
import torch

# Set the page title, icon and layout
st.set_page_config(
    page_title="ChatGLM3-6B+BigDL-LLM演示",
    page_icon=":robot:",
    layout="wide"
)
# Specify the local path of chatglm3-6b
model_path = "d:/chatglm3-6b"

@st.cache_resource
def get_model():
    # Load the ChatGLM3-6B model with INT4 quantization
    model = AutoModel.from_pretrained(model_path,
                                    load_in_4bit=True,
                                    trust_remote_code=True)
    model = model.to('xpu')
    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                            trust_remote_code=True)
    return tokenizer, model

# Load the ChatGLM3 model and tokenizer
tokenizer, model = get_model()

# Initialize chat history and past key values
if "history" not in st.session_state:
    st.session_state.history = []
if "past_key_values" not in st.session_state:
    st.session_state.past_key_values = None

# Set max_length, top_p and temperature
max_length = st.sidebar.slider("max_length", 0, 32768, 8192, step=1)
top_p = st.sidebar.slider("top_p", 0.0, 1.0, 0.8, step=0.01)
temperature = st.sidebar.slider("temperature", 0.0, 1.0, 0.6, step=0.01)

# Button to clear the chat history
buttonClean = st.sidebar.button("清理会话历史", key="clean")
if buttonClean:
    st.session_state.history = []
    st.session_state.past_key_values = None
    st.rerun()

# Render the chat history
for i, message in enumerate(st.session_state.history):
    if message["role"] == "user":
        with st.chat_message(name="user", avatar="user"):
            st.markdown(message["content"])
    else:
        with st.chat_message(name="assistant", avatar="assistant"):
            st.markdown(message["content"])

# Input and output placeholders
with st.chat_message(name="user", avatar="user"):
    input_placeholder = st.empty()
with st.chat_message(name="assistant", avatar="assistant"):
    message_placeholder = st.empty()

# Get user input
prompt_text = st.chat_input("请输入您的问题")

# If the user entered something, generate a reply
if prompt_text:

    input_placeholder.markdown(prompt_text)
    history = st.session_state.history
    past_key_values = st.session_state.past_key_values
    for response, history, past_key_values in model.stream_chat(
        tokenizer,
        prompt_text,
        history,
        past_key_values=past_key_values,
        max_length=max_length,
        top_p=top_p,
        temperature=temperature,
        return_past_key_values=True,
    ):
        message_placeholder.markdown(response)

    # Update chat history and past key values
    st.session_state.history = history
    st.session_state.past_key_values = past_key_values

@MeouSker77
Contributor

Don't worry, the stream_chat API will move input tokens to the model's device automatically (here), so you just need to move the model to xpu.

@openvino-book
Author

Don't worry, the stream_chat API will move input tokens to the model's device automatically (here), so you just need to move the model to xpu.

Yes!! Thank you very much for the guidance! It works!!!

Tested sample code

import streamlit as st
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import intel_extension_for_pytorch as ipex
import torch

# Set the page title, icon and layout
st.set_page_config(
    page_title="ChatGLM3-6B+BigDL-LLM演示",
    page_icon=":robot:",
    layout="wide"
)
# Specify the local path of chatglm3-6b
model_path = "d:/chatglm3-6b"

@st.cache_resource
def get_model():
    # Load the ChatGLM3-6B model with INT4 quantization
    model = AutoModel.from_pretrained(model_path,
                                    load_in_4bit=True,
                                    trust_remote_code=True)
    model = model.to('xpu')
    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                            trust_remote_code=True)
    return tokenizer, model

# Load the ChatGLM3 model and tokenizer
tokenizer, model = get_model()

# Initialize chat history and past key values
if "history" not in st.session_state:
    st.session_state.history = []
if "past_key_values" not in st.session_state:
    st.session_state.past_key_values = None

# Set max_length, top_p and temperature
max_length = st.sidebar.slider("max_length", 0, 32768, 8192, step=1)
top_p = st.sidebar.slider("top_p", 0.0, 1.0, 0.8, step=0.01)
temperature = st.sidebar.slider("temperature", 0.0, 1.0, 0.6, step=0.01)

# Button to clear the chat history
buttonClean = st.sidebar.button("清理会话历史", key="clean")
if buttonClean:
    st.session_state.history = []
    st.session_state.past_key_values = None
    st.rerun()

# Render the chat history
for i, message in enumerate(st.session_state.history):
    if message["role"] == "user":
        with st.chat_message(name="user", avatar="user"):
            st.markdown(message["content"])
    else:
        with st.chat_message(name="assistant", avatar="assistant"):
            st.markdown(message["content"])

# Input and output placeholders
with st.chat_message(name="user", avatar="user"):
    input_placeholder = st.empty()
with st.chat_message(name="assistant", avatar="assistant"):
    message_placeholder = st.empty()

# Get user input
prompt_text = st.chat_input("请输入您的问题")

# If the user entered something, generate a reply
if prompt_text:

    input_placeholder.markdown(prompt_text)
    history = st.session_state.history
    past_key_values = st.session_state.past_key_values
    for response, history, past_key_values in model.stream_chat(
        tokenizer,
        prompt_text,
        history,
        past_key_values=past_key_values,
        max_length=max_length,
        top_p=top_p,
        temperature=temperature,
        return_past_key_values=True,
    ):
        message_placeholder.markdown(response)

    # Update chat history and past key values
    st.session_state.history = history
    st.session_state.past_key_values = past_key_values


@rnwang04 rnwang04 mentioned this issue Jan 3, 2024
@openvino-book
Author

openvino-book commented Jan 11, 2024
