Prompt lookup decoding is an assisted-generation technique in which the draft model is replaced by simple string matching against the prompt to generate candidate token sequences. This method is highly effective for input-grounded generation (summarization, document QA, multi-turn chat, code editing), where there is high n-gram overlap between the LLM input (prompt) and the LLM output: entity names, phrases, or code chunks that the LLM copies directly from the input while generating the output. Prompt lookup exploits this pattern to speed up autoregressive decoding, yielding significant speedups with no effect on output quality.
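The lookup step can be sketched in a few lines of Python. This is an illustrative pseudo-implementation, not the sample's actual code: it searches the token sequence for an earlier occurrence of its own tail n-gram and proposes the tokens that followed it as draft candidates for the model to verify.

```python
# Sketch of prompt-lookup candidate generation (illustrative only).
# Look for the longest recent n-gram that already occurred earlier in the
# sequence; the tokens that followed that earlier occurrence become draft
# candidates, which the main model then verifies in a single forward pass.

def find_candidate_tokens(input_ids, max_ngram_size=3, num_pred_tokens=5):
    """Return up to num_pred_tokens draft tokens copied from input_ids."""
    for ngram_size in range(max_ngram_size, 0, -1):
        tail = input_ids[-ngram_size:]  # n-gram ending the current sequence
        # scan earlier positions for a matching n-gram
        for start in range(len(input_ids) - ngram_size):
            if input_ids[start:start + ngram_size] == tail:
                follow = input_ids[start + ngram_size:
                                   start + ngram_size + num_pred_tokens]
                if follow:
                    return follow  # candidate continuation from the prompt
    return []  # no match: fall back to ordinary one-token decoding

# Toy example: the last token (10) already appeared at position 0,
# so the tokens after that occurrence are proposed as candidates.
print(find_candidate_tokens([10, 20, 30, 40, 10]))  # [20, 30, 40, 10]
```

If the model accepts several of the proposed tokens, multiple output tokens are produced per forward pass instead of one, which is where the speedup comes from.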
This example showcases inference of text-generation Large Language Models (LLMs): `chatglm`, `LLaMA`, `Qwen`, and other models with the same signature. The application doesn't have many configuration options, to encourage the reader to explore and modify the source code. Loading `openvino_tokenizers` into `ov::Core` enables tokenization. Run `optimum-cli` to generate IRs for the samples. There is also a Jupyter notebook which provides an example of an LLM-powered chatbot in Python.
The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.
It's not required to install `../../requirements.txt` for deployment if the model has already been exported.
```sh
source <INSTALL_DIR>/setupvars.sh
pip install --upgrade-strategy eager -r ../../requirements.txt
optimum-cli export openvino --trust-remote-code --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 TinyLlama-1.1B-Chat-v1.0
```

```sh
prompt_lookup_decoding_lm ./TinyLlama-1.1B-Chat-v1.0/ "return 0;"
```
Discrete GPUs (dGPUs) usually provide better performance compared to CPUs. It is recommended to run larger models on a dGPU with 32GB+ RAM. For example, the model `meta-llama/Llama-2-13b-chat-hf` can benefit from being run on a dGPU. Modify the source code to change the device for inference to the GPU.
See https://github.com/openvinotoolkit/openvino.genai/blob/master/src/README.md#supported-models for the list of supported models.
Example error:

```
UnicodeEncodeError: 'charmap' codec can't encode character '\u25aa' in position 0: character maps to <undefined>
```
If you encounter the error described in the example while the sample prints output to the Windows console, it is likely caused by the default Windows encoding not supporting certain Unicode characters. To resolve this:
- Enable Unicode characters for Windows cmd: open `Region` settings from `Control panel`. `Administrative` -> `Change system locale` -> `Beta: Use Unicode UTF-8 for worldwide language support` -> `OK`. Reboot.
- Enable UTF-8 mode by setting the environment variable `PYTHONIOENCODING="utf8"`.
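As a quick illustration of the underlying issue, the character from the example error can be encoded in UTF-8 but not in `cp1252`, a common default code page for Windows consoles (the exact default depends on the system locale):

```python
# Illustration only: '\u25aa' (the character from the example error) has no
# mapping in cp1252, a typical default Windows console code page, but it
# encodes fine in UTF-8, which is why switching the console to UTF-8 helps.
ch = "\u25aa"  # BLACK SMALL SQUARE

try:
    ch.encode("cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False

print(representable)       # False: cp1252 cannot represent U+25AA
print(ch.encode("utf-8"))  # b'\xe2\x96\xaa': UTF-8 can
```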