In this tutorial, we will first present a list of examples to introduce the usage of `lmdeploy.pipeline`. Then, we will describe the pipeline API in detail.
An example using default parameters:
```python
from lmdeploy import pipeline

pipe = pipeline('internlm/internlm-chat-7b')
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)
```
An example showing how to set the tensor parallelism degree (`tp`):
```python
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(tp=2)
pipe = pipeline('internlm/internlm-chat-7b',
                backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)
```
An example of setting sampling parameters:
```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(tp=2)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('internlm/internlm-chat-7b',
                backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
                gen_config=gen_config)
print(response)
```
An example of OpenAI-format prompt input:
```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(tp=2)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('internlm/internlm-chat-7b',
                backend_config=backend_config)
prompts = [[{
    'role': 'user',
    'content': 'Hi, pls intro yourself'
}], [{
    'role': 'user',
    'content': 'Shanghai is'
}]]
response = pipe(prompts,
                gen_config=gen_config)
print(response)
```
Below is an example using the PyTorch backend. Please install triton first:

```shell
pip install 'triton>=2.1.0'
```
```python
from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

backend_config = PytorchEngineConfig(session_len=2024)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('internlm/internlm-chat-7b',
                backend_config=backend_config)
prompts = [[{
    'role': 'user',
    'content': 'Hi, pls intro yourself'
}], [{
    'role': 'user',
    'content': 'Shanghai is'
}]]
response = pipe(prompts, gen_config=gen_config)
print(response)
```
The `pipeline` function is a higher-level API designed for users to easily instantiate and use the `AsyncEngine`. Its initialization parameters are listed below:
Parameter | Type | Description | Default |
---|---|---|---|
model_path | str | Path to the model. Can be a path to a local directory storing a Turbomind model, or a model_id for models hosted on huggingface.co. | N/A |
model_name | Optional[str] | Name of the model when the model_path points to a Pytorch model on huggingface.co. | None |
backend_config | TurbomindEngineConfig \| PytorchEngineConfig \| None | Configuration object for the backend. It can be either TurbomindEngineConfig or PytorchEngineConfig, depending on the backend chosen. | None (the TurboMind backend is used by default) |
chat_template_config | Optional[ChatTemplateConfig] | Configuration for chat template. | None |
log_level | str | The level of logging. | 'ERROR' |
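As a minimal sketch of these parameters used together (the chat template name `'internlm'` is an assumption for illustration, not a prescribed value):

```python
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig

# Illustrative only: pick the backend, chat template and log level explicitly.
# 'internlm' as the chat template name is an assumption for this example.
pipe = pipeline('internlm/internlm-chat-7b',
                backend_config=TurbomindEngineConfig(tp=1),
                chat_template_config=ChatTemplateConfig(model_name='internlm'),
                log_level='INFO')
```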
The parameters for invoking the pipeline are listed below:
Parameter Name | Data Type | Default Value | Description |
---|---|---|---|
prompts | List[str] | None | A batch of prompts. |
gen_config | GenerationConfig or None | None | An instance of GenerationConfig. Default is None. |
do_preprocess | bool | True | Whether to pre-process the messages. Default is True, which means chat_template will be applied. |
request_output_len | int | 512 | The number of output tokens. This parameter will be deprecated. Please use the gen_config parameter instead. |
top_k | int | 40 | The number of the highest probability vocabulary tokens to keep for top-k-filtering. This parameter will be deprecated. Please use the gen_config parameter instead. |
top_p | float | 0.8 | If set to a float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation. This parameter will be deprecated. Please use the gen_config parameter instead. |
temperature | float | 0.8 | Used to modulate the next token probability. This parameter will be deprecated. Please use the gen_config parameter instead. |
repetition_penalty | float | 1.0 | The parameter for repetition penalty. 1.0 means no penalty. This parameter will be deprecated. Please use the gen_config parameter instead. |
ignore_eos | bool | False | Indicator for ignoring end-of-string (eos). This parameter will be deprecated. Please use the gen_config parameter instead. |
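A minimal sketch of an invocation exercising the parameters above, assuming you want to skip the chat template and feed the raw text to the model:

```python
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm-chat-7b')
# do_preprocess=False skips the chat template, so the raw string is
# tokenized and completed as-is.
response = pipe(['Shanghai is'],
                gen_config=GenerationConfig(max_new_tokens=64),
                do_preprocess=False)
print(response)
```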
This class provides the configuration parameters for the TurboMind backend.
Parameter | Type | Description | Default |
---|---|---|---|
model_name | str, Optional | The chat template name of the deployed model | None |
model_format | str, Optional | The layout of the deployed model. Can be one of the following values: hf, llama, awq. | None |
tp | int | The number of GPU cards used in tensor parallelism. | 1 |
session_len | int, Optional | The maximum session length of a sequence. | None |
max_batch_size | int | The maximum batch size during inference. | 128 |
cache_max_entry_count | float | The percentage of GPU memory occupied by the k/v cache. | 0.5 |
quant_policy | int | Set it to 4 when k/v is quantized into 8 bits. | 0 |
rope_scaling_factor | float | Scaling factor used for dynamic NTK. TurboMind follows the implementation of `LlamaAttention` in transformers. | 0.0 |
use_logn_attn | bool | Whether or not to use logarithmic attention. | False |
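A sketch using several of these fields together; the specific values are illustrative, not recommendations:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Illustrative values: 2-way tensor parallelism, a 4096-token session,
# 20% of GPU memory reserved for the k/v cache, and 8-bit k/v
# quantization enabled via quant_policy=4.
backend_config = TurbomindEngineConfig(tp=2,
                                       session_len=4096,
                                       max_batch_size=64,
                                       cache_max_entry_count=0.2,
                                       quant_policy=4)
pipe = pipeline('internlm/internlm-chat-7b', backend_config=backend_config)
```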
This class provides the configuration parameters for the PyTorch backend.
Parameter | Type | Description | Default |
---|---|---|---|
model_name | str | The chat template name of the deployed model | '' |
tp | int | Tensor Parallelism. | 1 |
session_len | int | Maximum session length. | None |
max_batch_size | int | Maximum batch size. | 128 |
eviction_type | str | Action to perform when kv cache is full. Options are ['recompute', 'copy']. | 'recompute' |
prefill_interval | int | Interval to perform prefill. | 16 |
block_size | int | Paging cache block size. | 64 |
num_cpu_blocks | int | Number of CPU blocks. If the number is 0, the cache will be allocated according to the current environment. | 0 |
num_gpu_blocks | int | Number of GPU blocks. If the number is 0, the cache will be allocated according to the current environment. | 0 |
adapters | dict | The path configs to LoRA adapters. | None |
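A sketch of constructing the PyTorch backend config; the values are illustrative and every field used is documented in the table above:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# Illustrative values only.
backend_config = PytorchEngineConfig(tp=1,
                                     session_len=4096,
                                     max_batch_size=64,
                                     eviction_type='recompute',
                                     block_size=64)
pipe = pipeline('internlm/internlm-chat-7b', backend_config=backend_config)
```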
This class contains the generation parameters used by inference engines.
Parameter | Type | Description | Default |
---|---|---|---|
n | int | Number of chat completion choices to generate for each input message. Currently, only 1 is supported. | 1 |
max_new_tokens | int | Maximum number of tokens that can be generated in chat completion. | 512 |
top_p | float | Nucleus sampling, where the model considers the tokens with top_p probability mass. | 1.0 |
top_k | int | The model considers the top_k tokens with the highest probability. | 1 |
temperature | float | Sampling temperature. | 0.8 |
repetition_penalty | float | Penalty to prevent the model from generating repeated words or phrases. A value larger than 1 discourages repetition. | 1.0 |
ignore_eos | bool | Indicator to ignore the eos_token_id or not. | False |
random_seed | int | Seed used when sampling a token. | None |
stop_words | List[str] | Words that stop generating further tokens. | None |
bad_words | List[str] | Words that the engine will never generate. | None |
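A sketch combining several of these generation parameters; the stop word `'<eoa>'` below is only an illustrative placeholder, not a recommendation for any particular model:

```python
from lmdeploy import pipeline, GenerationConfig

# Illustrative values; '<eoa>' is a placeholder stop word.
gen_config = GenerationConfig(max_new_tokens=256,
                              top_p=0.8,
                              top_k=40,
                              temperature=0.6,
                              repetition_penalty=1.02,
                              random_seed=1,
                              stop_words=['<eoa>'])
pipe = pipeline('internlm/internlm-chat-7b')
response = pipe(['Shanghai is'], gen_config=gen_config)
print(response)
```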
- RuntimeError: context has already been set. If you hit this with tp>1 on the PyTorch backend, please make sure your python script guards its entry point:

```python
if __name__ == '__main__':
```

Generally, in the context of multi-threading or multi-processing, it might be necessary to ensure that initialization code is executed only once. Here, `if __name__ == '__main__':` helps to ensure that the initialization code runs only in the main program, and is not repeated in each newly created process or thread.
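A minimal sketch of such a guarded script, assuming a 2-GPU PyTorch backend:

```python
from lmdeploy import pipeline, PytorchEngineConfig

if __name__ == '__main__':
    # Keep engine creation and inference inside the guard so that the
    # worker processes spawned for tp>1 do not re-execute them.
    pipe = pipeline('internlm/internlm-chat-7b',
                    backend_config=PytorchEngineConfig(tp=2))
    response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
    print(response)
```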