Inference Pipeline

In this tutorial, we will first present a list of examples to introduce the usage of lmdeploy.pipeline.

Then, we will describe the pipeline API in detail.

Usage

An example using default parameters:

from lmdeploy import pipeline

pipe = pipeline('internlm/internlm-chat-7b')
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)

An example showing how to set the number of GPUs used for tensor parallelism (tp):

from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(tp=2)
pipe = pipeline('internlm/internlm-chat-7b',
                backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)

An example for setting sampling parameters:

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(tp=2)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('internlm/internlm-chat-7b',
                backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
                gen_config=gen_config)
print(response)

An example with prompts in the OpenAI message format:

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(tp=2)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('internlm/internlm-chat-7b',
                backend_config=backend_config)
prompts = [[{
    'role': 'user',
    'content': 'Hi, pls intro yourself'
}], [{
    'role': 'user',
    'content': 'Shanghai is'
}]]
response = pipe(prompts,
                gen_config=gen_config)
print(response)

Below is an example using the PyTorch backend. Please install triton first:

pip install "triton>=2.1.0"

from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

backend_config = PytorchEngineConfig(session_len=2024)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('internlm/internlm-chat-7b',
                backend_config=backend_config)
prompts = [[{
    'role': 'user',
    'content': 'Hi, pls intro yourself'
}], [{
    'role': 'user',
    'content': 'Shanghai is'
}]]
response = pipe(prompts, gen_config=gen_config)
print(response)

pipeline API

The pipeline function is a higher-level API designed for users to easily instantiate and use the AsyncEngine.

Init parameters:

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| model_path | str | Path to the model. It can be a path to a local directory storing a TurboMind model, or a model_id of a model hosted on huggingface.co. | N/A |
| model_name | Optional[str] | Name of the model when model_path points to a PyTorch model on huggingface.co. | None |
| backend_config | TurbomindEngineConfig \| PytorchEngineConfig \| None | Configuration object for the backend. It can be either TurbomindEngineConfig or PytorchEngineConfig, depending on the chosen backend. | None, which runs the TurboMind backend by default |
| chat_template_config | Optional[ChatTemplateConfig] | Configuration for the chat template. | None |
| log_level | str | The logging level. | 'ERROR' |
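
For example, a pipeline can be created with an explicit chat template and a more verbose log level. This is a minimal sketch; it assumes that ChatTemplateConfig accepts the chat template name via its model_name argument, which may differ across lmdeploy versions.

from lmdeploy import pipeline, ChatTemplateConfig

# Assumption: ChatTemplateConfig takes the chat template name via `model_name`.
chat_template_config = ChatTemplateConfig(model_name='internlm-chat-7b')
pipe = pipeline('internlm/internlm-chat-7b',
                chat_template_config=chat_template_config,
                log_level='INFO')
response = pipe(['Hi, pls intro yourself'])
print(response)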

Invocation

| Parameter Name | Data Type | Default Value | Description |
| --- | --- | --- | --- |
| prompts | List[str] | None | A batch of prompts. |
| gen_config | GenerationConfig or None | None | An instance of GenerationConfig. |
| do_preprocess | bool | True | Whether to pre-process the messages. The default is True, which means the chat_template will be applied. |
| request_output_len | int | 512 | The number of output tokens. This parameter will be deprecated; please use gen_config instead. |
| top_k | int | 40 | The number of the highest-probability vocabulary tokens to keep for top-k filtering. This parameter will be deprecated; please use gen_config instead. |
| top_p | float | 0.8 | If set to a float < 1, only the smallest set of most probable tokens whose probabilities add up to top_p or higher are kept for generation. This parameter will be deprecated; please use gen_config instead. |
| temperature | float | 0.8 | Used to modulate the next-token probabilities. This parameter will be deprecated; please use gen_config instead. |
| repetition_penalty | float | 1.0 | The parameter for repetition penalty; 1.0 means no penalty. This parameter will be deprecated; please use gen_config instead. |
| ignore_eos | bool | False | Indicator for ignoring the end-of-sequence (eos) token. This parameter will be deprecated; please use gen_config instead. |
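
The deprecated per-call sampling arguments still work, but new code should bundle them into a GenerationConfig instead. A minimal sketch using only the parameters listed above:

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm-chat-7b')
# Prefer gen_config over the deprecated per-call sampling arguments.
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
# do_preprocess=True (the default) applies the chat template to the prompts.
response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
                gen_config=gen_config,
                do_preprocess=True)
print(response)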

TurbomindEngineConfig

Description

This class provides the configuration parameters for the TurboMind backend.

Arguments

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| model_name | str, Optional | The chat template name of the deployed model. | None |
| model_format | str, Optional | The layout of the deployed model. It can be one of the following values: hf, llama, awq. | None |
| tp | int | The number of GPUs used in tensor parallelism. | 1 |
| session_len | int, Optional | The maximum session length of a sequence. | None |
| max_batch_size | int | The maximum batch size during inference. | 128 |
| cache_max_entry_count | float | The percentage of GPU memory occupied by the k/v cache. | 0.5 |
| quant_policy | int | Set it to 4 when the k/v cache is quantized into 8 bits. | 0 |
| rope_scaling_factor | float | Scaling factor used for dynamic NTK. TurboMind follows the implementation of transformers' LlamaAttention. | 0.0 |
| use_logn_attn | bool | Whether or not to use logarithmic (log-n) attention. | False |
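
A sketch showing several of these options together; the values are illustrative only and should be tuned for your model and hardware:

from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(
    tp=2,                       # number of GPUs used in tensor parallelism
    session_len=8192,           # maximum session length of a sequence
    max_batch_size=64,
    cache_max_entry_count=0.5,  # fraction of GPU memory occupied by the k/v cache
    rope_scaling_factor=2.0)    # enable dynamic NTK scaling
pipe = pipeline('internlm/internlm-chat-7b', backend_config=backend_config)
response = pipe(['Shanghai is'])
print(response)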

PytorchEngineConfig

Description

This class provides the configuration parameters for the PyTorch backend.

Arguments

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| model_name | str | The chat template name of the deployed model. | '' |
| tp | int | The number of GPUs used in tensor parallelism. | 1 |
| session_len | int | Maximum session length. | None |
| max_batch_size | int | Maximum batch size. | 128 |
| eviction_type | str | Action to perform when the k/v cache is full. Options are ['recompute', 'copy']. | 'recompute' |
| prefill_interval | int | Interval at which to perform prefill. | 16 |
| block_size | int | Paging cache block size. | 64 |
| num_cpu_blocks | int | Number of CPU blocks. If the number is 0, the cache will be allocated according to the current environment. | 0 |
| num_gpu_blocks | int | Number of GPU blocks. If the number is 0, the cache will be allocated according to the current environment. | 0 |
| adapters | dict | The path configs to LoRA adapters. | None |
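
A sketch showing several of these options together. The values and the adapter name/path are placeholders, and the adapters dict is assumed to map an adapter name to its path:

from lmdeploy import pipeline, PytorchEngineConfig

backend_config = PytorchEngineConfig(
    tp=1,
    session_len=2048,
    eviction_type='recompute',  # recompute evicted sequences when the k/v cache is full
    block_size=64,              # paging cache block size
    # Assumption: maps an adapter name to the path of a LoRA adapter (placeholder path).
    adapters={'my_adapter': '/path/to/lora/adapter'})
pipe = pipeline('internlm/internlm-chat-7b', backend_config=backend_config)
response = pipe(['Hi, pls intro yourself'])
print(response)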

GenerationConfig

Description

This class contains the generation parameters used by inference engines.

Arguments

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| n | int | Number of chat completion choices to generate for each input message. Currently, only 1 is supported. | 1 |
| max_new_tokens | int | Maximum number of tokens that can be generated in chat completion. | 512 |
| top_p | float | Nucleus sampling, where the model considers the tokens with top_p probability mass. | 1.0 |
| top_k | int | The model considers the top_k tokens with the highest probability. | 1 |
| temperature | float | Sampling temperature. | 0.8 |
| repetition_penalty | float | Penalty to prevent the model from generating repeated words or phrases. A value larger than 1 discourages repetition. | 1.0 |
| ignore_eos | bool | Indicator to ignore the eos_token_id or not. | False |
| random_seed | int | Seed used when sampling a token. | None |
| stop_words | List[str] | Words that stop generating further tokens. | None |
| bad_words | List[str] | Words that the engine will never generate. | None |
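
A sketch combining several of these parameters; the stop word below is illustrative only and should match the chat template of your model:

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm-chat-7b')
gen_config = GenerationConfig(
    max_new_tokens=256,
    top_p=0.8,
    top_k=40,
    temperature=0.8,
    repetition_penalty=1.02,
    random_seed=1234,           # make sampling reproducible
    stop_words=['<|im_end|>'])  # illustrative stop word
response = pipe(['Shanghai is'], gen_config=gen_config)
print(response)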

FAQs

  • RuntimeError: context has already been set. If you get this error when using tp>1 with the PyTorch backend, please make sure your Python script guards its entry point with
    if __name__ == '__main__':
    Generally, in multi-threading or multi-processing contexts, initialization code must be executed only once. The if __name__ == '__main__': guard ensures that this initialization runs only in the main program and is not repeated in each newly created process or thread. A minimal sketch of such a guard is shown below.
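
A minimal sketch, assuming tp=2 and the same demo model used in the examples above:

from lmdeploy import pipeline, PytorchEngineConfig

def main():
    backend_config = PytorchEngineConfig(tp=2)
    pipe = pipeline('internlm/internlm-chat-7b', backend_config=backend_config)
    print(pipe(['Hi, pls intro yourself']))

if __name__ == '__main__':
    # Guard the entry point so that newly spawned worker processes
    # do not re-execute the pipeline construction.
    main()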