[Text Generation][Doc] Point to KV Cache Injection #1149

Closed
wants to merge 4 commits into from
Changes from 1 commit
25 changes: 15 additions & 10 deletions src/deepsparse/transformers/README.md
@@ -71,7 +71,7 @@ of the sparsified transformers model.

If no model is specified to the `Pipeline` for a given task, the `Pipeline` will automatically
select a pruned and quantized model for the task from the `SparseZoo` that can be used for accelerated
inference. Note that other models in the SparseZoo will have different tradeoffs between speed, size,
inference. Note that other models in the SparseZoo will have different tradeoffs between speed, size,
and accuracy.

### HTTP Server
@@ -139,31 +139,36 @@ response.text
>> '{"score":0.9534820914268494,"start":8,"end":14,"answer":"batman"}'
```

### Text Generation
The text generation task generates a sequence of tokens given the prompt. Popular text generation LLMs (Large Language Models) are used
### Text Generation
The text generation task generates a sequence of tokens given the prompt. Popular text generation Large Language Models (LLMs) are used
for chatbots (instruction models), code generation, text summarization, or filling in missing text. The following example uses a sparsified text generation
OPT model to complete the prompt
OPT model to complete the prompt.

[List of available SparseZoo Text Generation Models](
https://sparsezoo.neuralmagic.com/?useCase=text_generation)
#### KV Cache Injection
Note that to take full advantage of the speedups provided by the DeepSparse Engine, it is essential to run inference using a model with KV cache support.
If you are using one of the pre-sparsified models from SparseZoo ([list of available SparseZoo Text Generation Models](
https://sparsezoo.neuralmagic.com/?useCase=text_generation)), you will automatically benefit from the KV cache speedups.
However, if you are sparsifying a custom model, you may want to add KV cache support to it. This can significantly improve inference speed.

For more details, please refer to the [SparseML documentation on KV cache injection](...)
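
To illustrate why KV cache support matters for generation speed, here is a minimal toy sketch in plain Python. It is not DeepSparse's or SparseML's implementation: the `project`, `attend`, `decode_without_cache`, and `decode_with_cache` helpers and the fixed scaling "weights" are hypothetical stand-ins, assuming a single attention head. It only shows that caching the keys and values of earlier tokens yields the same attention outputs while projecting each token once.

```python
import math


def project(x, scale):
    """Stand-in for a learned key/value projection (hypothetical toy weights)."""
    return [scale * v for v in x]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * value[i] for w, value in zip(weights, values))
        for i in range(dim)
    ]


def decode_without_cache(tokens):
    """Recompute keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for step in range(1, len(tokens) + 1):
        prefix = tokens[:step]
        keys = [project(t, 0.5) for t in prefix]
        values = [project(t, 2.0) for t in prefix]
        projections += 2 * len(prefix)
        outputs.append(attend(prefix[-1], keys, values))
    return outputs, projections


def decode_with_cache(tokens):
    """Project only the newest token and append it to the cache."""
    outputs, projections = [], 0
    key_cache, value_cache = [], []
    for token in tokens:
        key_cache.append(project(token, 0.5))
        value_cache.append(project(token, 2.0))
        projections += 2
        outputs.append(attend(token, key_cache, value_cache))
    return outputs, projections


# Four toy "token embeddings" of dimension 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
plain_outputs, plain_cost = decode_without_cache(tokens)
cached_outputs, cached_cost = decode_with_cache(tokens)
print(plain_cost, cached_cost)
```

For these four tokens the uncached loop performs 20 projections and the cached loop 8. The uncached count grows quadratically with sequence length, while the cached count grows linearly.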

#### Python Pipeline
```python
from deepsparse import Pipeline

opt_pipeline = Pipeline.create(task="opt")
opt_pipeline = Pipeline.create(task="opt", max_generated_tokens=32)

inference = opt_pipeline("Who is the president of the United States?")

>> 'The president of the United States is the head of the executive branch of government...'
>> 'The president of the United States is the head of the executive branch of government...' #TODO: Waiting for a good stub to use
```

#### HTTP Server
Spinning up:
```bash
deepsparse.server \
task text-generation \
--model_path # TODO: Pending until text generation models get uploaded to SparseZoo
--model_path zoo:nlg/text_generation/opt-1.3b/pytorch/huggingface/pretrained/pruned50_quant-none #TODO: Waiting for a good stub to use
```

Making a request:
@@ -177,7 +182,7 @@ obj = {"sequence": "Who is the president of the United States?"}
response = requests.post(url, json=obj)
response.text

>> 'The president of the United States is the head of the executive branch of government...'
>> 'The president of the United States is the head of the executive branch of government...' #TODO: Waiting for a good stub to use
```

### Sentiment Analysis
1 change: 1 addition & 0 deletions src/deepsparse/transformers/pipelines/text_generation.py
@@ -94,6 +94,7 @@ class Config:
@Pipeline.register(
task="text_generation",
task_aliases=["codegen", "opt", "bloom"],
default_model_path="zoo:nlg/text_generation/opt-1.3b/pytorch/huggingface/pretrained/pruned50_quant-none", # noqa E501
)
class TextGenerationPipeline(TransformersPipeline):
"""