# LMI V10 containers release
This document covers the latest releases of our LMI containers. For details on previous releases, please refer to our [GitHub release page](https://github.com/deepjavalibrary/djl-serving/releases).
## Release Notes
### Release date: August 16, 2024
Check out our latest [Large Model Inference Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers).
### Key Features
#### DJL Serving Changes (applicable to all containers)
* Allows configuring health checks to fail based on various types of error rates
* When not streaming responses, all invocation errors will respond with the appropriate 4xx or 5xx HTTP response code
    * Previously, some inference backends (vllm, lmi-dist, tensorrt-llm) returned 2xx HTTP responses even when errors occurred during inference
* HTTP Response Codes are now configurable if you require a specific 4xx or 5xx status to be returned in certain situations
* Introduced the `@input_formatter` and `@output_formatter` annotations to bring your own script for pre- and post-processing; see the sketch after this list
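
For illustration, here is a minimal sketch of a custom output formatter in a bring-your-own-script file (for example, `model.py`). Only the annotation names come from this release; the import path, the fields accessed on the output object, and the return contract are assumptions based on the linked v0.29.0 user guides, so verify them against those docs. An analogous `@input_formatter` covers pre-processing.

```python
import json

# Assumed import path; verify against the DJL Serving v0.29.0 docs.
from djl_python.output_formatter import output_formatter


@output_formatter
def custom_output_formatter(request_output):
    """Post-process each generated token into a custom JSON line.

    The attributes used below (sequences, best_sequence_index, get_next_token,
    finish_reason) reflect our understanding of the documented output object
    and are assumptions rather than guarantees.
    """
    best_sequence = request_output.sequences[request_output.best_sequence_index]
    next_token, _is_first, is_last = best_sequence.get_next_token()
    result = {"token_text": next_token.text}
    if is_last:
        result["finish_reason"] = best_sequence.finish_reason
    return json.dumps(result) + "\n"
```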
#### LMI Container (vllm, lmi-dist)
* vLLM updated to version 0.5.3.post1
* Added multimodal support for vision language models using the OpenAI Chat Completions schema.
    * More details are available [here](https://github.com/deepjavalibrary/djl-serving/blob/v0.29.0/serving/docs/lmi/user_guides/vision_language_models.md); a request sketch is shown after this list.
* Supports Llama 3.1 models
* Supports beam search, `best_of`, and `n` with non-streaming output.
* Supports chunked prefill in both vllm and lmi-dist.
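
As a usage illustration for the multimodal support above, the sketch below sends an OpenAI Chat Completions-style request containing an image URL to a running LMI endpoint. The payload shape follows the OpenAI Chat Completions schema referenced in the linked guide; the endpoint URL and image URL are placeholders, and on SageMaker the same JSON body would be sent through `InvokeEndpoint` instead.

```python
import requests

# Placeholder endpoint for a locally running LMI container; adjust for your
# deployment (for example, a SageMaker endpoint invoked with the same body).
ENDPOINT = "http://localhost:8080/invocations"

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                # Placeholder image URL
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
    "max_tokens": 256,
}

response = requests.post(ENDPOINT, json=payload)
print(response.json())
```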
#### TensorRT-LLM Container
* TensorRT-LLM updated to version 0.11.0
* **[Breaking change]** Flan-T5 is now supported with the C++ Triton backend. Flan-T5 support for the TensorRT-LLM Python backend has been removed.
#### Transformers NeuronX Container
* Upgraded to Transformers NeuronX 2.19.1
#### Text Embedding (using the LMI container)
* Various performance improvements
### Breaking Changes
* In the TensorRT-LLM container, Flan-T5 is now supported with the C++ Triton backend; Flan-T5 support for the TensorRT-LLM Python backend has been removed.
### Known Issues
* Running Gemma and Phi models with TensorRT-LLM is currently only viable at TP=1 because of an issue in TensorRT-LLM where only one engine is built even when TP > 1.
* When using lmi-dist, in the rare case that the machine has a broken CUDA driver, the server can hang. In that case, set `LMI_USE_VLLM_GPU_P2P_CHECK=1` so that LMI uses a fallback option compatible with the broken CUDA driver; a deployment sketch setting this variable follows.
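
As a hedged illustration of the workaround above, the sketch below sets the environment variable when deploying with the SageMaker Python SDK. Only the variable name comes from these notes; the image URI, role, model data location, and instance type are placeholders.

```python
from sagemaker.model import Model

# All values below except the environment variable name are placeholders.
model = Model(
    image_uri="<lmi-container-image-uri>",
    model_data="<s3-uri-of-model-artifacts>",
    role="<sagemaker-execution-role-arn>",
    env={
        # Fall back to a peer-to-peer check path that tolerates a broken
        # CUDA driver (see the known issue above).
        "LMI_USE_VLLM_GPU_P2P_CHECK": "1",
    },
)

model.deploy(initial_instance_count=1, instance_type="ml.g5.12xlarge")
```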