TRT-LLM builder example #232
base: main
Conversation
Super detailed doc! But I think it needs more signposting and context to make it clear what readers are supposed to do and why, especially if they're not already TRT-LLM experts.
The `quantization_type` field in the `trt_llm` config specifies the quantization method to be used when building the TRT engine:

- `NO_QUANT`: No quantization is applied, and the model runs with full precision (FP32).
Do you mean FP16 (not FP32) here & the next line?
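For orientation, here is a minimal sketch of where this field would sit in the config. The nesting under a `build` block is an assumption based on the build/serve split discussed later in the doc, not the confirmed schema:

```yaml
# Sketch only - the exact Truss config schema may differ.
trt_llm:
  build:                          # assumed nesting; build vs. serve is discussed below
    quantization_type: NO_QUANT   # no quantization, run at full precision
```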
### Introduction

TRT-LLM is a high-performance language model inference runtime from Nvidia. To use TRT-LLM on a model, users have to _compile_ a TRT engine from their corresponding PyTorch weights. In this Truss, we provide a config-driven interface to build the engine. We build the engine as part of the server startup and then use Triton Inference Server to serve the engine for high-performance workloads.

NOTE: Building an engine can take a very long time. The approach outlined above is not scalable for fast cold-starts or production workloads where autoscaling may cause multiple replicas to build the engine repeatedly.
I'm confused by this note — I thought TRT models still had a cold start time of a minute or so?
## TRT-LLM Builder Truss

### Introduction

TRT-LLM is a high-performance language model inference runtime from Nvidia. To use TRT-LLM on a model, users have to _compile_ a TRT engine from their corresponding PyTorch weights. In this Truss, we provide a config-driven interface to build the engine. We build the engine as part of the server startup and then use Triton Inference Server to serve the engine for high-performance workloads.
I think this needs more high-level context and maybe some links to some blog posts? I also think it's important to mention that compilation is for a specific GPU and models aren't super portable once compiled.
The `serve` section specifies the paths to the compiled TRT engine and tokenizer to use for serving the model.

Note that build and serve are mutually exclusive - you can only specify one or the other in the config.
Would be good to set this up at the top of the section: why would you write a build config and why would you write a serve config?
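For illustration, a sketch of the two variants. Only `quantization_type` and the build/serve split come from the doc; the field names inside `serve` are placeholders assumed for this example:

```yaml
# Build mode: compile a TRT engine from PyTorch weights at server startup.
trt_llm:
  build:
    quantization_type: FP8_KV    # see the quantization section
---
# Serve mode: point at an engine that has already been compiled.
# Exactly one of `build` or `serve` may appear in a given config.
trt_llm:
  serve:
    engine_repository: "<path-to-compiled-engine>"   # assumed field name
    tokenizer_repository: "<path-to-tokenizer>"      # assumed field name
```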
- `FP8`: Quantization to FP8 precision (8-bit floating point).
- `FP8_KV`: Both weights and key-value cache are quantized to FP8.

In general, lower precision quantization (e.g., INT4) results in smaller model size and faster inference, but may suffer from more significant accuracy degradation. Higher precision quantization (e.g., INT8, FP8) strikes a better balance between performance and accuracy.
- A `tensorrt_llm` model that uses the TensorRT-LLM backend for Triton to run your engine. During model startup, we move the engine into `/packages/tensorrt_llm_model_repository/tensorrt_llm/1/`.
- A `postprocess` model that handles streaming detokenization of tokens produced by the previous model.

These tokens are streamed to the Truss Server (defined in `model/model.py`) and then streamed to the user.
After reading this doc, it's not clear what action I'm supposed to take. What is the end goal of a reader on this doc and what steps do they need to take to get there?
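For concreteness, the Triton model repository described above would look roughly like this at runtime. Only the `tensorrt_llm` engine path and the `postprocess` model are taken from the doc; the `config.pbtxt` files and the Python-backend layout of `postprocess` are assumptions based on standard Triton conventions:

```
packages/tensorrt_llm_model_repository/
├── tensorrt_llm/
│   ├── config.pbtxt    # TensorRT-LLM backend config (standard Triton layout, assumed)
│   └── 1/              # compiled engine is copied here during model startup
└── postprocess/
    ├── config.pbtxt    # assumed
    └── 1/              # streaming detokenization model (assumed Python backend)
```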
This PR adds a high-performance TRT-LLM builder and serving truss.