TRT-LLM builder example #232
base: main
Conversation
Super detailed doc! But I think it needs more signposting and context to make it clear what readers are supposed to do and why, especially if they're not already TRT-LLM experts.
The `quantization_type` field in the `trt_llm` config specifies the quantization method to be used when building the TRT engine:

- `NO_QUANT`: No quantization is applied, and the model runs with full precision (FP32).
Do you mean FP16 (not FP32) here & the next line?
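For orientation, here is a minimal sketch of where this field would sit in the config. The nesting under a `build` block is an assumption based on the build/serve split discussed later in the doc, not the confirmed schema:

```yaml
# Sketch only - the exact Truss config schema may differ.
trt_llm:
  build:                          # assumed nesting; build vs. serve is discussed below
    quantization_type: NO_QUANT   # no quantization, run at full precision
```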
### Introduction

TRT-LLM is a high-performance language model inference runtime from Nvidia. To use TRT-LLM on a model, users have to _compile_ a TRT engine from their corresponding PyTorch weights. In this Truss, we provide a config-driven interface to build the engine. We build the engine as part of the server startup and then use Triton Inference Server to serve the engine for high-performance workloads.

NOTE: Building an engine can take a very long time. The approach outlined above is not scalable for fast cold-starts or production workloads where autoscaling may cause multiple replicas to build the engine repeatedly.
I'm confused by this note — I thought TRT models still had a cold start time of a minute or so?
## TRT-LLM Builder Truss

### Introduction

TRT-LLM is a high-performance language model inference runtime from Nvidia. To use TRT-LLM on a model, users have to _compile_ a TRT engine from their corresponding PyTorch weights. In this Truss, we provide a config-driven interface to build the engine. We build the engine as part of the server startup and then use Triton Inference Server to serve the engine for high-performance workloads.
I think this needs more high-level context and maybe some links to some blog posts? I also think it's important to mention that compilation is for a specific GPU and models aren't super portable once compiled.
The `serve` section specifies the paths to the compiled TRT engine and tokenizer to use for serving the model.

Note that build and serve are mutually exclusive - you can only specify one or the other in the config.
Would be good to set this up at the top of the section: why would you write a build config and why would you write a serve config?
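For illustration, a sketch of the two variants. Only `quantization_type` and the build/serve split come from the doc; the field names inside `serve` are placeholders assumed for this example:

```yaml
# Build mode: compile a TRT engine from PyTorch weights at server startup.
trt_llm:
  build:
    quantization_type: FP8_KV    # see the quantization section
---
# Serve mode: point at an engine that has already been compiled.
# Exactly one of `build` or `serve` may appear in a given config.
trt_llm:
  serve:
    engine_repository: "<path-to-compiled-engine>"   # assumed field name
    tokenizer_repository: "<path-to-tokenizer>"      # assumed field name
```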
- `FP8`: Quantization to FP8 precision (8-bit floating point).
- `FP8_KV`: Both weights and key-value cache are quantized to FP8.

In general, lower precision quantization (e.g., INT4) results in smaller model size and faster inference, but may suffer from more significant accuracy degradation. Higher precision quantization (e.g., INT8, FP8) strikes a better balance between performance and accuracy.
- A `tensorrt_llm` model that uses the TensorRT-LLM backend for Triton to run your engine. During model startup, we move the engine into `/packages/tensorrt_llm_model_repository/tensorrt_llm/1/`.
- A `postprocess` model that handles streaming detokenization of tokens produced by the previous model.

These tokens are streamed to the Truss Server (defined in `model/model.py`) and then streamed to the user.
After reading this doc, it's not clear what action I'm supposed to take. What is the end goal of a reader on this doc and what steps do they need to take to get there?
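For concreteness, the Triton model repository described above would look roughly like this at runtime. Only the `tensorrt_llm` engine path and the `postprocess` model are taken from the doc; the `config.pbtxt` files and the Python-backend layout of `postprocess` are assumptions based on standard Triton conventions:

```
packages/tensorrt_llm_model_repository/
├── tensorrt_llm/
│   ├── config.pbtxt    # TensorRT-LLM backend config (standard Triton layout, assumed)
│   └── 1/              # compiled engine is copied here during model startup
└── postprocess/
    ├── config.pbtxt    # assumed
    └── 1/              # streaming detokenization model (assumed Python backend)
```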
This PR adds a high-performance TRT-LLM builder and serving truss.