TRT-LLM builder example #232

Open · wants to merge 5 commits into main
Conversation

@aspctu (Contributor) commented Mar 7, 2024

This PR adds a high-performance TRT-LLM builder and serving truss.

@philipkiely-baseten (Member) left a comment


Super detailed doc! But I think it needs more signposting and context so that readers who aren't already TRT-LLM experts understand what they're supposed to do and why.


> The `quantization_type` field in the `trt_llm` config specifies the quantization method to be used when building the TRT engine:
>
> - `NO_QUANT`: No quantization is applied, and the model runs with full precision (FP32).


Do you mean FP16 (not FP32) here & the next line?

> ### Introduction
> TRT-LLM is a high-performance language model inference runtime from Nvidia. To use TRT-LLM on a model, users have to _compile_ a TRT engine from the model's PyTorch weights. In this Truss, we provide a config-driven interface to build the engine. We build the engine as part of the server startup and then use Triton Inference Server to serve the engine for high-performance workloads.
>
> NOTE: Building an engine can take a very long time. The approach outlined above is not scalable for fast cold starts or production workloads where autoscaling may cause multiple replicas to build the engine repeatedly.


I'm confused by this note — I thought TRT models still had a cold start time of a minute or so?
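
For readers who are not TRT-LLM experts, here is a minimal sketch of what the "config-driven interface to build the engine" might look like in the Truss `config.yaml`. The `build`/`serve` split is described further down in this doc; field names other than `trt_llm`, `build`, and `quantization_type` are illustrative assumptions, not the exact schema in this PR.

```yaml
# Sketch only: a build-mode `trt_llm` section in the Truss config.yaml.
# Field names other than `trt_llm`, `build`, and `quantization_type` are
# illustrative assumptions and may not match the schema in this PR.
trt_llm:
  build:
    base_model: <hf-repo-or-path-to-pytorch-weights>  # hypothetical field
    quantization_type: NO_QUANT
    max_input_len: 2048                               # hypothetical build limit
    max_output_len: 512                               # hypothetical build limit
```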

> ## TRT-LLM Builder Truss
>
> ### Introduction
> TRT-LLM is a high-performance language model inference runtime from Nvidia. To use TRT-LLM on a model, users have to _compile_ a TRT engine from the model's PyTorch weights. In this Truss, we provide a config-driven interface to build the engine. We build the engine as part of the server startup and then use Triton Inference Server to serve the engine for high-performance workloads.


I think this needs more high-level context and maybe some links to some blog posts? I also think it's important to mention that compilation is for a specific GPU and models aren't super portable once compiled.


> The `serve` section specifies the paths to the compiled TRT engine and tokenizer to use for serving the model.
>
> Note that build and serve are mutually exclusive: you can only specify one or the other in the config.


Would be good to set this up at the top of the section: why would you write a build config and why would you write a serve config?
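
To make the build-vs-serve distinction concrete, a hedged sketch of the two mutually exclusive modes. The doc only says the `serve` section takes paths to the compiled engine and tokenizer; the exact field names under `serve` below are assumptions.

```yaml
# Sketch only: the two mutually exclusive modes described above.

# Mode 1: build the engine at server startup
trt_llm:
  build:
    quantization_type: NO_QUANT
---
# Mode 2: serve an already-compiled engine
trt_llm:
  serve:
    engine_path: /path/to/compiled/engine   # assumed field name
    tokenizer_path: /path/to/tokenizer      # assumed field name
```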

> - `FP8`: Quantization to FP8 precision (8-bit floating point).
> - `FP8_KV`: Both weights and key-value cache are quantized to FP8.
>
> In general, lower precision quantization (e.g., INT4) results in smaller model size and faster inference, but may suffer from more significant accuracy degradation. Higher precision quantization (e.g., INT8, FP8) strikes a better balance between performance and accuracy.
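
Taken together, the `quantization_type` values documented here are `NO_QUANT`, `FP8`, and `FP8_KV`. A hedged example of selecting one (the nesting under `build` is assumed):

```yaml
trt_llm:
  build:
    # One of the documented values: NO_QUANT, FP8, FP8_KV
    quantization_type: FP8_KV   # quantize both weights and KV cache to FP8
```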


> - A `tensorrt_llm` model that uses the TensorRT-LLM backend for Triton to run your engine. During model startup, we move the engine into `/packages/tensorrt_llm_model_repository/tensorrt_llm/1/`.
> - A `postprocess` model that handles streaming detokenization of tokens produced by the previous model.
>
> These tokens are streamed to the Truss Server (defined in `model/model.py`) and then streamed to the user.
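
For reference, the Triton model repository described above would be laid out roughly like this at runtime. Only the `/packages/tensorrt_llm_model_repository/tensorrt_llm/1/` path and the `postprocess` model name come from this doc; the rest (e.g. the `config.pbtxt` files Triton conventionally expects per model) is assumed.

```
/packages/tensorrt_llm_model_repository/
├── tensorrt_llm/
│   ├── config.pbtxt   # assumed: Triton config pointing at the TensorRT-LLM backend
│   └── 1/             # compiled engine is moved here during model startup
└── postprocess/
    ├── config.pbtxt   # assumed
    └── 1/             # streaming detokenization model
```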


After reading this doc, it's not clear what action I'm supposed to take. What is the end goal of a reader on this doc and what steps do they need to take to get there?

@bolasim removed their request for review on April 16, 2024 at 16:06.