enc-dec triton backend support #800
Hi, is there any update on when enc-dec models like T5 will get TRT-LLM Triton backend support? Posting an issue for awareness; I just wanted to know if it's still being planned. Thanks in advance!
#424 (reply in thread)
Comments
Hi @shannonphu, yes, we're working on it. Right now it's at the stage of adding the C++ runtime. The tentative date for Triton enc-dec support is around mid to late January. Thanks for your patience.
Does it also include continuous batching?
@symphonylyh Could you share if there's an update on this?
Hi, is there an update on this?
Hi @shannonphu, @sihanwang41, @mlmonk, @shixianc, may I use this thread to collect your feedback so we can understand your needs and prioritize better? I know @sihanwang41 specifically asked about continuous batching, i.e., inflight batching, but the others haven't shared their requirements. Can you reply describing whether any of (1), (2), or (3) would be helpful and could unblock you first? Thanks
@symphonylyh Thanks for the update! Starting with (3) would unblock our team. May I assume this would also support classic dynamic batching?
Got it, thanks for the input.
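For readers landing here: classic dynamic batching in Triton is a per-model setting in config.pbtxt, separate from TRT-LLM's inflight batching. A minimal excerpt is below; the values are purely illustrative, and whether the enc-dec path would honor this setting was exactly the open question at this point in the thread.

```
# config.pbtxt excerpt -- values are illustrative, not a recommendation
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```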
@symphonylyh (1) and/or (3). I am not super clear on the difference between the Python and C++ backends. I was using this to build the engine: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/enc_dec/README.md
We have been able to use Triton with enc_dec models, so I'm not sure what the difference between that and (1) is. We find that the TPS for that implementation is quite slow and are looking for ways to make it faster. Agree that the end goal is (3).
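To make the TPS concern concrete, here is a rough way to probe end-to-end throughput from the client side. This is a sketch under assumptions: the model name (`ensemble`) and tensor names (`text_input`, `max_tokens`, `text_output`) follow tensorrtllm_backend conventions, and a custom Python-backend deployment like the one described here may use different names.

```python
# Rough end-to-end tokens-per-second probe against a Triton HTTP endpoint.
# Model and tensor names are assumptions (tensorrtllm_backend conventions);
# adjust them to match the deployed model's config.pbtxt.
import time
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

def infer_once(prompt: str, max_tokens: int = 64) -> str:
    text = httpclient.InferInput("text_input", [1, 1], "BYTES")
    text.set_data_from_numpy(np.array([[prompt]], dtype=object))
    length = httpclient.InferInput("max_tokens", [1, 1], "INT32")
    length.set_data_from_numpy(np.array([[max_tokens]], dtype=np.int32))
    out = httpclient.InferRequestedOutput("text_output")
    result = client.infer("ensemble", inputs=[text, length], outputs=[out])
    raw = result.as_numpy("text_output").flatten()[0]
    return raw.decode() if isinstance(raw, bytes) else str(raw)

start = time.time()
completion = infer_once("translate English to German: The house is wonderful.")
elapsed = time.time() - start
# Crude estimate: whitespace-delimited output tokens per wall-clock second.
print(f"~{len(completion.split()) / elapsed:.1f} tok/s -> {completion!r}")
```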
@mlmonk Oh interesting, I was under the impression that we just couldn't serve T5 models on Triton yet because the TRT-LLM backend wasn't ready for it.
@symphonylyh @shannonphu We have been able to use Flan-T5 with Triton. I believe this is (1). You can reproduce it here. Note that this was a much older version of both libraries, when Flan-T5 was not officially supported. Like @shixianc mentioned, (3) would unblock us and (4) would be the ideal state. It would be great if you could share how far along you are with the (3) release.
Hey @symphonylyh, do you have any updates on the progress?
@symphonylyh, any progress?
Hello @symphonylyh, is there any progress on any of (1)-(4)?
We would love (1).
Hi @shannonphu, @sihanwang41, @mlmonk, @shixianc, @LuckyL00ser, @XiaobingSuper, @TeamSeshDeadBoy, @mrmuke: as part of today's release #1725, the enc-dec C++ runtime has been implemented with inflight batching and paged KV cache. Please give it a try following the README's C++ runtime section. The next items on our roadmap will follow soon.
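To make the announcement concrete, here is a minimal sketch of driving the enc-dec C++ runtime through the Python bindings, loosely in the style of the repo's run scripts. The `is_enc_dec` and `encoder_input_ids` parameter names and the engine directory layout are assumptions that may differ across TensorRT-LLM versions; the README's C++ runtime section is the authoritative reference.

```python
# Sketch: enc-dec generation via TensorRT-LLM's C++ runtime bindings.
# ASSUMPTIONS: is_enc_dec / encoder_input_ids kwargs and the engine layout
# may vary by version -- follow the README's C++ runtime section.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunnerCpp

tokenizer = AutoTokenizer.from_pretrained("t5-small")
runner = ModelRunnerCpp.from_dir(
    engine_dir="trt_engines/t5-small",  # holds the encoder and decoder engines
    is_enc_dec=True,
)

encoder_input_ids = [torch.tensor(
    tokenizer("translate English to German: Good morning.").input_ids,
    dtype=torch.int32,
)]
# T5-style decoders start generation from the pad token.
decoder_input_ids = [torch.tensor([tokenizer.pad_token_id], dtype=torch.int32)]

outputs = runner.generate(
    batch_input_ids=decoder_input_ids,
    encoder_input_ids=encoder_input_ids,
    max_new_tokens=32,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))
```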
Thanks for the update! This is excellent news; I'm sure it was a lot of effort to make it happen.
Hello @symphonylyh,
@HamzaG737 It's full-fledged now. For (1), the Triton backend, you can follow the guide here: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/encoder_decoder.md. Also closing this issue, as support has been added.
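For anyone arriving via that guide, a quick smoke test once the server is up, using Triton's HTTP generate extension. The model name (`ensemble`) and the `text_input`/`max_tokens`/`text_output` field names follow tensorrtllm_backend conventions and may differ in your model repository.

```python
# Smoke test via Triton's HTTP generate extension (/v2/models/<name>/generate).
# Model and field names are assumptions based on tensorrtllm_backend defaults.
import requests

resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json={"text_input": "translate English to German: Thank you.",
          "max_tokens": 32},
)
resp.raise_for_status()
print(resp.json()["text_output"])
```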