From 11a63d68bbeea88ca384f576158f0fe002c6ffd1 Mon Sep 17 00:00:00 2001
From: kthui <18255193+kthui@users.noreply.github.com>
Date: Thu, 4 Apr 2024 14:29:33 -0700
Subject: [PATCH] Add docs for async execute for decoupled model

---
 README.md | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 93fd212f..74c8d8df 100644
--- a/README.md
+++ b/README.md
@@ -620,9 +620,22 @@ full power of what can be achieved from decoupled API. Read
 [Decoupled Backends and Models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/decoupled_models.md)
 for more details on how to host a decoupled model.
 
-##### Known Issues
-
-* Currently, decoupled Python models can not make async infer requests.
+##### Async Execute
+
+Starting from 24.04, `async def execute(self, requests):` is supported for
+decoupled Python models. Its coroutine will be executed by an AsyncIO event loop
+shared with requests executing in the same model instance. The next request for
+the model instance can start executing while the current request is waiting.
+
+This is useful for minimizing the number of model instances for models that
+spend the majority of their time waiting, since requests can be executed
+"concurrently" by AsyncIO. To take full advantage of this "concurrency", it is
+vital that the async execute function does not block the event loop from making
+progress while it is waiting, e.g. while downloading over the network.
+
+Limitations:
+* Neither the server nor the backend controls how many requests can be
+  executed "concurrently" by a model instance.
 
 #### Request Rescheduling
 
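As an illustration of the documentation added above (not part of the patch itself), here is a minimal sketch of what an async `execute` for a decoupled Python model might look like. The tensor names `INPUT0`/`OUTPUT0` and the `asyncio.sleep` stand-in for real non-blocking I/O are assumptions for the example; the response-sender calls follow the existing `triton_python_backend_utils` decoupled API.

```python
# Hypothetical sketch of an async execute for a decoupled Python model.
# INPUT0/OUTPUT0 and the simulated network wait are placeholders.
import asyncio

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    async def execute(self, requests):
        # Because this coroutine awaits instead of blocking, the shared
        # AsyncIO event loop can start the next request for this model
        # instance while the current one is waiting.
        for request in requests:
            sender = request.get_response_sender()
            input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")

            # Stand-in for I/O-bound work such as a network download; a real
            # model would use a non-blocking client here rather than blocking
            # the event loop.
            await asyncio.sleep(0.1)

            output = pb_utils.Tensor("OUTPUT0", input0.as_numpy())
            sender.send(pb_utils.InferenceResponse(output_tensors=[output]))

            # Decoupled models must mark the final response for each request.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)

        # Decoupled execute returns None; responses go through the sender.
        return None
```

The key point is that every wait inside the coroutine is awaited rather than blocked on, so the event loop shared by the model instance can keep other requests progressing in the meantime.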