From 11a63d68bbeea88ca384f576158f0fe002c6ffd1 Mon Sep 17 00:00:00 2001
From: kthui <18255193+kthui@users.noreply.github.com>
Date: Thu, 4 Apr 2024 14:29:33 -0700
Subject: [PATCH] Add docs for async execute for decoupled model

---
 README.md | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 93fd212f..74c8d8df 100644
--- a/README.md
+++ b/README.md
@@ -620,9 +620,22 @@ full power of what can be achieved from decoupled API. Read
 [Decoupled Backends and Models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/decoupled_models.md)
 for more details on how to host a decoupled model.
 
-##### Known Issues
-
-* Currently, decoupled Python models can not make async infer requests.
+##### Async Execute
+
+Starting from 24.04, `async def execute(self, requests):` is supported for
+decoupled Python models. Its coroutine will be executed by an AsyncIO event loop
+shared with requests executing in the same model instance. The next request for
+the model instance can start executing while the current request is waiting.
+
+This is useful for minimizing the number of model instances for models that
+spend the majority of their time waiting, since requests can be executed
+"concurrently" by AsyncIO. To take full advantage of this "concurrency", it is
+vital that the async execute function does not block the event loop from making
+progress while it is waiting, e.g. while downloading over the network.
+
+Limitations:
+* Neither the server nor the backend controls how many requests can be
+  executed "concurrently" by a model instance.
 
 #### Request Rescheduling
 
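As an illustration of the documentation added above (not part of the patch itself), here is a minimal sketch of what an async `execute` for a decoupled Python model might look like. The tensor names `INPUT0`/`OUTPUT0` and the `asyncio.sleep` stand-in for real non-blocking I/O are assumptions for the example; the response-sender calls follow the existing `triton_python_backend_utils` decoupled API.

```python
# Hypothetical sketch of an async execute for a decoupled Python model.
# INPUT0/OUTPUT0 and the simulated network wait are placeholders.
import asyncio

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    async def execute(self, requests):
        # Because this coroutine awaits instead of blocking, the shared
        # AsyncIO event loop can start the next request for this model
        # instance while the current one is waiting.
        for request in requests:
            sender = request.get_response_sender()
            input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")

            # Stand-in for I/O-bound work such as a network download; a real
            # model would use a non-blocking client here rather than blocking
            # the event loop.
            await asyncio.sleep(0.1)

            output = pb_utils.Tensor("OUTPUT0", input0.as_numpy())
            sender.send(pb_utils.InferenceResponse(output_tensors=[output]))

            # Decoupled models must mark the final response for each request.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)

        # Decoupled execute returns None; responses go through the sender.
        return None
```

The key point is that every wait inside the coroutine is awaited rather than blocked on, so the event loop shared by the model instance can keep other requests progressing in the meantime.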