diff --git a/docs/README.md b/docs/README.md
index 6fa3da5180..22e0c0d691 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -69,6 +69,7 @@ The User Guide describes how to configure Triton, organize and configure your mo
 * Collecting Server Metrics [[Overview](README.md#metrics) || [Details](user_guide/metrics.md)]
 * Supporting Custom Ops/layers [[Overview](README.md#framework-custom-operations) || [Details](user_guide/custom_operations.md)]
 * Using the Client API [[Overview](README.md#client-libraries-and-examples) || [Details](https://github.com/triton-inference-server/client)]
+* Cancelling Inference Requests [[Overview](README.md#cancelling-inference-requests) || [Details](user_guide/request_cancellation.md)]
 * Analyzing Performance [[Overview](README.md#performance-analysis)]
 * Deploying on edge (Jetson) [[Overview](README.md#jetson-and-jetpack)]
 * Debugging Guide [Details](./user_guide/debugging_guide.md)
@@ -165,6 +166,8 @@ Use the [Triton Client](https://github.com/triton-inference-server/client) API t
 - [Java/Scala](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/java)
 - [Javascript](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/javascript)
 - [Shared Memory Extension](protocol/extension_shared_memory.md)
+### Cancelling Inference Requests
+Triton can detect and handle requests that have been cancelled from the client side. This [document](user_guide/request_cancellation.md) discusses the scope and limitations of the feature.
 ### Performance Analysis
 Understanding Inference performance is key to better resource utilization. Use Triton's Tools to costomize your deployment.
 - [Performance Tuning Guide](user_guide/performance_tuning.md)
diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md
new file mode 100644
index 0000000000..49865f25c8
--- /dev/null
+++ b/docs/user_guide/request_cancellation.md
@@ -0,0 +1,101 @@
+
+
+# Request Cancellation
+
+Starting from 23.10, Triton supports handling request cancellation received
+from the gRPC client or a C API user. Long-running inference requests, such
+as those for auto-generative large language models, may run for an indeterminate
+amount of time or an indeterminate number of steps. Additionally, clients may
+enqueue a large number of requests as part of a sequence or request stream
+and later determine the results are no longer needed. Continuing to process
+requests whose results are no longer required can significantly impact server
+resources.
+
+## Issuing Request Cancellation
+
+### Triton C API
+
+The [In-Process Triton Server C API](../customization_guide/inference_protocols.md#in-process-triton-server-api) has been enhanced with `TRITONSERVER_InferenceRequestCancel`
+and `TRITONSERVER_InferenceRequestIsCancelled` to issue cancellation and to query
+whether cancellation has been issued on an inflight request, respectively. Read more
+about the APIs in [tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h).
+
+
+### gRPC Endpoint
+
+In addition, the [gRPC endpoint](../customization_guide/inference_protocols.md#httprest-and-grpc-protocols) can
+now detect cancellation from the client and attempt to terminate the request.
+At present, only the gRPC Python client supports issuing request cancellation
+to the server endpoint. See [request-cancellation](https://github.com/triton-inference-server/client#request-cancellation)
+for more details on how to cancel requests from the client side.
+See the gRPC guide on RPC [cancellation](https://grpc.io/docs/guides/cancellation/) for
+finer details.
+
+## Handling in Triton Core
+
+Triton core checks for requests that have been cancelled at some critical points
+when using [dynamic](./model_configuration.md#dynamic-batcher) or
+[sequence](./model_configuration.md#sequence-batcher) batching. The check is
+also performed between
+[ensemble](./model_configuration.md#ensemble-scheduler) steps, and
+further processing is terminated if the request is cancelled.
+
+On detecting a cancelled request, Triton core responds with a CANCELLED status. If a request
+is cancelled when using [sequence_batching](./model_configuration.md#sequence-batcher),
+then all the pending requests in the same sequence will also be cancelled. A sequence
+consists of the requests that have an identical sequence ID.
+
+**Note**: Currently, Triton core does not detect the cancellation status of a request once
+it is forwarded to the [rate limiter](./rate_limiter.md). Improving request cancellation
+detection and handling within Triton core is a work in progress.
+
+## Handling in Backend
+
+Upon receiving a request cancellation, Triton does its best to terminate the request
+at various points. However, once a request has been given to the backend
+for execution, it is up to the individual backends to detect and handle
+request termination.
+Currently, the following backends support early termination:
+- [vLLM backend](https://github.com/triton-inference-server/vllm_backend)
+- [Python backend](https://github.com/triton-inference-server/python_backend)
+
+The Python backend is a special case where we expose the APIs to detect the cancellation
+status of the request, but it is up to the `model.py` developer to detect whether
+the request is cancelled and terminate further execution.
+
+**For the backend developer**: The backend APIs have also been enhanced to let the
+backend detect whether the request received from Triton core has been cancelled.
+See `TRITONBACKEND_RequestIsCancelled` and `TRITONBACKEND_ResponseFactoryIsCancelled`
+in [tritonbackend.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonbackend.h)
+for more details. Upon detecting request cancellation, the backend can stop processing
+the request any further.
+Python models running behind the Python backend can also query the cancellation status
+of the request and `response_sender`. See [this](https://github.com/triton-inference-server/python_backend#request-cancellation-handling)
+section in the Python backend documentation for more details.
+