Releases: triton-inference-server/server
Release 1.10.0, corresponding to NGC container 20.01
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 1.10.0
- Server status can be requested in JSON format using the HTTP/REST API. Use endpoint `/api/status?format=json` (see the example below).
- The dynamic batcher now has an option to preserve the ordering of batched requests when there are multiple model instances. See model_config.proto for more information.
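A minimal sketch of querying this endpoint from Python; the localhost:8000 address is an assumption (the server's default HTTP port), so adjust the URL for your deployment:

```python
# Hedged sketch: fetch the server status as JSON over the HTTP/REST API.
# The localhost:8000 address is an assumption (the server's default HTTP port).
import json
import urllib.request

url = "http://localhost:8000/api/status?format=json"
with urllib.request.urlopen(url) as response:
    status = json.loads(response.read().decode("utf-8"))

# Pretty-print whatever status the server returned.
print(json.dumps(status, indent=2))
```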
Known Issues
- TensorRT reformat-free I/O is not supported.
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
Client Libraries and Examples
Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.10.0_ubuntu1604.clients.tar.gz and v1.10.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.
Custom Backend SDK
Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.10.0_ubuntu1604.custombackend.tar.gz and v1.10.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.
Release 1.9.0, corresponding to NGC container 19.12
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 1.9.0
- The model configuration now includes a model warmup option. This option provides the ability to tune and optimize the model before inference requests are received, avoiding initial inference delays. This option is especially useful for frameworks like TensorFlow that perform network optimization in response to the initial inference requests. Models can be warmed-up with one or more synthetic or realistic workloads before they become ready in the server.
- An enhanced sequence batcher now has multiple scheduling strategies. A new Oldest strategy integrates with the dynamic batcher to enable improved inference performance for models that don’t require all inference requests in a sequence to be routed to the same batch slot.
- The `perf_client` now has an option to generate requests using a realistic Poisson distribution or a user-provided distribution.
- A new repository API (available in the shared library API, HTTP, and GRPC) returns an index of all models available in the model repositories visible to the server. This index can be used to see what models are available for loading onto the server.
- The server status returned by the server status API now includes the timestamp of the last inference request received for each model.
- Inference server tracing capabilities are now documented in the Optimization section of the User Guide. Tracing support is enhanced to provide traces for ensembles and the models they contain.
- A community contributed Dockerfile is now available to build the TensorRT Inference Server clients on CentOS.
Known Issues
- The beta of the custom backend API version 2 has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
  - The signature of the `CustomGetNextInputV2Fn_t` function adds the `memory_type_id` argument.
  - The signature of the `CustomGetOutputV2Fn_t` function adds the `memory_type_id` argument.
- The beta of the inference server library API has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
  - The signature and operation of the `TRTSERVER_ResponseAllocatorAllocFn_t` function have changed. See `src/core/trtserver.h` for a description of the new behavior.
  - The signature of the `TRTSERVER_InferenceRequestProviderSetInputData` function adds the `memory_type_id` argument.
  - The signature of the `TRTSERVER_InferenceResponseOutputData` function adds the `memory_type_id` argument.
- TensorRT reformat-free I/O is not supported.
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
Client Libraries and Examples
Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.9.0_ubuntu1604.clients.tar.gz and v1.9.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.
Custom Backend SDK
Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.9.0_ubuntu1604.custombackend.tar.gz and v1.9.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.
Release 1.8.0, corresponding to NGC container 19.11
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 1.8.0
- Shared-memory support is expanded to include CUDA shared memory.
- Improved the efficiency of pinned memory used for ensemble models.
- The perf_client application has been improved with easier-to-use command-line arguments (while maintaining compatibility with existing arguments).
- Support for string tensors has been added to perf_client.
- Documentation contains a new “Optimization” section discussing some common
optimization strategies and how to use perf_client to explore these
strategies.
Deprecated Features
- The asynchronous inference API has been modified in the C++ and Python client libraries (see the Python sketch after this list).
  - In the C++ library:
    - The non-callback version of the `AsyncRun` function was removed.
    - The `GetReadyAsyncRequest` function was removed.
    - The signature of the `GetAsyncRunResults` function was changed to remove the `is_ready` and `wait` arguments.
  - In the Python library:
    - The non-callback version of the `async_run` function was removed.
    - The `get_ready_async_request` function was removed.
    - The signature of the `get_async_run_results` function was changed to remove the `wait` argument.
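For orientation, a minimal sketch of the remaining callback-based flow in the Python client library. Only `async_run` and `get_async_run_results` are named above; the `InferContext` class, import path, tensor names, and exact argument ordering are assumptions based on the 1.x tensorrtserver client and may not match this release exactly:

```python
import numpy as np
# Assumption: the 1.x Python client exposes InferContext/ProtocolType here.
from tensorrtserver.api import InferContext, ProtocolType

ctx = InferContext("localhost:8000", ProtocolType.HTTP, "my_model", -1)

def on_complete(ctx, request_id):
    # Per the notes above, get_async_run_results no longer takes a 'wait' argument.
    results = ctx.get_async_run_results(request_id)
    print(results)

# Only the callback form of async_run remains; the non-callback form was removed.
# The model, tensor names, and shape below are hypothetical placeholders.
input_data = np.zeros((3, 224, 224), dtype=np.float32)
ctx.async_run(on_complete,
              {"INPUT0": [input_data]},
              {"OUTPUT0": InferContext.ResultFormat.RAW},
              batch_size=1)
```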
Known Issues
- The beta of the custom backend API version 2 has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
  - The signature of the `CustomGetNextInputV2Fn_t` function adds the `memory_type_id` argument.
  - The signature of the `CustomGetOutputV2Fn_t` function adds the `memory_type_id` argument.
- The beta of the inference server library API has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
  - The signature and operation of the `TRTSERVER_ResponseAllocatorAllocFn_t` function have changed. See `src/core/trtserver.h` for a description of the new behavior.
  - The signature of the `TRTSERVER_InferenceRequestProviderSetInputData` function adds the `memory_type_id` argument.
  - The signature of the `TRTSERVER_InferenceResponseOutputData` function adds the `memory_type_id` argument.
- TensorRT reformat-free I/O is not supported.
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
Client Libraries and Examples
Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.8.0_ubuntu1604.clients.tar.gz and v1.8.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.
Custom Backend SDK
Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.8.0_ubuntu1604.custombackend.tar.gz and v1.8.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.
Release 1.7.0, corresponding to NGC container 19.10
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 1.7.0
- A Client SDK container is now provided on NGC in addition to the inference server container. The client SDK container includes the client libraries and examples (see the Python sketch after this list).
- TensorRT optimization may now be enabled for any TensorFlow model by enabling the feature in the optimization section of the model configuration.
- The ONNXRuntime backend now includes the TensorRT and OpenVINO execution providers. These providers are enabled in the optimization section of the model configuration.
- Automatic configuration generation (`--strict-model-config=false`) now works correctly for TensorRT models with variable-sized inputs and/or outputs.
- Multiple model repositories may now be specified on the command line. Optional command-line options can be used to explicitly load specific models from each repository.
- Ensemble models are now pruned dynamically so that only the models needed to calculate the requested outputs are executed.
- The example clients now include a simple Go example that uses the GRPC API.
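As a point of reference, a minimal sketch of calling the server through the client libraries included in the SDK, here over GRPC. The `InferContext`/`ProtocolType` API and the localhost:8001 default GRPC port are assumptions based on the 1.x tensorrtserver client; the model and tensor names are hypothetical:

```python
import numpy as np
# Assumption: the SDK's 1.x Python client library exposes this API.
from tensorrtserver.api import InferContext, ProtocolType

# localhost:8001 assumes the server's default GRPC port.
ctx = InferContext("localhost:8001", ProtocolType.GRPC, "my_tf_model", -1)

# Hypothetical single input/output; names and shapes depend on the model.
input_data = np.random.rand(3, 224, 224).astype(np.float32)
results = ctx.run({"INPUT0": [input_data]},
                  {"OUTPUT0": InferContext.ResultFormat.RAW},
                  batch_size=1)
print(results["OUTPUT0"])
```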
Known Issues
- In TensorRT 6.0.1, reformat-free I/O is not supported.
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
Client Libraries and Examples
Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.7.0_ubuntu1604.clients.tar.gz and v1.7.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.
Custom Backend SDK
Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.7.0_ubuntu1604.custombackend.tar.gz and v1.7.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.
Release 1.6.0, corresponding to NGC container 19.09
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 1.6.0
- Added TensorRT 6 support, which includes support for TensorRT dynamic shapes.
- Shared memory support is added as an alpha feature in this release. This support allows input and output tensors to be communicated via shared memory instead of over the network. Currently only system (CPU) shared memory is supported.
- Amazon S3 is now supported as a remote file system for model repositories. Use the s3:// prefix on model repository paths to reference S3 locations.
- The inference server library API is available as a beta in this release. The library API allows you to link against libtrtserver.so so that you can include all the inference server functionality directly in your application.
- GRPC endpoint performance improvement. The inference server's GRPC endpoint now uses significantly less memory while delivering higher performance.
- The ensemble scheduler is now more flexible in allowing batching and non-batching models to be composed together in an ensemble.
- The ensemble scheduler will now keep tensors in GPU memory between models when possible. Doing so significantly increases performance of some ensembles by avoiding copies to and from system memory.
- The performance client, perf_client, now supports models with variable-sized input tensors.
Known Issues
- The ONNX Runtime backend could not be updated to the 0.5.0 release due to multiple performance and correctness issues with that release.
- In TensorRT 6:
  - Reformat-free I/O is not supported.
  - Only models that have a single optimization profile are currently supported.
- Google Kubernetes Engine (GKE) version 1.14 contains a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version to avoid this issue.
Client Libraries and Examples
Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.6.0_ubuntu1604.clients.tar.gz and v1.6.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.
Custom Backend SDK
Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.6.0_ubuntu1604.custombackend.tar.gz and v1.6.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.
Release 1.5.0, corresponding to NGC container 19.08
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 1.5.0
- Added a new execution mode that allows the inference server to start without loading any models from the model repository. Model loading and unloading is then controlled by a new GRPC/HTTP model control API.
- Added a new instance-group mode that allows TensorFlow models that explicitly distribute inferencing across multiple GPUs to run in that manner in the inference server.
- Improved input/output tensor reshape to allow variable-sized dimensions in tensors being reshaped.
- Added a C++ wrapper around the custom backend C API to simplify the creation of custom backends. This wrapper is included in the custom backend SDK.
- Improved the accuracy of the compute statistic reported for inference requests. Previously the compute statistic included some additional time beyond the actual compute time.
- The performance client, perf_client, now reports more information for ensemble models, including statistics for all contained models and the entire ensemble.
Client Libraries and Examples
Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.5.0_ubuntu1604.clients.tar.gz and v1.5.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.
Custom Backend SDK
Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.5.0_ubuntu1604.custombackend.tar.gz and v1.5.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.
Release 1.4.0, corresponding to NGC container 19.07
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 1.4.0
- Added libtorch as a new backend. PyTorch models manually decorated or automatically traced to produce TorchScript can now be run directly by the inference server (see the sketch after this list).
- Build system converted from Bazel to CMake. The new CMake-based build system is more transparent, portable, and modular.
- To simplify the creation of custom backends, a Custom Backend SDK and improved documentation are now available.
- Improved AsyncRun API in the C++ and Python client libraries.
- perf_client can now use user-supplied input data (previously perf_client could only use random or zero input data).
- perf_client now reports latency at multiple confidence percentiles (p50, p90, p95, p99) as well as a user-supplied percentile that is also used to stabilize latency results.
- Improvements to automatic model configuration creation (--strict-model-config=false).
- C++ and Python client libraries now allow additional HTTP headers to be specified when using the HTTP protocol.
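To illustrate the tracing path mentioned above, a minimal sketch of producing a TorchScript model with standard PyTorch APIs. The model-repository layout shown (a numbered version directory containing model.pt) is an assumption to verify against the server documentation:

```python
import os
import torch
import torchvision

# Trace a PyTorch model to TorchScript so the libtorch backend can run it directly.
model = torchvision.models.resnet50(pretrained=True).eval()
example_input = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example_input)

# Assumed repository layout: <repository>/<model-name>/<version>/model.pt
os.makedirs("model_repository/resnet50_libtorch/1", exist_ok=True)
traced.save("model_repository/resnet50_libtorch/1/model.pt")
```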
Known Issues
- Google Cloud Storage (GCS) support has been restored in this release.
Client Libraries and Examples
Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.4.0_ubuntu1604.clients.tar.gz and v1.4.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.
Custom Backend SDK
Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.4.0_ubuntu1604.custombackend.tar.gz and v1.4.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.
Release 1.3.0, corresponding to NGC container 19.06
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 1.3.0
- The ONNX Runtime (github.com/Microsoft/onnxruntime) is now integrated into the inference server. ONNX models can now be used directly in a model repository (see the sketch after this list).
- The HTTP health port may be specified independently of the inference and status HTTP port with the --http-health-port flag.
- Fixed a bug in perf_client that caused high CPU usage, which could lower the measured inferences/sec in some cases.
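As a brief illustration, a hedged sketch of exporting a small PyTorch model to ONNX and dropping it into a model repository; the `<model>/<version>/model.onnx` layout and default filename are assumptions to confirm against the documentation:

```python
import os
import torch
import torch.nn as nn

# Minimal model to export; any ONNX-exportable network works the same way.
class TinyNet(nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1.0

model = TinyNet().eval()
dummy_input = torch.randn(1, 16)

# Assumed repository layout: <repository>/<model-name>/<version>/model.onnx
os.makedirs("model_repository/tiny_onnx/1", exist_ok=True)
torch.onnx.export(model, dummy_input, "model_repository/tiny_onnx/1/model.onnx")
```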
Known Issues
- Google Cloud Storage (GCS) support is not available in the 19.06 release. Support for GCS is available on the master branch and will be re-enabled in the 19.07 release.
Client Libraries and Examples
Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.3.0_ubuntu1604.clients.tar.gz and v1.3.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.
Release 1.2.0, corresponding to NGC container 19.05
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 1.2.0
- Ensembling is now available. An ensemble represents a pipeline of one or more models and the connection of input and output tensors between those models. A single inference request to an ensemble will trigger the execution of the entire pipeline.
- Added a Helm chart that deploys a single TensorRT Inference Server into a Kubernetes cluster.
- The client Makefile now supports building for both Ubuntu 16.04 and Ubuntu 18.04. The Python wheel produced from the build is now compatible with both Python 2 and Python 3.
- The perf_client application now has a --percentile flag that can be used to report latencies instead of reporting average latency (which remains the default). For example, using --percentile=99 causes perf_client to report the 99th percentile latency.
- The perf_client application now has a -z option to use zero-valued input tensors instead of random values.
- Improved error reporting of incorrect input/output tensor names for TensorRT models.
- Added a --allow-gpu-metrics option to enable/disable reporting of GPU metrics.
Client Libraries and Examples
Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.2.0_ubuntu1604.clients.tar.gz and v1.2.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.
Release 1.1.0, corresponding to NGC container 19.04
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 1.1.0
- Client libraries and examples now build with a separate Makefile (a Dockerfile is also included for convenience).
- Input or output tensors with variable-size dimensions (indicated by -1 in the model configuration) can now represent tensors where the variable dimension has value 0 (zero).
- Zero-sized input and output tensors are now supported for batching models. This enables the inference server to support models that require inputs and outputs that have shape [ batch-size ].
- TensorFlow custom operations (C++) can now be built into the inference server. An example and documentation are included in this release.
Client Libraries and Examples
An Ubuntu 16.04 build of the client libraries and examples is included in this release in the attached v1.1.0.clients.tar.gz. See the documentation section 'Building the Client Libraries and Examples' for more information on using this file.