Improve performance tuning guide #6026

Tabrizian · 2023-07-05T21:21:20Z

A server or client container started in the background would exit immediately.

Updated the guide to use interactive containers and fixed some formatting issues.
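
For context, a minimal sketch of the interactive form, not the exact command from the guide. A detached container stops as soon as its main process exits, and a shell with no attached stdin exits at once; `-it` keeps stdin open and allocates a TTY. The image tag placeholder, the container name, and the `interactive_cmd` helper are assumptions for illustration. The helper only prints the command so the flags can be reviewed before running it.

```shell
#!/bin/sh
# Sketch only: IMAGE is a placeholder tag, not a value taken from the guide.
IMAGE="nvcr.io/nvidia/tritonserver:<xx.yy>-py3"

# Hypothetical helper: prints (does not run) an interactive docker invocation.
# $1 is the container name to use.
interactive_cmd() {
    # -i keeps stdin open, -t allocates a TTY, --rm removes the container on exit.
    echo "docker run -it --rm --name $1 $IMAGE"
}

interactive_cmd triton-server
```

Replace the placeholder tag with a real release before running the printed command; GPU, network, and volume flags are left out here and would need to be added as the guide specifies.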

* Changed copyright (triton-inference-server#5705) * Modify timeout test in L0_sequence_batcher to use portable backend (triton-inference-server#5696) * Modify timeout test in L0_sequence_batcher to use portable backend * Use identity backend that is built by default on Windows * updated upstream container name (triton-inference-server#5713) * Fix triton container version (triton-inference-server#5714) * Update the L0_model_config test expected error message (triton-inference-server#5684) * Use better value in timeout test L0_sequence_batcher (triton-inference-server#5716) * Use better value in timeout test L0_sequence_batcher * Format * Update JAX install (triton-inference-server#5613) * Add notes about socket usage to L0_client_memory_growth test (triton-inference-server#5710) * Check TensorRT error message more granularly (triton-inference-server#5719) * Check TRT err msg more granularly * Clarify source of error messages * Consolidate tests for message parts * Pin Python Package Versions for HTML Document Generation (triton-inference-server#5727) * updating with pinned versions for python dependencies * updated with pinned sphinx and nbclient versions * Test full error returned when custom batcher init fails (triton-inference-server#5729) * Add testing for batcher init failure, add wait for status check * Formatting * Change search string * Add fastertransformer test (triton-inference-server#5500) Add fastertransformer test that uses 1GPU. 
* Fix L0_backend_python on Jetson (triton-inference-server#5728) * Don't use mem probe in Jetson * Clarify failure messages in L0_backend_python * Update copyright * Add JIRA ref, fix _test_jetson * Add testing for Python custom metrics API (triton-inference-server#5669) * Add testing for python custom metrics API * Add custom metrics example to the test * Fix for CodeQL report * Fix test name * Address comment * Add logger and change the enum usage * Add testing for Triton Client Plugin API (triton-inference-server#5706) * Add HTTP client plugin test * Add testing for HTTP asyncio * Add async plugin support * Fix qa container for L0_grpc * Add testing for grpc client plugin * Remove unused imports * Fix up * Fix L0_grpc models QA folder * Update the test based on review feedback * Remove unused import * Add testing for .plugin method * Install jemalloc (triton-inference-server#5738) * Add --metrics-address and testing (triton-inference-server#5737) * Add --metrics-address, add tests to L0_socket, re-order CLI options for consistency * Use non-localhost address * Add testing for basic auth plugin for HTTP/gRPC clients (triton-inference-server#5739) * Add HTTP basic auth test * Add testing for gRPC basic auth * Fix up * Remove unused imports * Add multi-gpu, multi-stream testing for dlpack tensors (triton-inference-server#5550) * Add multi-gpu, multi-stream testing for dlpack tensors * Update note on SageMaker MME support for ensemble (triton-inference-server#5723) * Run L0_backend_python subtests with virtual environment (triton-inference-server#5753) * Update 'main' to track development of 2.35.0 / r23.06 (triton-inference-server#5764) * Include jemalloc into the documentation (triton-inference-server#5760) * Enhance tests in L0_model_update (triton-inference-server#5724) * Add model instance name update test * Add gap for timestamp to update * Add some tests with dynamic batching * Extend supported test on rate limit off * Continue test if off mode failed * Fix 
L0_memory_growth (triton-inference-server#5795) (1) reduce MAX_ALLOWED_ALLOC to be more strict for bounded tests, and generous for unbounded tests. (2) allow unstable measurement from PA. (3) improve logging for future triage * Add note on --metrics-address (triton-inference-server#5800) * Add note on --metrics-address * Copyright * Minor fix for running "mlflow deployments create -t triton --flavor triton ..." (triton-inference-server#5658) UnboundLocalError: local variable 'meta_dict' referenced before assignment The above error shows in listing models in Triton model repository * Adding test for new sequence mode (triton-inference-server#5771) * Adding test for new sequence mode * Update option name * Clean up testing spacing and new lines * MLFlow Triton Plugin: Add support for s3 prefix and custom endpoint URL (triton-inference-server#5686) * MLFlow Triton Plugin: Add support for s3 prefix and custom endpoint URL Signed-off-by: Xiaodong Ye <[email protected]> * Update the function order of config.py and use os.path.join to replace filtering a list of strings then joining Signed-off-by: Xiaodong Ye <[email protected]> * Update onnx flavor to support s3 prefix and custom endpoint URL Signed-off-by: Xiaodong Ye <[email protected]> * Fix two typos in MLFlow Triton plugin README.md Signed-off-by: Xiaodong Ye <[email protected]> * Address review comments (replace => strip) Signed-off-by: Xiaodong Ye <[email protected]> * Address review comments (init regex only for s3) Signed-off-by: Xiaodong Ye <[email protected]> * Remove unused local variable: slash_locations Signed-off-by: Xiaodong Ye <[email protected]> --------- Signed-off-by: Xiaodong Ye <[email protected]> * Fix client script (triton-inference-server#5806) * Add MLFlow test for already loaded models. 
Update copyright year (triton-inference-server#5808) * Use the correct gtest filter (triton-inference-server#5824) * Add error message test on S3 access decline (triton-inference-server#5825) * Add test on access decline * Fix typo * Add MinIO S3 access decline test * Make sure bucket exists during access decline test * Restore AWS_SECRET_ACCESS_KEY on S3 local test (triton-inference-server#5832) * Restore AWS_SECRET_ACCESS_KEY * Add reason for restoring keys * nnshah1 stream infer segfault fix (triton-inference-server#5842) match logic from infer_handler.cc * Remove unused test (triton-inference-server#5851) * Add and document memory usage in statistic protocol (triton-inference-server#5642) * Add and document memory usage in statistic protocol * Fix doc * Fix up * [DO NOT MERGE Add test. FIXME: model generation * Fix up * Fix style * Address comment * Fix up * Set memory tracker backend option in build.py * Fix up * Add CUPTI library in Windows image build * Add note to build with memory tracker by default * use correct lib dir on CentOS (triton-inference-server#5836) * use correct lib dir on CentOS * use new location for opentelemetry-cpp * Document that gpu-base flag is optional for cpu-only builds (triton-inference-server#5861) * Update Jetson tests in Docker container (triton-inference-server#5734) * Add flags for ORT build * Separate list with commas * Remove unnecessary detection of nvcc compiler * Fixed Jetson path for perf_client, datadir * Create version directoryy for custom model * Remove probe check for shm, add shm exceed error for Jetson * Copyright updates, fix Jetson Probe * Fix be_python test num on Jetson * Remove extra comma, non-Dockerized Jetson comment * Remove comment about Jetson being non-dockerized * Remove no longer needed flag * Update `main` post-23.05 release (triton-inference-server#5880) * Update README and versions for 23.05 branch * Changes to support 23.05 (triton-inference-server#5782) * Update python and conda version * Update 
CMAKE installation * Update checksum version * Update ubuntu base image to 22.04 * Use ORT 1.15.0 * Set CMAKE to pull latest version * Update libre package version * Removing unused argument * Adding condition for ubuntu 22.04 * Removing installation of the package from the devel container * Nnshah1 u22.04 (triton-inference-server#5770) * Update CMAKE installation * Update python and conda version * Update CMAKE installation * Update checksum version * Update ubuntu base image to 22.04 * updating versions for ubuntu 22.04 * remove re2 --------- Co-authored-by: Neelay Shah <[email protected]> Co-authored-by: Neelay Shah <[email protected]> * Set ONNX version to 1.13.0 * Fix L0_custom_ops for ubuntu 22.04 (triton-inference-server#5775) * add back rapidjson-dev --------- Co-authored-by: Neelay Shah <[email protected]> Co-authored-by: Neelay Shah <[email protected]> Co-authored-by: nv-kmcgill53 <[email protected]> * Fix L0_mlflow (triton-inference-server#5805) * working thread * remove default install of blinker * merge issue fixed * Fix L0_backend_python/env test (triton-inference-server#5799) * Fix L0_backend_python/env test * Address comment * Update the copyright * Fix up * Fix L0_http_fuzz (triton-inference-server#5776) * installing python 3.8.16 for test * spelling Co-authored-by: Neelay Shah <[email protected]> * use util functions to install python3.8 in an easier way --------- Co-authored-by: Neelay Shah <[email protected]> * Update Windows versions for 23.05 release (triton-inference-server#5826) * Rename Ubuntu 20.04 mentions to 22.04 (triton-inference-server#5849) * Update DCGM version (triton-inference-server#5856) * Update DCGM version (triton-inference-server#5857) * downgrade DCGM version to 2.4.7 (triton-inference-server#5860) * Updating link for latest release notes to 23.05 --------- Co-authored-by: Neelay Shah <[email protected]> Co-authored-by: Neelay Shah <[email protected]> Co-authored-by: nv-kmcgill53 <[email protected]> Co-authored-by: Iman 
Tabrizian <[email protected]> * Disable memory tracker on Jetpack until the library is available (triton-inference-server#5882) * Fix datadir for x86 (triton-inference-server#5894) * Add more test on instance signature (triton-inference-server#5852) * Add testing for new error handling API (triton-inference-server#5892) * Test batch input for libtorch (triton-inference-server#5855) * Draft ragged TensorRT unit model gen * Draft libtorch special identity model * Autoformat * Update test, fix ragged model gen * Update suffix for io for libtorch * Remove unused variables * Fix io names for libtorch * Use INPUT0/OUTPUT0 for libtorch * Reorder to match test model configs * Remove unnecessary capitalization * Auto-format * Capitalization is necessary * Remove unnecessary export * Clean up Azure dependency in server build (triton-inference-server#5900) * [DO NOT MERGE] * Remove Azure dependency in server component build * Finalize * Fix dependency * Fixing up * Clean up * Add response parameters for streaming GRPC inference to enhance decoupled support (triton-inference-server#5878) * Update 'main' to track development of 2.36.0 / 23.07 (triton-inference-server#5917) * Add test for detecting S3 http2 upgrade request (triton-inference-server#5911) * Add test for detecting S3 http2 upgrade request * Enhance testing * Copyright year update * Add Redis cache build, tests, and docs (triton-inference-server#5916) * Updated handling for uint64 request priority * Ensure HPCX dependencies found in container (triton-inference-server#5922) * Add HPCX dependencies to search path * Copy hpcx to CPU-only container * Add ucc path to CPU-only image * Fixed if statement * Fix df variable * Combine hpcx LD_LIBRARY_PATH * Add test case where MetricFamily is deleted before deleting Metric (triton-inference-server#5915) * Add test case for metric lifetime error handling * Address comment * Use different MetricFamily name * Add testing for Pytorch instance group kind MODEL 
(triton-inference-server#5810) * Add testing for Pytorch instance group kind MODEL * Remove unused item * Update testing to verify the infer result * Add copyright * Remove unused import * Update pip install * Update the model to use the same add sub logic * Add torch multi-gpu and multi-device models to L0_io * Fix up model version * Add test for sending instance update config via load API (triton-inference-server#5937) * Add test for passing config via load api * Add more docs on instance update behavior * Update to suggested docs Co-authored-by: Ryan McCormick <[email protected]> * Use dictionary for json config * Modify the config fetched from Triton instead --------- Co-authored-by: Ryan McCormick <[email protected]> * Fix L0_batcher count check (triton-inference-server#5939) * Add testing for json tensor format (triton-inference-server#5914) * Add redis config and use local logfile for redis server (triton-inference-server#5945) * Add redis config and use local logfile for redis server * Move redis log config to CLI * Have separate redis logs for unit tests and CLI tests * Add test on rate limiter max resource decrease update (triton-inference-server#5885) * Add test on rate limiter max resource decrease update * Add test with explicit resource * Check server log for decreased resource limit * Add docs on decoupled final response feature (triton-inference-server#5936) * Allow changing ping behavior based on env variable in SageMaker and entrypoint updates (triton-inference-server#5910) * Allow changing ping behavior based on env variable in SageMaker * Add option for additional args * Make ping further configurable * Allow further configuration of grpc and http ports * Update docker/sagemaker/serve * Update docker/sagemaker/serve --------- Co-authored-by: GuanLuo <[email protected]> * Remove only MPI libraries in HPCX in L0_perf_analyzer (triton-inference-server#5967) * Be more specific with MPI removal * Delete all libmpi libs * Ensure L0_batch_input 
requests received in order (triton-inference-server#5963) * Add print statements for debugging * Add debugging print statements * Test using grpc client with stream to fix race * Use streaming client in all non-batch tests * Switch all clients to streaming GRPC * Remove unused imports, vars * Address comments * Remove random comment * Set inputs as separate function * Split set inputs based on test type * Add test for redis cache auth credentials via env vars (triton-inference-server#5966) * Auto-formatting (triton-inference-server#5979) * Auto-format * Change to clang-format-15 in CONTRIBTUING * Adding tests ensuring locale setting is passed to python backend interpreter * Refactor build.py CPU-only Linux libs for readability (triton-inference-server#5990) * Improve the error message when the number of GPUs is insufficient (triton-inference-server#5993) * Update README to include CPP-API Java Bindings (triton-inference-server#5883) * Update env variable to use for overriding /ping behavior (triton-inference-server#5994) * Add test that >1000 model files can be loaded in S3 (triton-inference-server#5976) * Add test for >1000 files * Capitalization for consistency * Add bucket cleaning at end * Move test pass/fail to end * Check number of files in model dir at load time * Add testing for GPU tensor error handling (triton-inference-server#5871) * Add testing for GPU tensor error handling * Fix up * Remove exit 0 * Fix jetson * Fix up * Add test for Python BLS model loading API (triton-inference-server#5980) * Add test for Python BLS model loading API * Fix up * Update README and versions for 23.06 branch * Fix LD_LIBRARY_PATH for PyTorch backend * Return updated df in add_cpu_libs * Remove unneeded df param * Update test failure messages to match Dataloader changes (triton-inference-server#6006) * Add dependency for L0_python_client_unit_tests (triton-inference-server#6010) * Improve performance tuning guide (triton-inference-server#6026) * Enabling nested spans for 
trace mode OpenTelemetry (triton-inference-server#5928) * Adding nested spans to OTel tracing + support of ensemble models * Move multi-GPU dlpack test to a separate L0 test (triton-inference-server#6001) * Move multi-GPU dlpack test to a separate L0 test * Fix copyright * Fix up * OpenVINO 2023.0.0 (triton-inference-server#6031) * Upgrade OV to 2023.0.0 * Upgrade OV model gen script to 2023.0.0 * Add test to check the output memory type for onnx models (triton-inference-server#6033) * Add test to check the output memory type for onnx models * Remove unused import * Address comment * Add testing for implicit state for PyTorch backend (triton-inference-server#6016) * Add testing for implicit state for PyTorch backend * Add testing for libtorch string implicit models * Fix CodeQL * Mention that libtorch backend supports implicit state * Fix CodeQL * Review edits * Fix output tests for PyTorch backend * Allow uncompressed conda execution enviroments (triton-inference-server#6005) Add test for uncompressed conda execution enviroments * Fix implicit state test (triton-inference-server#6039) * Adding target_compile_features cxx_std_17 to tracing lib (triton-inference-server#6040) * Update 'main' to track development of 2.37.0 / 23.08 * Fix intermittent failure in L0_model_namespacing (triton-inference-server#6052) * Fix PyTorch implicit model mounting in gen_qa_model_repository (triton-inference-server#6054) * Fix broken links pointing to the `grpc_server.cc` file (triton-inference-server#6068) * Fix L0_backend_python expected instance name (triton-inference-server#6073) * Fix expected instance name * Copyright year * Fix L0_sdk: update the search name for the client wheel (triton-inference-server#6074) * Fix name of client wheel to be looked for * Fix up * Add GitHub action to format and lint code (triton-inference-server#6022) * Add pre-commit * Fix typos, exec/shebang, formatting * Remove clang-format * Update contributing md to include pre-commit * Update spacing in 
CONTRIBUTING * Fix contributing pre-commit link * Link to pre-commit install directions * Wording * Restore clang-format * Fix yaml spacing * Exclude templates folder for check-yaml * Remove unused vars * Normalize spacing * Remove unused variable * Normalize config indentation * Update .clang-format to enforce max line length of 80 * Update copyrights * Update copyrights * Run workflows on every PR * Fix copyright year * Fix grammar * Entrypoint.d files are not executable * Run pre-commit hooks * Mark not executable * Run pre-commit hooks * Remove unused variable * Run pre-commit hooks after rebase * Update copyrights * Fix README.md typo (decoupled) Co-authored-by: Ryan McCormick <[email protected]> * Run pre-commit hooks * Grammar fix Co-authored-by: Ryan McCormick <[email protected]> * Redundant word Co-authored-by: Ryan McCormick <[email protected]> * Revert docker file changes * Executable shebang revert * Make model.py files non-executable * Passin is proper flag * Run pre-commit hooks on init_args/model.py * Fix typo in init_args/model.py * Make copyrights one line --------- Co-authored-by: Ryan McCormick <[email protected]> * Fix default instance name change when count is 1 (triton-inference-server#6088) * Add test for sequence model instance update (triton-inference-server#5831) * Add test for sequence model instance update * Add gap for file timestamp update * Update test for non-blocking sequence update * Update documentation * Remove mentioning increase instance count case * Add more documentaion for scheduler update test * Update test for non-blocking batcher removal * Add polling due to async scheduler destruction * Use _ as private * Fix typo * Add docs on instance count decrease * Fix typo * Separate direct and oldest to different test cases * Separate nested tests in a loop into multiple test cases * Refactor scheduler update test * Improve doc on handling future test failures * Address pre-commit * Add best effort to reset model state after a 
single test case failure * Remove reset model method to make harder for chaining multiple test cases as one * Remove description on model state clean up * Fix default instance name (triton-inference-server#6097) * Removing unused tests (triton-inference-server#6085) * Update post-23.07 release (triton-inference-server#6103) * Update README and versions for 2.36.0 / 23.07 * Update Dockerfile.win10.min * Fix formating issue * fix formating issue * Fix whitespaces * Fix whitespaces * Fix whitespaces * Improve asyncio testing (triton-inference-server#6122) * Reduce instance count to 1 for python bls model loading test (triton-inference-server#6130) * Reduce instance count to 1 for python bls model loading test * Add comment when calling unload * Fix queue test to expect exact number of failures (triton-inference-server#6133) * Fix queue test to expect exact number of failures * Increase the execution time to more accurately capture requests * Add CPU & GPU metrics in Grafana dashboard.json for K8s op prem deployment (fix triton-inference-server#6047) (triton-inference-server#6100) Signed-off-by: Xiaodong Ye <[email protected]> * Adding the support tracing of child models invoked from a BLS model (triton-inference-server#6063) * Adding tests for bls * Added fixme, cleaned previous commit * Removed unused imports * Fixing commit tree: Refactor code, so that OTel tracer provider is initialized only once Added resource cmd option, testig Added docs * Clean up * Update docs/user_guide/trace.md Co-authored-by: Ryan McCormick <[email protected]> * Revision * Update doc * Clean up * Added ostream exporter to OpenTelemetry for testing purposes; refactored trace tests * Added opentelemetry trace collector set up to tests; refactored otel exporter tests to use OTel collector instead of netcat * Revising according to comments * Added comment regarding 'parent_span_id' * Added permalink * Adjusted test --------- Co-authored-by: Ryan McCormick <[email protected]> * Test python 
environments 3.8-3.11 (triton-inference-server#6109) Add tests for python 3.8-3.11 for L0_python_backends * Improve L0_backend_python debugging (triton-inference-server#6157) * Improve L0_backend_python debugging * Use utils function for artifacts collection * Add unreachable output test for reporting source of disconnectivity (triton-inference-server#6149) * Update 'main' to track development of 2.38.0 / 23.09 (triton-inference-server#6163) * Fix the versions in the doc (triton-inference-server#6164) * Update docs with NVAIE messaging (triton-inference-server#6162) Update docs with NVAIE messaging * Add sanity tests for parallel instance loading (triton-inference-server#6126) * Remove extra whitespace (triton-inference-server#6174) * Remove a test case that sanity checks input value of --shape CLI flag (triton-inference-server#6140) * Remove test checking for --shape option * Remove the entire test * Add test when unload/load requests for same model is received at the same time (triton-inference-server#6150) * Add test when unload/load requests for same model received the same time * Add test_same_model_overlapping_load_unload * Use a load/unload stress test instead * Pre-merge test name update * Address pre-commit error * Revert "Address pre-commit error" This reverts commit 781cab1. 
* Record number of occurrence of each exception * Make assert failures clearer in L0_trt_plugin (triton-inference-server#6166) * Add end-to-end CI test for decoupled model support (triton-inference-server#6131) (triton-inference-server#6184) * Add end-to-end CI test for decoupled model support * Address feedback * Test preserve_ordering for oldest strategy sequence batcher (triton-inference-server#6185) * added debugging guide (triton-inference-server#5924) * added debugging guide * Run pre-commit --------- Co-authored-by: David Yastremsky <[email protected]> * Add deadlock gdb section to debug guide (triton-inference-server#6193) * Fix character escape in model repository documentation (triton-inference-server#6197) * Fix docs test (triton-inference-server#6192) * Add utility functions for array manipulation (triton-inference-server#6203) * Add utility functions for outlier removal * Fix functions * Add newline to end of file * Add gc collect to make sure gpu tensor is deallocated (triton-inference-server#6205) * Testing: add gc collect to make sure gpu tensor is deallocated * Address comment * Check for log error on failing to find explicit load model (triton-inference-server#6204) * Set default shm size to 1MB for Python backend (triton-inference-server#6209) * Trace Model Name Validation (triton-inference-server#6199) * Initial commit * Cleanup using new standard formatting * QA test restructuring * Add newline to the end of test.sh * HTTP/GRCP protocol changed to pivot on ready status & error status. Log file name changed in qa test. 
* Fixing unhandled error memory leak * Handle index function memory leak fix * Fix the check for error message (triton-inference-server#6226) * Fix copyright for debugging guide (triton-inference-server#6225) * Add watts units to GPU power metric descriptions (triton-inference-server#6242) * Update post-23.08 release (triton-inference-server#6234) * CUDA 12.1 > 12.2 * DLIS-5208: onnxruntime+windows - stop treat warnings on compile as errors * Revert "DLIS-5208: onnxruntime+windows - stop treat warnings on compile as errors" This reverts commit 0cecbb7. * Update Dockerfile.win10.min * Update Dockerfile.win10.min * Update README and versions for 23.08 branch * Update Dockerfile.win10 * Fix the versions in docs * Add the note about stabilization of the branch * Update docs with NVAIE messaging (triton-inference-server#6162) (triton-inference-server#6167) Update docs with NVAIE messaging Co-authored-by: David Zier <[email protected]> * Resolve merge conflict --------- Co-authored-by: tanmayv25 <[email protected]> Co-authored-by: David Zier <[email protected]> * Add tests/docs for queue size (pending request count) metric (triton-inference-server#6233) * Adding safe string to number conversions (triton-inference-server#6173) * Added catch for out of range error for trace setting update * Added wrapper to safe parse options * Added option names to errors * Adjustments * Quick fix * Fixing option name for Windows * Removed repetitive code * Adjust getopt_long for Windows to use longindex * Moved try catch into ParseOption * Removed unused input * Improved names * Refactoring and clean up * Fixed Windows * Refactored getopt_long for Windows * Refactored trace test, pinned otel's collector version to avoid problems with go requirements * Test Python execute() to return Triton error code (triton-inference-server#6228) * Add test for Python execute error code * Add all supported error codes into test * Move ErrorCode into TritonError * Expose ErrorCode internal in TritonError 
* Add docs on IPv6 (triton-inference-server#6262) * Add test for TensorRT version-compatible model support (triton-inference-server#6255) * Add tensorrt version-compatibility test * Generate one version-compatible model * Fix copyright year * Remove unnecessary variable * Remove unnecessary line * Generate TRT version-compatible model * Add sample inference to TRT version-compatible test * Clean up utils and model gen for new plan model * Fix startswith capitalization * Remove unused imports * Remove unused imports * Add log check * Upgrade protobuf version (triton-inference-server#6268) * Add testing for retrieving shape and datatype in backend API (triton-inference-server#6231) Add testing for retrieving output shape and datatype info from backend API * Update 'main' to track development of 2.39.0 / 23.10 (triton-inference-server#6277) * Apply UCX workaround (triton-inference-server#6254) * Add ensemble parameter forwarding test (triton-inference-server#6284) * Exclude extra TRT version-compatible models from tests (triton-inference-server#6294) * Exclude compatible models from tests. * Force model removal, in case it does not exist Co-authored-by: Ryan McCormick <[email protected]> --------- Co-authored-by: Ryan McCormick <[email protected]> * Adding installation of docker and docker-buildx (triton-inference-server#6299) * Adding installation of docker and docker-buildx * remove whitespace * Use targetmodel from header as model name in SageMaker (triton-inference-server#6147) * Use targetmodel from header as model name in SageMaker * Update naming for model hash * Add more error messages, return codes, and refactor HTTP server (triton-inference-server#6297) * Fix typo (triton-inference-server#6318) * Update the request re-use example (triton-inference-server#6283) * Update the request re-use example * Review edit * Review comment * Disable developer tools build for In-process API + JavaCPP tests (triton-inference-server#6296) * Add Python binding build. 
Add L0_python_api to test Python binding (triton-inference-server#6319) * Add L0_python_api to test Python binding * Install Python API in CI image * Fix QA build * Increase network timeout for valgrind (triton-inference-server#6324) * Tests and docs for ability to specify subdirectory to download for LocalizePath (triton-inference-server#6308) * Added custom localization tests for s3 and azure, added docs * Refactor HandleInfer into more readable chunks (triton-inference-server#6332) * Refactor model generation scripts (triton-inference-server#6336) * Refactor model generation scripts * Fix codeql * Fix relative path import * Fix package structure * Copy the gen_common file * Add missing uint8 * Remove duplicate import * Add testing for scalar I/O in ORT backend (triton-inference-server#6343) * Add testing for scalar I/O in ORT backend * Review edit * ci * Update post-23.09 release (triton-inference-server#6367) * Update README and versions for 23.09 branch (triton-inference-server#6280) * Update `Dockerfile` and `build.py` (triton-inference-server#6281) * Update configuration for Windows Dockerfile (triton-inference-server#6256) * Adding installation of docker and docker-buildx * Enable '--expt-relaxed-constexpr' flag for custom ops models * Upate Dockerfile version * Disable unit tests for Jetson * Update condition (triton-inference-server#6285) * removing Whitespaces (triton-inference-server#6293) * removing Whitespaces * removing whitespaces * Add security policy (triton-inference-server#6376) * Adding client-side request cancellation support and testing (triton-inference-server#6383) * Add L0_request_cancellation (triton-inference-server#6252) * Add L0_request_cancellation * Remove unittest test * Add cancellation to gRPC server error handling * Fix up * Use identity model * Add tests for gRPC client-side cancellation (triton-inference-server#6278) * Add tests for gRPC client-side cancellation * Fix CodeQL issues * Formatting * Update 
qa/L0_client_cancellation/client_cancellation_test.py Co-authored-by: Ryan McCormick <[email protected]> * Move to L0_request_cancellation * Address review comments * Removing request cancellation support from asyncio version * Format * Update copyright * Remove tests * Handle cancellation notification in gRPC server (triton-inference-server#6298) * Handle cancellation notification in gRPC server * Fix the request ptr initialization * Update src/grpc/infer_handler.h Co-authored-by: Ryan McCormick <[email protected]> * Address review comment * Fix logs * Fix request complete callback by removing reference to state * Improve documentation --------- Co-authored-by: Ryan McCormick <[email protected]> --------- Co-authored-by: Ryan McCormick <[email protected]> * Fixes on the gRPC frontend to handle AsyncNotifyWhenDone() API (triton-inference-server#6345) * Fix segmentation fault in gRPC frontend * Finalize all states upon completion * Fixes all state cleanups * Handle completed states when cancellation notification is received * Add more documentation steps * Retrieve dormant states to minimize the memory footprint for long streams * Update src/grpc/grpc_utils.h Co-authored-by: Ryan McCormick <[email protected]> * Use a boolean state instead of raw pointer --------- Co-authored-by: Ryan McCormick <[email protected]> * Add L0_grpc_state_cleanup test (triton-inference-server#6353) * Add L0_grpc_state_cleanup test * Add model file in QA container * Fix spelling * Add remaining subtests * Add failing subtests * Format fixes * Fix model repo * Fix QA docker file * Remove checks for the error message when shutting down server * Fix spelling * Address review comments * Add schedulers request cancellation tests (triton-inference-server#6309) * Add schedulers request cancellation tests * Merge gRPC client test * Reduce testing time and covers cancelling other requests as a consequence of request cancellation * Add streaming request cancellation test --------- Co-authored-by: 
Iman Tabrizian <[email protected]> Co-authored-by: Ryan McCormick <[email protected]> Co-authored-by: Jacky <[email protected]> * Add missing copyright (triton-inference-server#6388) * Add basic generate endpoints for LLM tasks (triton-inference-server#6366) * PoC of parsing request prompt and converting to Triton infer request * Remove extra trace * Add generate endpoint * Enable streaming version * Fix bug * Fix up * Add basic testing. Cherry pick from triton-inference-server#6369 * format * Address comment. Fix build * Minor cleanup * cleanup syntax * Wrap error in SSE format * Fix up * Restrict number of response on non-streaming generate * Address comment on implementation. * Re-enable trace on generate endpoint * Add more comprehensive llm endpoint tests (triton-inference-server#6377) * Add security policy (triton-inference-server#6376) * Start adding some more comprehensive tests * Fix test case * Add response error testing * Complete test placeholder * Address comment * Address comments * Fix code check --------- Co-authored-by: dyastremsky <[email protected]> Co-authored-by: GuanLuo <[email protected]> * Address comment * Address comment * Address comment * Fix typo --------- Co-authored-by: Ryan McCormick <[email protected]> Co-authored-by: dyastremsky <[email protected]> * Add Python backend request cancellation test (triton-inference-server#6364) * Add cancelled response status test * Add Python backend request cancellation test * Add Python backend decoupled request cancellation test * Simplified response if cancelled * Test response_sender.send() after closed * Rollback test response_sender.send() after closed * Rollback non-decoupled any response on cancel * Add TRT-LLM backend build to Triton (triton-inference-server#6365) (triton-inference-server#6392) * Add TRT-LLM backend build to Triton (triton-inference-server#6365) * Add trtllm backend to build * Temporarily adding version map for 23.07 * Fix build issue * Update comment * Comment out python 
binding changes * Add post build * Update trtllm backend naming * Update TRTLLM base image * Fix cmake arch * Revert temp changes for python binding PR * Address comment * Move import to the top (triton-inference-server#6395) * Move import to the top * pre commit format * Add Python backend when vLLM backend built (triton-inference-server#6397) * Update build.py to build vLLM backend (triton-inference-server#6394) * Support parameters object in generate route * Update 'main' to track development of 2.40.0 / 23.11 (triton-inference-server#6400) * Fix L0_sdk (triton-inference-server#6387) * Add documentation on request cancellation (triton-inference-server#6403) * Add documentation on request cancellation * Include python backend * Update docs/user_guide/request_cancellation.md Co-authored-by: Iman Tabrizian <[email protected]> * Update docs/user_guide/request_cancellation.md Co-authored-by: Neelay Shah <[email protected]> * Update docs/README.md Co-authored-by: Neelay Shah <[email protected]> * Update docs/user_guide/request_cancellation.md Co-authored-by: Ryan McCormick <[email protected]> * Remove inflight term from the main documentation * Address review comments * Fix * Update docs/user_guide/request_cancellation.md Co-authored-by: Jacky <[email protected]> * Fix --------- Co-authored-by: Iman Tabrizian <[email protected]> Co-authored-by: Neelay Shah <[email protected]> Co-authored-by: Ryan McCormick <[email protected]> Co-authored-by: Jacky <[email protected]> * Fixes in request cancellation doc (triton-inference-server#6409) * Document generate HTTP endpoint (triton-inference-server#6412) * Document generate HTTP endpoint * Address comment * Fix up * format * Address comment * Update SECURITY.md to not display commented copyright (triton-inference-server#6426) * Fix missing library in L0_data_compression (triton-inference-server#6424) * Fix missing library in L0_data_compression * Fix up * Add Javacpp-presets repo location as env variable in Java 
tests(triton-inference-server#6385) Simplify testing when upstream (javacpp-presets) build changes. Related to triton-inference-server/client#409 * TRT-LLM backend build changes (triton-inference-server#6406) * Update url * Debugging * Debugging * Update url * Fix build for TRT-LLM backend * Remove TRTLLM TRT and CUDA versions * Fix up unused var * Fix up dir name * FIx cmake patch * Remove previous TRT version * Install required packages for example models * Remove packages that are only needed for testing * Add gRPC AsyncIO request cancellation tests (triton-inference-server#6408) * Fix gRPC test failure and refactor * Add gRPC AsyncIO cancellation tests * Better check if a request is cancelled * Use f-string * Fix L0_implicit_state (triton-inference-server#6427) * Fixing vllm build (triton-inference-server#6433) * Fixing torch version for vllm * Switch Jetson model TensorRT models generation to container (triton-inference-server#6378) * Switch Jetson model TensorRT models generation to container * Adding missed file * Fix typo * Fix typos * Remove extra spaces * Fix typo * Bumped vllm version (triton-inference-server#6444) * Adjust test_concurrent_same_model_load_unload_stress (triton-inference-server#6436) * Adding emergency vllm latest release (triton-inference-server#6454) * Fix notify state destruction and inflight states tracking (triton-inference-server#6451) * Ensure notify_state_ gets properly destructed * Fix inflight state tracking to properly erase states * Prevent removing the notify_state from being erased * Wrap notify_state_ object within unique_ptr * Update TRT-LLM backend url (triton-inference-server#6455) * TRTLLM backend post release * TRTLLM backend post release * Update submodule url for permission issue * Update submodule url * Fix up * Not using postbuild function to workaround submodule url permission issue * Added docs on python based backends (triton-inference-server#6429) Co-authored-by: Neelay Shah <[email protected]> * 
L0_model_config Fix (triton-inference-server#6472) * Minor fix for L0_model_config * Add test for Python model parameters (triton-inference-server#6452) * Test Python BLS with different sizes of CUDA memory pool (triton-inference-server#6276) * Test with different sizes of CUDA memory pool * Check the server log for error message * Improve debugging * Fix syntax * Add documentation for K8s-onprem StartupProbe (triton-inference-server#5257) Co-authored-by: dyastremsky <[email protected]> Co-authored-by: Ryan McCormick <[email protected]> * Update `main` post-23.10 release (triton-inference-server#6484) * Update README and versions for 23.10 branch (triton-inference-server#6399) * Cherry-picking vLLM backend changes (triton-inference-server#6404) * Update build.py to build vLLM backend (triton-inference-server#6394) * Add Python backend when vLLM backend built (triton-inference-server#6397) --------- Co-authored-by: dyastremsky <[email protected]> * Add documentation on request cancellation (triton-inference-server#6403) (triton-inference-server#6407) * Add documentation on request cancellation * Include python backend * Update docs/user_guide/request_cancellation.md * Update docs/user_guide/request_cancellation.md * Update docs/README.md * Update docs/user_guide/request_cancellation.md * Remove inflight term from the main documentation * Address review comments * Fix * Update docs/user_guide/request_cancellation.md * Fix --------- Co-authored-by: Iman Tabrizian <[email protected]> Co-authored-by: Neelay Shah <[email protected]> Co-authored-by: Ryan McCormick <[email protected]> Co-authored-by: Jacky <[email protected]> * Fixes in request cancellation doc (triton-inference-server#6409) (triton-inference-server#6410) * TRT-LLM backend build changes (triton-inference-server#6406) (triton-inference-server#6430) * Update url * Debugging * Debugging * Update url * Fix build for TRT-LLM backend * Remove TRTLLM TRT and CUDA versions * Fix up unused var * Fix up dir name * 
FIx cmake patch * Remove previous TRT version * Install required packages for example models * Remove packages that are only needed for testing * Fixing vllm build (triton-inference-server#6433) (triton-inference-server#6437) * Fixing torch version for vllm Co-authored-by: Olga Andreeva <[email protected]> * Update TRT-LLM backend url (triton-inference-server#6455) (triton-inference-server#6460) * TRTLLM backend post release * TRTLLM backend post release * Update submodule url for permission issue * Update submodule url * Fix up * Not using postbuild function to workaround submodule url permission issue * remove redundant lines * Revert "remove redundant lines" This reverts commit 86be7ad. * restore missed lines * Update build.py Co-authored-by: Olga Andreeva <[email protected]> * Update build.py Co-authored-by: Olga Andreeva <[email protected]> --------- Co-authored-by: Tanmay Verma <[email protected]> Co-authored-by: dyastremsky <[email protected]> Co-authored-by: Iman Tabrizian <[email protected]> Co-authored-by: Neelay Shah <[email protected]> Co-authored-by: Ryan McCormick <[email protected]> Co-authored-by: Jacky <[email protected]> Co-authored-by: Kris Hung <[email protected]> Co-authored-by: Katherine Yang <[email protected]> Co-authored-by: Olga Andreeva <[email protected]> * Adding structure reference to the new document (triton-inference-server#6493) * Improve L0_backend_python test stability (ensemble / gpu_tensor_lifecycle) (triton-inference-server#6490) * Test torch allocator gpu memory usage directly rather than global gpu memory for more consistency * Add L0_generative_sequence test (triton-inference-server#6475) * Add testing backend and test * Add test to build / CI. Minor fix on L0_http * Format. Update backend documentation * Fix up * Address comment * Add negative testing * Fix up * Downgrade vcpkg version (triton-inference-server#6503) * Collecting sub dir artifacts in GitLab yaml. Removing collect function from test script. 
(triton-inference-server#6499) * Use post build function for TRT-LLM backend (triton-inference-server#6476) * Use postbuild function * Remove updating submodule url * Enhanced python_backend autocomplete (triton-inference-server#6504) * Added testing for python_backend autocomplete: optional input and model_transaction_policy * Parse reuse-grpc-port and reuse-http-port as booleans (triton-inference-server#6511) Co-authored-by: Francesco Petrini <[email protected]> * Fixing L0_io (triton-inference-server#6510) * Fixing L0_io * Add Python-based backends CI (triton-inference-server#6466) * Bumped vllm version * Add python-bsed backends testing * Add python-based backends CI * Fix errors * Add vllm backend * Fix pre-commit * Modify test.sh * Remove vllm_opt qa model * Remove vLLM ackend tests * Resolve review comments * Fix pre-commit errors * Update qa/L0_backend_python/python_based_backends/python_based_backends_test.py Co-authored-by: Tanmay Verma <[email protected]> * Remove collect_artifacts_from_subdir function call --------- Co-authored-by: oandreeva-nv <[email protected]> Co-authored-by: Tanmay Verma <[email protected]> * Enabling option to restrict access to HTTP APIs based on header value pairs (similar to gRPC) * Upgrade DCGM from 2.4.7 to 3.2.6 (triton-inference-server#6515) * Enhance GCS credentials documentations (triton-inference-server#6526) * Test file override outside of model directory (triton-inference-server#6516) * Add boost-filesystem * Update ORT version to 1.16.2 (triton-inference-server#6531) * Adjusting expected error msg (triton-inference-server#6517) * Update 'main' to track development of 2.41.0 / 23.12 (triton-inference-server#6543) * Enhance testing for pending request count (triton-inference-server#6532) * Enhance testing for pending request count * Improve the documentation * Add more documentation * Add testing for Python backend request rescheduling (triton-inference-server#6509) * Add testing * Fix up * Enhance testing * Fix up * 
Revert test changes * Add grpc endpoint test * Remove unused import * Remove unused import * Update qa/L0_backend_python/request_rescheduling/grpc_endpoint_test.py Co-authored-by: Iman Tabrizian <[email protected]> * Update qa/python_models/bls_request_rescheduling/model.py Co-authored-by: Iman Tabrizian <[email protected]> --------- Co-authored-by: Iman Tabrizian <[email protected]> * Check that the wget is installed (triton-inference-server#6556) * secure deployment considerations guide (triton-inference-server#6533) * draft document * updates * updates * updated * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * update * updates * updates * Update docs/customization_guide/deploy.md Co-authored-by: Kyle McGill <[email protected]> * Update docs/customization_guide/deploy.md Co-authored-by: Kyle McGill <[email protected]> * fixing typos * updated with clearer warnings * updates to readme and toc --------- Co-authored-by: Kyle McGill <[email protected]> * Fix typo and change the command line order (triton-inference-server#6557) * Fix typo and change the command line order * Improve visual experience. 
Add 'clang' package * Add error during rescheduling test to L0_generative_sequence (triton-inference-server#6550) * changing references to concrete instances * Add testing for implicit state enhancements (triton-inference-server#6524) * Add testing for single buffer * Add testing for implicit state with buffer growth * Improve testing * Fix up * Add CUDA virtual address size flag * Add missing test files * Parameter rename * Test fixes * Only build implicit state backend for GPU=ON * Fix copyright (triton-inference-server#6584) * Mention TRT LLM backend supports request cancellation (triton-inference-server#6585) * update model repository generation for onnx models for protobuf (triton-inference-server#6575) * Fix L0_sagemaker (triton-inference-server#6587) * Add C++ server wrapper to the doc (triton-inference-server#6592) * Add timeout to client apis and tests (triton-inference-server#6546) Client PR: triton-inference-server/client#429 * Change name generative -> iterative (triton-inference-server#6601) * name changes * updated names * Add documentation on generative sequence (triton-inference-server#6595) * Add documentation on generative sequence * Address comment * Reflect the "iterative" change * Updated description of iterative sequences * Restricted HTTP API documentation Co-authored-by: Ryan McCormick <[email protected]> * Add request cancellation and debugging guide to generated docs (triton-inference-server#6617) * Support for http request cancellation. Includes fix for seg fault in generate_stream endpoint. 
* Bumped vLLM version to v0.2.2 (triton-inference-server#6623) * Upgrade ORT version (triton-inference-server#6618) * Use compliant preprocessor (triton-inference-server#6626) * Update README.md (triton-inference-server#6627) * Extend request objects lifetime and fixes possible segmentation fault (triton-inference-server#6620) * Extend request objects lifetime * Remove explicit TRITONSERVER_InferenceRequestDelete * Format fix * Include the inference_request_ initialization to cover RequestNew --------- Co-authored-by: Neelay Shah <[email protected]> * Update protobuf after python update for testing (triton-inference-server#6638) This fixes the issue where python client has `AttributeError: 'NoneType' object has no attribute 'enum_types_by_name' errors after python version is updated. * Update post-23.11 release (triton-inference-server#6653) * Update README and versions for 2.40.0 / 23.11 (triton-inference-server#6544) * Removing path construction to use SymLink alternatives * Update version for PyTorch * Update windows Dockerfile configuration * Update triton version to 23.11 * Update README and versions for 2.40.0 / 23.11 * Fix typo * Ading 'ldconfig' to configure dynamic linking in container (triton-inference-server#6602) * Point to tekit_backend (triton-inference-server#6616) * Point to tekit_backend * Update version * Revert tekit changes (triton-inference-server#6640) --------- Co-authored-by: Kris Hung <[email protected]> * PYBE Timeout Tests (triton-inference-server#6483) * New testing to confirm large request timeout values can be passed and retrieved within Python BLS models. 
* Add note on lack of ensemble support (triton-inference-server#6648) * Added request id to span attributes (triton-inference-server#6667) * Add test for optional internal tensor within an ensemble (triton-inference-server#6663) * Add test for optional internal tensor within an ensemble * Fix up * Set CMake version to 3.27.7 (triton-inference-server#6675) * Set CMake version to 3.27.7 * Set CMake version to 3.27.7 * Fix double slash typo * restore typo (triton-inference-server#6680) * Update 'main' to track development of 2.42.0 / 24.01 (triton-inference-server#6673) * iGPU build refactor (triton-inference-server#6684) (triton-inference-server#6691) * Mlflow Plugin Fix (triton-inference-server#6685) * Mlflow plugin fix * Fix extra content-type headers in HTTP server (triton-inference-server#6678) * Fix iGPU CMakeFile tags (triton-inference-server#6695) * Unify iGPU test build with x86 ARM * adding TRITON_IGPU_BUILD to core build definition; adding logic to skip caffe2plan test if TRITON_IGPU_BUILD=1 * re-organizing some copies in Dockerfile.QA to fix igpu devel build * Pre-commit fix --------- Co-authored-by: kyle <[email protected]> * adding default value for TRITON_IGPU_BUILD=OFF (triton-inference-server#6705) * adding default value for TRITON_IGPU_BUILD=OFF * fix newline --------- Co-authored-by: kyle <[email protected]> * Add test case for decoupled model raising exception (triton-inference-server#6686) * Add test case for decoupled model raising exception * Remove unused import * Address comment * Escape special characters in general docs (triton-inference-server#6697) * vLLM Benchmarking Test (triton-inference-server#6631) * vLLM Benchmarking Test * Allow configuring GRPC max connection age and max connection age grace (triton-inference-server#6639) * Add ability to configure GRPC max connection age and max connection age grace * Allow pass GRPC connection age args when they are set from command ---------- Co-authored-by: Katherine Yang <[email protected]> 
--------- Signed-off-by: Xiaodong Ye <[email protected]> Co-authored-by: Olga Andreeva <[email protected]> Co-authored-by: GuanLuo <[email protected]> Co-authored-by: Neelay Shah <[email protected]> Co-authored-by: Tanmay Verma <[email protected]> Co-authored-by: Kris Hung <[email protected]> Co-authored-by: Jacky <[email protected]> Co-authored-by: Ryan McCormick <[email protected]> Co-authored-by: dyastremsky <[email protected]> Co-authored-by: Katherine Yang <[email protected]> Co-authored-by: Iman Tabrizian <[email protected]> Co-authored-by: Gerard Casas Saez <[email protected]> Co-authored-by: Misha Chornyi <[email protected]> Co-authored-by: R0CKSTAR <[email protected]> Co-authored-by: Elias Bermudez <[email protected]> Co-authored-by: ax-vivien <[email protected]> Co-authored-by: Neelay Shah <[email protected]> Co-authored-by: nv-kmcgill53 <[email protected]> Co-authored-by: Matthew Kotila <[email protected]> Co-authored-by: Nikhil Kulkarni <[email protected]> Co-authored-by: Misha Chornyi <[email protected]> Co-authored-by: Iman Tabrizian <[email protected]> Co-authored-by: David Yastremsky <[email protected]> Co-authored-by: Timothy Gerdes <[email protected]> Co-authored-by: Mate Mijolović <[email protected]> Co-authored-by: David Zier <[email protected]> Co-authored-by: Hyunjae Woo <[email protected]> Co-authored-by: Tanay Varshney <[email protected]> Co-authored-by: Francesco Petrini <[email protected]> Co-authored-by: Dmitry Mironov <[email protected]> Co-authored-by: Ryan McCormick <[email protected]> Co-authored-by: Sai Kiran Polisetty <[email protected]> Co-authored-by: oandreeva-nv <[email protected]> Co-authored-by: kyle <[email protected]> Co-authored-by: Neal Vaidya <[email protected]> Co-authored-by: siweili11 <[email protected]>

Improve performance tuning guide

8191b9d

Tabrizian requested a review from rmccorm4 July 5, 2023 21:22

debermudez approved these changes Jul 5, 2023

Tabrizian merged commit 20d6bb2 into main Jul 6, 2023

Tabrizian deleted the imant-docs branch July 6, 2023 16:53
