
OpenVINO integration improvements (#60)
* more tuning options

* fix

* async inference to increase throughput

* drop unused OV building parameter and switch to TBB threading lib

* add examples in the documentation
dtrawins authored Oct 16, 2023
1 parent ac8cdda commit 71ffc44
Showing 3 changed files with 98 additions and 55 deletions.
98 changes: 72 additions & 26 deletions README.md
@@ -77,40 +77,19 @@ but the listed CMake argument can be used to override.

Configuration of OpenVINO for a model is done through the Parameters section of the model's 'config.pbtxt' file. The parameters and their description are as follows.

* `PERFORMANCE_HINT`: Presets performance tuning options. Accepted values are `LATENCY` for low-concurrency use cases and `THROUGHPUT` for high-concurrency scenarios.
* `CPU_EXTENSION_PATH`: Required for CPU custom layers. Absolute path to a shared library with the kernel implementations.
* `INFERENCE_NUM_THREADS`: Maximum number of threads that can be used for inference tasks. Should be a non-negative number. The default is equal to the number of cores.
* `COMPILATION_NUM_THREADS`: Maximum number of threads that can be used for compilation tasks. Should be a non-negative number.
* `HINT_BF16`: Hint for the device to use bfloat16 precision for inference. The possible value is `YES`.
* `NUM_STREAMS`: The number of executor logical partitions. Set the value to `AUTO` to create the bare minimum of streams needed to improve performance, or set the value to `NUMA` to create as many streams as needed to accommodate NUMA and avoid the associated penalties. Set a numerical value to specify an explicit number of streams.
* `SKIP_OV_DYNAMIC_BATCHSIZE`: The topology of some models does not support OpenVINO dynamic batch sizes. Set the value of this parameter to `YES` in order
to skip the dynamic batch sizes in the backend.
* `ENABLE_BATCH_PADDING`: By default, an error will be generated if the backend receives a request with a batch size smaller than the max_batch_size specified in the configuration. This error can be avoided, at a cost of performance, by setting the `ENABLE_BATCH_PADDING` parameter to `YES`.
* `RESHAPE_IO_LAYERS`: By setting this parameter to `YES`, the IO layers are reshaped to the dimensions provided in
the model configuration. By default, the dimensions in the model are used.

The section of the model config file specifying these parameters will look like:

```
.
.
.
parameters: {
key: "NUM_STREAMS"
value: {
string_value:"NUMA"
}
}
parameters: {
key: "INFERENCE_NUM_THREADS"
value: {
string_value:"5"
}
}
.
.
.
```
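
Other parameters from the list above are set in the same way. As a further illustrative sketch (the values below are hypothetical, not a recommendation), a model whose topology does not support OpenVINO dynamic batch sizes, and which should also accept requests smaller than max_batch_size, could combine `SKIP_OV_DYNAMIC_BATCHSIZE` with `ENABLE_BATCH_PADDING`:

```
parameters: [
  {
    key: "SKIP_OV_DYNAMIC_BATCHSIZE"
    value: {
      string_value: "YES"
    }
  },
  {
    key: "ENABLE_BATCH_PADDING"
    value: {
      string_value: "YES"
    }
  }
]
```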

## Auto-Complete Model Configuration

@@ -157,10 +136,77 @@ and
[`sequence_batching`](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#sequence-batcher)
is provided, then `dynamic_batching` will be enabled with default settings.
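
As an illustrative sketch (assuming a hypothetical model with a maximum batch size of 4 and no scheduler specified), the auto-completed configuration behaves roughly as if an empty `dynamic_batching` block had been written explicitly:

```
max_batch_size: 4
# Auto-complete behaves approximately as if the following had been added:
dynamic_batching { }
```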


### Examples of "config.pbtxt" files depending on the use case

Latency mode with low concurrency on the client side. Recommended for performance optimization with a low number of parallel clients.
```
parameters: [
{
key: "NUM_STREAMS"
value: {
string_value: "1"
}
},
{
key: "PERFORMANCE_HINT"
value: {
string_value: "LATENCY"
}
}
]
```

Throughput mode with high concurrency on the client side. Recommended for throughput optimization with a high number of parallel clients.
The number of streams should be less than or equal to the number of parallel clients and less than or equal to the number of CPU cores.
For example, with ~20 clients on a host with 12 CPU cores, the config could look like:
```
instance_group [
{
count: 12
kind: KIND_CPU
}
]
parameters: [
{
key: "NUM_STREAMS"
value: {
string_value: "12"
}
}
]
```
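
Alternatively, instead of pinning an explicit `NUM_STREAMS` value, the `PERFORMANCE_HINT` parameter described above can be set to `THROUGHPUT` to let OpenVINO choose its own tuning for high-concurrency scenarios. A minimal illustrative sketch:

```
parameters: [
  {
    key: "PERFORMANCE_HINT"
    value: {
      string_value: "THROUGHPUT"
    }
  }
]
```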

When loading a model that is not in the default Intermediate Representation format with the name model.xml, use the extra parameter "default_model_filename".
For example, for the TensorFlow saved_model format use:
```
default_model_filename: "model.saved_model"
parameters: [
{
key: "PERFORMANCE_HINT"
value: {
string_value: "LATENCY"
}
}
]
```
and copy the model to a subfolder called "model.saved_model":
```
model_repository/
└── model
├── 1
│   └── model.saved_model
│   ├── saved_model.pb
│   └── variables
└── config.pbtxt
```


## Known Issues

* Not all models support dynamic batch sizes.
* Models with dynamic shapes are not currently supported in this backend.

* As of now, the OpenVINO backend does not support variable-shaped tensors. However, dynamic batch sizes in the model are supported. See the `SKIP_OV_DYNAMIC_BATCHSIZE` and `ENABLE_BATCH_PADDING` parameters for more details.

* OpenVINO does not support CPU execution for FP16.
* Models with a scalar on the input (a shape without any dimension) are not supported.
28 changes: 24 additions & 4 deletions src/openvino.cc
@@ -265,6 +265,7 @@ ModelState::ParseParameters(const std::string& device)
ParseParameter("COMPILATION_NUM_THREADS", params, &device_config));
RETURN_IF_ERROR(ParseParameter("HINT_BF16", params, &device_config));
RETURN_IF_ERROR(ParseParameter("NUM_STREAMS", params, &device_config));
RETURN_IF_ERROR(ParseParameter("PERFORMANCE_HINT", params, &device_config));
}
}

@@ -368,14 +369,32 @@ ModelState::ParseParameterHelper(
*ov_property = ov::streams::num(ov::streams::AUTO);
} else if (value->compare("numa") == 0) {
*ov_property = ov::streams::num(ov::streams::NUMA);
} else if (IsNumber(*value)) {
*ov_property = ov::streams::num(std::stoi(*value));
} else {
return TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_INVALID_ARG,
(std::string("expected the parameter '") + mkey +
"' to be either AUTO/NUMA, got " + *value)
"' to be either AUTO/NUMA/<int_value>, got " + *value)
.c_str());
}
} else if (mkey.compare("PERFORMANCE_HINT") == 0) {
if (value->compare("latency") == 0) {
*ov_property = ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY);
} else if (value->compare("throughput") == 0) {
*ov_property = ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT);
} else if (value->compare("cumulative_throughput") == 0) {
*ov_property = ov::hint::performance_mode(ov::hint::PerformanceMode::CUMULATIVE_THROUGHPUT);
} else {
return TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_INVALID_ARG,
(std::string("expected the parameter '") + mkey +
"' to be LATENCY/THROUGHPUT/CUMULATIVE_THROUGHPUT, got " + *value)
.c_str());
}
} else {
return TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_INVALID_ARG,
(std::string("the parameter '") + mkey +
@@ -1172,7 +1191,8 @@ ModelInstanceState::Infer(
std::vector<TRITONBACKEND_Response*>* responses,
const uint32_t response_count)
{
RETURN_IF_OPENVINO_ERROR(infer_request_.start_async(), "running inference");
infer_request_.wait();

return nullptr;
}
27 changes: 2 additions & 25 deletions tools/gen_openvino_dockerfile.py
@@ -92,16 +92,8 @@ def dockerfile_for_linux(output_file):
RUN /bin/bash -c 'cmake \
-DCMAKE_BUILD_TYPE=${OPENVINO_BUILD_TYPE} \
-DCMAKE_INSTALL_PREFIX=/workspace/install \
-DENABLE_VPU=OFF \
-DENABLE_CLDNN=OFF \
-DTHREADING=OMP \
-DENABLE_GNA=OFF \
-DENABLE_DLIA=OFF \
-DENABLE_TESTS=OFF \
-DENABLE_INTEL_MYRIAD=OFF \
-DENABLE_VALIDATION_SET=OFF \
-DNGRAPH_ONNX_IMPORT_ENABLE=OFF \
-DNGRAPH_DEPRECATED_ENABLE=FALSE \
.. && \
make -j$(nproc) install'
@@ -111,19 +103,8 @@ def dockerfile_for_linux(output_file):
cp -r /workspace/install/runtime/include/ngraph include/. && \
cp -r /workspace/install/runtime/include/openvino include/.
RUN mkdir -p lib && \
cp /workspace/install/runtime/lib/intel64/libiomp5.so lib/. && \
cp /workspace/install/runtime/lib/intel64/libopenvino.so.${OPENVINO_VERSION} lib/. && \
cp /workspace/install/runtime/lib/intel64/libopenvino_c.so.${OPENVINO_VERSION} lib/. && \
cp /workspace/install/runtime/lib/intel64/libopenvino_intel_cpu_plugin.so lib/. && \
cp /workspace/install/runtime/lib/intel64/libopenvino_ir_frontend.so.${OPENVINO_VERSION} lib/.
RUN OV_SHORT_VERSION=`echo ${OPENVINO_VERSION} | awk '{ split($0,a,"."); print substr(a[1],3) a[2] a[3] }'` && \
(cd lib && \
ln -s libopenvino.so.${OPENVINO_VERSION} libopenvino.so.${OV_SHORT_VERSION} && \
ln -s libopenvino.so.${OPENVINO_VERSION} libopenvino.so && \
ln -s libopenvino_c.so.${OPENVINO_VERSION} libopenvino_c.so.${OV_SHORT_VERSION} && \
ln -s libopenvino_c.so.${OPENVINO_VERSION} libopenvino_c.so && \
ln -s libopenvino_ir_frontend.so.${OPENVINO_VERSION} libopenvino_ir_frontend.so.${OV_SHORT_VERSION} && \
ln -s libopenvino_ir_frontend.so.${OPENVINO_VERSION} libopenvino_ir_frontend.so)
cp -P /usr/lib/x86_64-linux-gnu/libtbb.so* lib/. && \
cp -P /workspace/install/runtime/lib/intel64/libopenvino*.so* lib/. \
"""

df += """
@@ -165,11 +146,7 @@ def dockerfile_for_windows(output_file):
ARG CMAKE_BAT="cmake \
-DCMAKE_BUILD_TYPE=%OPENVINO_BUILD_TYPE% \
-DCMAKE_INSTALL_PREFIX=C:/workspace/install \
-DENABLE_CLDNN=OFF \
-DENABLE_TESTS=OFF \
-DENABLE_VALIDATION_SET=OFF \
-DNGRAPH_ONNX_IMPORT_ENABLE=OFF \
-DNGRAPH_DEPRECATED_ENABLE=FALSE \
.."
ARG CMAKE_BUILD_BAT="cmake --build . --config %OPENVINO_BUILD_TYPE% --target install --verbose -j8"
RUN powershell Set-Content 'build.bat' -value '%VS_DEVCMD_BAT%','%CMAKE_BAT%','%CMAKE_BUILD_BAT%'
