
OpenVINO integration improvements (#60)
* more tuning options

* fix

* async inference to increase throughput

* drop unused OV building parameter and switch to TBB threading lib

* add examples in the documentation
dtrawins authored Oct 16, 2023
1 parent ac8cdda commit 71ffc44
Showing 3 changed files with 98 additions and 55 deletions.
98 changes: 72 additions & 26 deletions README.md
@@ -77,40 +77,19 @@ but the listed CMake argument can be used to override.

Configuration of OpenVINO for a model is done through the Parameters section of the model's 'config.pbtxt' file. The parameters and their description are as follows.

* `PERFORMANCE_HINT`: Presets performance tuning options. Accepted values are `LATENCY` for low-concurrency use cases and `THROUGHPUT` for high-concurrency scenarios.
* `CPU_EXTENSION_PATH`: Required for CPU custom layers. Absolute path to a shared library with the kernel implementations.
* `INFERENCE_NUM_THREADS`: Maximum number of threads that can be used for inference tasks. Should be a non-negative number. The default is equal to the number of cores.
* `COMPILATION_NUM_THREADS`: Maximum number of threads that can be used for compilation tasks. Should be a non-negative number.
* `HINT_BF16`: Hint for the device to use bfloat16 precision for inference. The possible value is `YES`.
* `NUM_STREAMS`: The number of executor logical partitions. Set the value to `AUTO` to create the bare minimum of streams needed to improve performance, or set the value to `NUMA` to create as many streams as needed to accommodate NUMA and avoid the associated penalties. Set a numerical value to specify an explicit number of streams.
* `SKIP_OV_DYNAMIC_BATCHSIZE`: The topology of some models does not support OpenVINO dynamic batch sizes. Set the value of this parameter to `YES` in order
to skip the dynamic batch sizes in the backend.
* `ENABLE_BATCH_PADDING`: By default, an error will be generated if the backend receives a request with a batch size smaller than the max_batch_size specified in the configuration. This error can be avoided, at a cost of performance, by setting the `ENABLE_BATCH_PADDING` parameter to `YES`.
* `RESHAPE_IO_LAYERS`: By setting this parameter to `YES`, the IO layers are reshaped to the dimensions provided in
the model configuration. By default, the dimensions in the model are used.

The section of the model config file specifying these parameters will look like:

```
.
.
.
parameters: {
key: "NUM_STREAMS"
value: {
string_value:"NUMA"
}
}
parameters: {
key: "INFERENCE_NUM_THREADS"
value: {
string_value:"5"
}
}
.
.
.
```
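
Other parameters from the list above are set in the same way. As a further illustrative sketch (the values below are hypothetical, not a recommendation), a model whose topology does not support OpenVINO dynamic batch sizes, and which should also accept requests smaller than max_batch_size, could combine `SKIP_OV_DYNAMIC_BATCHSIZE` with `ENABLE_BATCH_PADDING`:

```
parameters: [
  {
    key: "SKIP_OV_DYNAMIC_BATCHSIZE"
    value: {
      string_value: "YES"
    }
  },
  {
    key: "ENABLE_BATCH_PADDING"
    value: {
      string_value: "YES"
    }
  }
]
```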

## Auto-Complete Model Configuration

@@ -157,10 +136,77 @@ and
[`sequence_batching`](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#sequence-batcher)
is provided, then `dynamic_batching` will be enabled with default settings.
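
As an illustrative sketch (assuming a hypothetical model with a maximum batch size of 4 and no scheduler specified), the auto-completed configuration behaves roughly as if an empty `dynamic_batching` block had been written explicitly:

```
max_batch_size: 4
# Auto-complete behaves approximately as if the following had been added:
dynamic_batching { }
```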


### Examples of "config.pbtxt" files depending on the use case

Latency mode with low concurrency on the client side. Recommended for performance optimization with a low number of parallel clients.
```
parameters: [
{
key: "NUM_STREAMS"
value: {
string_value: "1"
}
},
{
key: "PERFORMANCE_HINT"
value: {
string_value: "LATENCY"
}
}
]
```

Throughput mode with high concurrency on the client side. Recommended for throughput optimization with a high number of parallel clients.
The number of streams should be less than or equal to the number of parallel clients and less than or equal to the number of CPU cores.
For example, with ~20 clients on a host with 12 CPU cores, the config could look like:
```
instance_group [
{
count: 12
kind: KIND_CPU
}
]
parameters: [
{
key: "NUM_STREAMS"
value: {
string_value: "12"
}
}
]
```
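
Alternatively, instead of pinning an explicit `NUM_STREAMS` value, the `PERFORMANCE_HINT` parameter described above can be set to `THROUGHPUT` to let OpenVINO choose its own tuning for high-concurrency scenarios. A minimal illustrative sketch:

```
parameters: [
  {
    key: "PERFORMANCE_HINT"
    value: {
      string_value: "THROUGHPUT"
    }
  }
]
```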

When loading a model that is not in the default Intermediate Representation format with the name model.xml, use the extra parameter "default_model_filename".
For example, for the TensorFlow saved_model format use:
```
default_model_filename: "model.saved_model"
parameters: [
{
key: "PERFORMANCE_HINT"
value: {
string_value: "LATENCY"
}
}
]
```
and copy the model to a subfolder called "model.saved_model":
```
model_repository/
└── model
├── 1
│   └── model.saved_model
│   ├── saved_model.pb
│   └── variables
└── config.pbtxt
```


## Known Issues

* Not all models support dynamic batch sizes.
* Models with dynamic shapes are not currently supported in this backend.

* As of now, the OpenVINO backend does not support variable-shaped tensors. However, dynamic batch sizes in the model are supported. See the `SKIP_OV_DYNAMIC_BATCHSIZE` and `ENABLE_BATCH_PADDING` parameters for more details.

* OpenVINO does not support CPU execution for FP16.
* Models with a scalar on the input (a shape without any dimension) are not supported.
28 changes: 24 additions & 4 deletions src/openvino.cc
@@ -265,6 +265,7 @@ ModelState::ParseParameters(const std::string& device)
ParseParameter("COMPILATION_NUM_THREADS", params, &device_config));
RETURN_IF_ERROR(ParseParameter("HINT_BF16", params, &device_config));
RETURN_IF_ERROR(ParseParameter("NUM_STREAMS", params, &device_config));
RETURN_IF_ERROR(ParseParameter("PERFORMANCE_HINT", params, &device_config));
}
}

@@ -368,14 +369,32 @@ ModelState::ParseParameterHelper(
*ov_property = ov::streams::num(ov::streams::AUTO);
} else if (value->compare("numa") == 0) {
*ov_property = ov::streams::num(ov::streams::NUMA);
} else if (IsNumber(*value)) {
*ov_property = ov::streams::num(std::stoi(*value));
} else {
return TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_INVALID_ARG,
(std::string("expected the parameter '") + mkey +
"' to be either AUTO/NUMA, got " + *value)
"' to be either AUTO/NUMA/<int_value>, got " + *value)
.c_str());
}
} else if (mkey.compare("PERFORMANCE_HINT") == 0) {
if (value->compare("latency") == 0) {
*ov_property = ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY);
} else if (value->compare("throughput") == 0) {
*ov_property = ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT);
} else if (value->compare("cumulative_throughput") == 0) {
*ov_property = ov::hint::performance_mode(ov::hint::PerformanceMode::CUMULATIVE_THROUGHPUT);
} else {
return TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_INVALID_ARG,
(std::string("expected the parameter '") + mkey +
"' to be LATENCY/THROUGHPUT/CUMULATIVE_THROUGHPUT, got " + *value)
.c_str());
}
} else {
return TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_INVALID_ARG,
(std::string("the parameter '") + mkey +
@@ -1172,7 +1191,8 @@ ModelInstanceState::Infer(
std::vector<TRITONBACKEND_Response*>* responses,
const uint32_t response_count)
{
RETURN_IF_OPENVINO_ERROR(infer_request_.start_async(), "running inference");
infer_request_.wait();

return nullptr;
}
27 changes: 2 additions & 25 deletions tools/gen_openvino_dockerfile.py
@@ -92,16 +92,8 @@ def dockerfile_for_linux(output_file):
RUN /bin/bash -c 'cmake \
-DCMAKE_BUILD_TYPE=${OPENVINO_BUILD_TYPE} \
-DCMAKE_INSTALL_PREFIX=/workspace/install \
-DENABLE_VPU=OFF \
-DENABLE_CLDNN=OFF \
-DTHREADING=OMP \
-DENABLE_GNA=OFF \
-DENABLE_DLIA=OFF \
-DENABLE_TESTS=OFF \
-DENABLE_INTEL_MYRIAD=OFF \
-DENABLE_VALIDATION_SET=OFF \
-DNGRAPH_ONNX_IMPORT_ENABLE=OFF \
-DNGRAPH_DEPRECATED_ENABLE=FALSE \
.. && \
make -j$(nproc) install'
@@ -111,19 +103,8 @@ def dockerfile_for_linux(output_file):
cp -r /workspace/install/runtime/include/ngraph include/. && \
cp -r /workspace/install/runtime/include/openvino include/.
RUN mkdir -p lib && \
cp /workspace/install/runtime/lib/intel64/libiomp5.so lib/. && \
cp /workspace/install/runtime/lib/intel64/libopenvino.so.${OPENVINO_VERSION} lib/. && \
cp /workspace/install/runtime/lib/intel64/libopenvino_c.so.${OPENVINO_VERSION} lib/. && \
cp /workspace/install/runtime/lib/intel64/libopenvino_intel_cpu_plugin.so lib/. && \
cp /workspace/install/runtime/lib/intel64/libopenvino_ir_frontend.so.${OPENVINO_VERSION} lib/.
RUN OV_SHORT_VERSION=`echo ${OPENVINO_VERSION} | awk '{ split($0,a,"."); print substr(a[1],3) a[2] a[3] }'` && \
(cd lib && \
ln -s libopenvino.so.${OPENVINO_VERSION} libopenvino.so.${OV_SHORT_VERSION} && \
ln -s libopenvino.so.${OPENVINO_VERSION} libopenvino.so && \
ln -s libopenvino_c.so.${OPENVINO_VERSION} libopenvino_c.so.${OV_SHORT_VERSION} && \
ln -s libopenvino_c.so.${OPENVINO_VERSION} libopenvino_c.so && \
ln -s libopenvino_ir_frontend.so.${OPENVINO_VERSION} libopenvino_ir_frontend.so.${OV_SHORT_VERSION} && \
ln -s libopenvino_ir_frontend.so.${OPENVINO_VERSION} libopenvino_ir_frontend.so)
cp -P /usr/lib/x86_64-linux-gnu/libtbb.so* lib/. && \
cp -P /workspace/install/runtime/lib/intel64/libopenvino*.so* lib/. \
"""

df += """
@@ -165,11 +146,7 @@ def dockerfile_for_windows(output_file):
ARG CMAKE_BAT="cmake \
-DCMAKE_BUILD_TYPE=%OPENVINO_BUILD_TYPE% \
-DCMAKE_INSTALL_PREFIX=C:/workspace/install \
-DENABLE_CLDNN=OFF \
-DENABLE_TESTS=OFF \
-DENABLE_VALIDATION_SET=OFF \
-DNGRAPH_ONNX_IMPORT_ENABLE=OFF \
-DNGRAPH_DEPRECATED_ENABLE=FALSE \
.."
ARG CMAKE_BUILD_BAT="cmake --build . --config %OPENVINO_BUILD_TYPE% --target install --verbose -j8"
RUN powershell Set-Content 'build.bat' -value '%VS_DEVCMD_BAT%','%CMAKE_BAT%','%CMAKE_BUILD_BAT%'
