add support for TARGET_DEVICE parameter #87

Open · wants to merge 16 commits into `main`
1 change: 0 additions & 1 deletion CMakeLists.txt
@@ -210,7 +210,6 @@ else()
COMMAND rm -fr openvino
COMMAND docker cp openvino_backend_ov:/opt/openvino openvino
COMMAND docker rm openvino_backend_ov
COMMAND echo '<ie><plugins><plugin name=\"CPU\" location=\"libopenvino_intel_cpu_plugin.so\"></plugin></plugins></ie>' >> openvino/lib/plugins.xml
COMMENT "Building OpenVino"
)
endif() # WIN32
16 changes: 16 additions & 0 deletions Dockerfile.drivers
@@ -0,0 +1,16 @@
ARG BASE_IMAGE=tritonserver:latest
FROM $BASE_IMAGE
RUN mkdir /tmp/neo && cd /tmp/neo && \
apt-get update && apt-get install -y libtbb12 curl && \
curl -L -O https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.17791.9/intel-igc-core_1.0.17791.9_amd64.deb && \
curl -L -O https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.17791.9/intel-igc-opencl_1.0.17791.9_amd64.deb && \
curl -L -O https://github.com/intel/compute-runtime/releases/download/24.39.31294.12/intel-level-zero-gpu_1.6.31294.12_amd64.deb && \
curl -L -O https://github.com/intel/compute-runtime/releases/download/24.39.31294.12/intel-opencl-icd_24.39.31294.12_amd64.deb && \
curl -L -O https://github.com/intel/compute-runtime/releases/download/24.39.31294.12/libigdgmm12_22.5.2_amd64.deb && \
curl -L -O https://github.com/oneapi-src/level-zero/releases/download/v1.17.44/level-zero_1.17.44+u24.04_amd64.deb && \
curl -L -O https://github.com/intel/linux-npu-driver/releases/download/v1.10.0/intel-driver-compiler-npu_1.10.0.20241107-11729849322_ubuntu24.04_amd64.deb && \
curl -L -O https://github.com/intel/linux-npu-driver/releases/download/v1.10.0/intel-fw-npu_1.10.0.20241107-11729849322_ubuntu24.04_amd64.deb && \
curl -L -O https://github.com/intel/linux-npu-driver/releases/download/v1.10.0/intel-level-zero-npu_1.10.0.20241107-11729849322_ubuntu24.04_amd64.deb && \
dpkg -i *.deb && \
apt-get install -y ocl-icd-libopencl1 --no-install-recommends && \
rm -rf /var/lib/apt/lists/* && rm -Rf /tmp/neo
74 changes: 73 additions & 1 deletion README.md
@@ -62,6 +62,7 @@ $ cd build
$ cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_BUILD_OPENVINO_VERSION=2024.4.0 -DTRITON_BUILD_CONTAINER_VERSION=24.03 ..
$ make install
```
The compiled backend will be placed in the `build/install/backends/openvino` folder.
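As a quick sanity check (a sketch only; the mount paths and base image tag are assumptions, not part of this PR), the freshly built backend can be mounted into a stock Triton container:

```
# Run from the repository root. Mount the locally built backend and a model
# repository into a stock Triton image; adjust the tag and paths as needed.
docker run --rm -it \
  -v $(pwd)/build/install/backends/openvino:/opt/tritonserver/backends/openvino \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:24.03-py3 \
  tritonserver --model-repository=/models
```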

The following required Triton repositories will be pulled and used in
the build. By default the "main" branch/tag will be used for each repo
@@ -71,6 +72,27 @@ but the listed CMake argument can be used to override.
* triton-inference-server/core: -DTRITON_CORE_REPO_TAG=[tag]
* triton-inference-server/common: -DTRITON_COMMON_REPO_TAG=[tag]

## Build a complete Triton custom image with the OpenVINO backend

```
git clone https://github.com/triton-inference-server/server
cd server
pip install distro requests
python3 build.py --target-platform linux --enable-logging --enable-stats --enable-metrics --enable-cpu-metrics --endpoint grpc --endpoint http --filesystem s3 \
--backend openvino:pull/87/head
```
In the backend value, the pull request reference is optional; use `--backend openvino` to build from the `main` branch. The build creates an image called `tritonserver:latest`.
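A quick way to confirm the backend landed in the image (an illustrative check, assuming the default backends directory):

```
docker run --rm tritonserver:latest ls /opt/tritonserver/backends/openvino
```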

## Add Intel GPU and NPU dependencies to the image

The `Dockerfile.drivers` image adds the OpenVINO runtime drivers needed to run inference on the accelerators. As the base image, use a public image with the OpenVINO backend or the custom one built above.

```
docker build -f Dockerfile.drivers --build-arg BASE_IMAGE=tritonserver:latest -t tritonserver:xpu .
```


## Using the OpenVINO Backend

### Parameters
@@ -88,6 +110,7 @@ to skip the dynamic batch sizes in backend.
* `ENABLE_BATCH_PADDING`: By default an error will be generated if the backend receives a request with a batch size less than the max_batch_size specified in the configuration. This error can be avoided at a cost of performance by specifying the `ENABLE_BATCH_PADDING` parameter as `YES`.
* `RESHAPE_IO_LAYERS`: By setting this parameter to `YES`, the IO layers are reshaped to the dimensions provided in the
model configuration. By default, the dimensions in the model are used.
* `TARGET_DEVICE`: Selects the OpenVINO device for running inference. It can be `CPU` (default), `GPU`, `NPU`, or one of the virtual devices such as `AUTO`, `MULTI`, or `HETERO` (see the example below).
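For example, letting OpenVINO pick the device automatically could look like this (an illustrative snippet, not a configuration shipped with this PR):

```
parameters: [
{
   key: "TARGET_DEVICE"
   value: {
     string_value: "AUTO"
   }
}
]
```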



@@ -231,8 +254,57 @@ string_value:"yes"
}
}
```
### Running the models on Intel GPU

Add a `TARGET_DEVICE` parameter to your `config.pbtxt`:
```
parameters: [
{
key: "NUM_STREAMS"
value: {
string_value: "1"
}
},
{
key: "PERFORMANCE_HINT"
value: {
string_value: "THROUGHPUT"
}
},
{
key: "TARGET_DEVICE"
value: {
string_value: "GPU"
}
}
]
```

Start the container with an extra parameter to pass the device `/dev/dri`:
```
docker run -it --rm --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1 ) tritonserver:xpu
```

### Running the models on Intel NPU

Add a `TARGET_DEVICE` parameter to your `config.pbtxt`:
```
parameters: [
{
key: "TARGET_DEVICE"
value: {
string_value: "NPU"
}
}
]
```

Start the container with an extra parameter to pass the device `/dev/accel`:
```
docker run -it --rm --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) tritonserver:xpu
```
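With this change the backend also logs the available OpenVINO devices at verbose level, which helps confirm that the GPU or NPU is visible inside the container. A possible invocation (the model repository path and device flags are placeholders to adapt):

```
docker run -it --rm --device /dev/dri --device /dev/accel \
  --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) \
  -v /path/to/model_repository:/models \
  tritonserver:xpu \
  tritonserver --model-repository=/models --log-verbose=1
```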

## Known Issues

* Models with a scalar input (a shape without any dimensions) are not supported.
* Reshaping using [dimension ranges](https://docs.openvino.ai/2023.3/ovms_docs_dynamic_shape_dynamic_model.html) is not supported.
* Reshaping using [dimension ranges](https://docs.openvino.ai/2024/openvino-workflow/running-inference/dynamic-shapes.html#dimension-bounds) is not supported.
97 changes: 64 additions & 33 deletions src/openvino.cc
@@ -84,6 +84,9 @@ class ModelState : public BackendModel {
TRITONSERVER_Error* ParseParameter(
const std::string& mkey, triton::common::TritonJson::Value& params,
std::vector<std::pair<std::string, ov::Any>>* device_config);
TRITONSERVER_Error* ParseStringParameter(
const std::string& mkey, triton::common::TritonJson::Value& params,
std::string* value);
TRITONSERVER_Error* ParseParameterHelper(
const std::string& mkey, std::string* value,
std::pair<std::string, ov::Any>* ov_property);
@@ -118,6 +121,7 @@ class ModelState : public BackendModel {

bool SkipDynamicBatchSize() { return skip_dynamic_batchsize_; }
bool EnableBatchPadding() { return enable_padding_; }
std::string TargetDevice() { return target_device_; }

private:
ModelState(TRITONBACKEND_Model* triton_model);
@@ -140,6 +144,7 @@
bool skip_dynamic_batchsize_;
bool enable_padding_;
bool reshape_io_layers_;
std::string target_device_;
};

TRITONSERVER_Error*
@@ -179,7 +184,7 @@ ModelState::Create(TRITONBACKEND_Model* triton_model, ModelState** state)
ModelState::ModelState(TRITONBACKEND_Model* triton_model)
: BackendModel(triton_model), model_read_(false),
skip_dynamic_batchsize_(false), enable_padding_(false),
reshape_io_layers_(false)
reshape_io_layers_(false), target_device_("CPU")
{
}

@@ -238,12 +243,11 @@ ModelState::ParseParameters()
bool status = model_config_.Find("parameters", &params);
if (status) {
RETURN_IF_ERROR(LoadCpuExtensions(params));
RETURN_IF_ERROR(ParseBoolParameter(
"SKIP_OV_DYNAMIC_BATCHSIZE", params, &skip_dynamic_batchsize_));
RETURN_IF_ERROR(
ParseBoolParameter("ENABLE_BATCH_PADDING", params, &enable_padding_));
RETURN_IF_ERROR(
ParseBoolParameter("RESHAPE_IO_LAYERS", params, &reshape_io_layers_));
ParseBoolParameter(
"SKIP_OV_DYNAMIC_BATCHSIZE", params, &skip_dynamic_batchsize_);
ParseBoolParameter("ENABLE_BATCH_PADDING", params, &enable_padding_);
ParseBoolParameter("RESHAPE_IO_LAYERS", params, &reshape_io_layers_);
ParseStringParameter("TARGET_DEVICE", params, &target_device_);
}

return nullptr;
@@ -256,18 +260,15 @@ ModelState::ParseParameters(const std::string& device)
triton::common::TritonJson::Value params;
bool status = model_config_.Find("parameters", &params);
if (status) {
if (device == "CPU") {
config_[device] = {};
auto& device_config = config_.at(device);
RETURN_IF_ERROR(
ParseParameter("INFERENCE_NUM_THREADS", params, &device_config));
RETURN_IF_ERROR(
ParseParameter("COMPILATION_NUM_THREADS", params, &device_config));
RETURN_IF_ERROR(ParseParameter("HINT_BF16", params, &device_config));
RETURN_IF_ERROR(ParseParameter("NUM_STREAMS", params, &device_config));
RETURN_IF_ERROR(
ParseParameter("PERFORMANCE_HINT", params, &device_config));
}
config_[device] = {};
auto& device_config = config_.at(device);
RETURN_IF_ERROR(
ParseParameter("INFERENCE_NUM_THREADS", params, &device_config));
RETURN_IF_ERROR(
ParseParameter("COMPILATION_NUM_THREADS", params, &device_config));
RETURN_IF_ERROR(ParseParameter("HINT_BF16", params, &device_config));
RETURN_IF_ERROR(ParseParameter("NUM_STREAMS", params, &device_config));
RETURN_IF_ERROR(ParseParameter("PERFORMANCE_HINT", params, &device_config));
}

return nullptr;
@@ -277,18 +278,16 @@ TRITONSERVER_Error*
ModelState::LoadCpuExtensions(triton::common::TritonJson::Value& params)
{
std::string cpu_ext_path;
LOG_IF_ERROR(
ReadParameter(params, "CPU_EXTENSION_PATH", &(cpu_ext_path)),
"error when reading parameters");
RETURN_IF_ERROR(
ReadParameter(params, "CPU_EXTENSION_PATH", &(cpu_ext_path), ""));
if (!cpu_ext_path.empty()) {
// CPU (MKLDNN) extensions is loaded as a shared library and passed as a
// pointer to base extension
RETURN_IF_OPENVINO_ERROR(
ov_core_.add_extension(cpu_ext_path), " loading custom CPU extensions");
LOG_MESSAGE(
TRITONSERVER_LOG_INFO,
(std::string("CPU (MKLDNN) extensions is loaded") + cpu_ext_path)
.c_str());
(std::string("CPU extensions is loaded") + cpu_ext_path).c_str());
}

return nullptr;
@@ -301,8 +300,7 @@ ModelState::ParseBoolParameter(
bool* setting)
{
std::string value;
LOG_IF_ERROR(
ReadParameter(params, mkey, &(value)), "error when reading parameters");
RETURN_IF_ERROR(ReadParameter(params, mkey, &(value), ""));
std::transform(
value.begin(), value.end(), value.begin(),
[](unsigned char c) { return std::tolower(c); });
@@ -313,14 +311,30 @@
return nullptr;
}

TRITONSERVER_Error*
ModelState::ParseStringParameter(
const std::string& mkey, triton::common::TritonJson::Value& params,
std::string* setting)
{
std::string value;
RETURN_IF_ERROR(ReadParameter(params, mkey, &(value), ""));
std::transform(
value.begin(), value.end(), value.begin(),
[](unsigned char c) { return std::toupper(c); });
if (value.length() > 0) {
*setting = value;
}

return nullptr;
}

TRITONSERVER_Error*
ModelState::ParseParameter(
const std::string& mkey, triton::common::TritonJson::Value& params,
std::vector<std::pair<std::string, ov::Any>>* device_config)
{
std::string value;
LOG_IF_ERROR(
ReadParameter(params, mkey, &(value)), "error when reading parameters");
RETURN_IF_ERROR(ReadParameter(params, mkey, &(value), ""));
if (!value.empty()) {
std::pair<std::string, ov::Any> ov_property;
RETURN_IF_ERROR(ParseParameterHelper(mkey, &value, &ov_property));
@@ -410,6 +424,16 @@ ModelState::ParseParameterHelper(
TRITONSERVER_Error*
ModelState::ConfigureOpenvinoCore()
{
auto availableDevices = ov_core_.get_available_devices();
std::stringstream list_of_devices;

for (auto& element : availableDevices) {
list_of_devices << element << ",";
}
LOG_MESSAGE(
TRITONSERVER_LOG_VERBOSE,
(std::string("Available OpenVINO devices: " + list_of_devices.str()))
.c_str());
for (auto&& item : config_) {
std::string device_name = item.first;
std::vector<std::pair<std::string, ov::Any>> properties = item.second;
@@ -438,9 +462,10 @@ ModelState::LoadModel(
std::to_string(OPENVINO_VERSION_MINOR) + "." +
std::to_string(OPENVINO_VERSION_PATCH))
.c_str());

LOG_MESSAGE(
TRITONSERVER_LOG_VERBOSE,
(std::string("Device info: \n") +
(std::string("Device info: ") +
ConvertVersionMapToString(ov_core_.get_versions(device)))
.c_str());

@@ -932,19 +957,26 @@ ModelInstanceState::Create(
ModelInstanceState::ModelInstanceState(
ModelState* model_state, TRITONBACKEND_ModelInstance* triton_model_instance)
: BackendModelInstance(model_state, triton_model_instance),
model_state_(model_state), device_("CPU"), batch_pad_size_(0)
model_state_(model_state), device_(model_state->TargetDevice()),
batch_pad_size_(0)
{
if (Kind() != TRITONSERVER_INSTANCEGROUPKIND_CPU) {
throw triton::backend::BackendModelInstanceException(TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_INVALID_ARG,
(std::string("unable to load model '") + model_state_->Name() +
"', Triton openVINO backend supports only CPU device")
"', Triton OpenVINO backend supports only Kind CPU and AUTO")
.c_str()));
}

if (model_state_->ModelNotRead()) {
std::string model_path;
THROW_IF_BACKEND_INSTANCE_ERROR(model_state_->ParseParameters());
device_ = model_state->TargetDevice();
LOG_MESSAGE(
TRITONSERVER_LOG_INFO,
(std::string("Target device " + device_)).c_str());


THROW_IF_BACKEND_INSTANCE_ERROR(
model_state_->ReadModel(ArtifactFilename(), &model_path));
THROW_IF_BACKEND_INSTANCE_ERROR(model_state_->ValidateConfigureModel());
@@ -1519,8 +1551,7 @@ TRITONBACKEND_ModelInstanceInitialize(TRITONBACKEND_ModelInstance* instance)
LOG_MESSAGE(
TRITONSERVER_LOG_INFO,
(std::string("TRITONBACKEND_ModelInstanceInitialize: ") + name + " (" +
TRITONSERVER_InstanceGroupKindString(kind) + " device " +
std::to_string(device_id) + ")")
TRITONSERVER_InstanceGroupKindString(kind) + ")")
.c_str());

// Get the model state associated with this instance's model.
11 changes: 6 additions & 5 deletions src/openvino_utils.cc
@@ -277,13 +277,14 @@ CompareDimsSupported(
TRITONSERVER_Error*
ReadParameter(
triton::common::TritonJson::Value& params, const std::string& key,
std::string* param)
std::string* param, const std::string default_value)
{
triton::common::TritonJson::Value value;
RETURN_ERROR_IF_FALSE(
params.Find(key.c_str(), &value), TRITONSERVER_ERROR_INVALID_ARG,
std::string("model configuration is missing the parameter ") + key);
RETURN_IF_ERROR(value.MemberAsString("string_value", param));
if (params.Find(key.c_str(), &value)) {
RETURN_IF_ERROR(value.MemberAsString("string_value", param));
} else {
*param = default_value;
}
return nullptr; // success
}

2 changes: 1 addition & 1 deletion src/openvino_utils.h
@@ -97,7 +97,7 @@ TRITONSERVER_Error* CompareDimsSupported(

TRITONSERVER_Error* ReadParameter(
triton::common::TritonJson::Value& params, const std::string& key,
std::string* param);
std::string* param, const std::string default_value);

std::vector<int64_t> ConvertToSignedShape(const ov::PartialShape& shape);

2 changes: 1 addition & 1 deletion tools/gen_openvino_dockerfile.py
@@ -106,7 +106,7 @@ def dockerfile_for_linux(output_file):
cp -r /workspace/install/runtime/include/* include/.
RUN mkdir -p lib && \
cp -P /workspace/install/runtime/lib/intel64/*.so* lib/. && \
cp -P /workspace/install/runtime/3rdparty/tbb/lib/libtbb.so* lib/. \
cp -P /workspace/install/runtime/lib/intel64/libopenvino*.so* lib/.
"""

df += """