From 4814bb6513ef517d497fc88e06cb04504a5854e0 Mon Sep 17 00:00:00 2001 From: zane-neo Date: Tue, 1 Aug 2023 14:27:27 +0800 Subject: [PATCH 1/3] Add model auto redeploy tutorial Signed-off-by: zane-neo --- docs/tutorials/model_auto_redeploy.md | 66 +++++++++++++++++++++++++++ 1 file changed, 66 insertions(+) create mode 100644 docs/tutorials/model_auto_redeploy.md diff --git a/docs/tutorials/model_auto_redeploy.md b/docs/tutorials/model_auto_redeploy.md new file mode 100644 index 0000000000..1cdafe7acb --- /dev/null +++ b/docs/tutorials/model_auto_redeploy.md @@ -0,0 +1,66 @@ +# Topic +This doc explains how to use model auto redeploy feature in ml-commons(This doc works for OpenSearch 2.8+). + +# Background +After ml-commons support serving model inside OpenSearch cluster, we need to take care the node failure case for the +deployed models, this can save user's effort to maintain the model's availability. For example, once a node is down +caused by arbitrary reason and then restarted either by script or manually, if there isn't the model auto redeploy +feature, the model runs on a smaller cluster(expected nodes - 1) which could cause more nodes failure since each working +node is handling more traffic than it expected. To address this we introduced model auto redeploy feature and enabling this +feature is pretty simple. + +# Enable model auto redeploy +There's several configurations to control the model auto redeploy feature behavior including: +### plugins.ml_commons.model_auto_redeploy.enable +The default value of this configuration is false, once it's set to true, it means the model auto redeploy feature is enabled. +### plugins.ml_commons.model_auto_redeploy.lifetime_retry_times +This configuration means how many auto redeploy failure we can tolerate, default value is 3 which means once 3 times of +`failure auto redeploy` reached, the system will not retry auto redeploy anymore. But once auto redeploy is successful, +the retry time will be reset to 0. +### plugins.ml_commons.model_auto_redeploy_success_ratio +This configuration means how to determine if an auto redeployment is success or not. Since node failure is random, we can't +make sure model auto redeploy can be successful at any time in a cluster, so if most of the expected working nodes have +been successfully redeployed the model, the retry is considered successful. The default value of this is 0.8 which means +if 80% greater or equals 80% nodes successfully redeployed a model, that model's auto redeployment is success. + +# Limitation +Model auto redeploy is to handle all the failure node cases, but it still has limitation. Under the hood, ml-commons use +cron job to sync up all model's status in a cluster, the cron job checks all the failure nodes and remove the failure nodes +in the internal model routing table(this is a mapping between model id and working node ids) to make sure the request won't +be dispatched to crash nodes. +So one case model auto redeploy can't handle is the whole cluster restart, once all nodes are shut down, the last live +node's cron job will detect that all the models are not working correctly and update the model's status to `DEPLOY_FAILED`. +Model auto redeploy won't check this status since this is not a valid redeployment status. In this case, user has to invoke +the [model deploy/load API](https://opensearch.org/docs/latest/ml-commons-plugin/api/#deploying-a-model). +For partial nodes crash case, once the `plugins.ml_commons.model_auto_redeploy.enable` configuration is set to true, the +models will automatically redeploy on those crash nodes. + +# Example +An example to enable the model auto redeploy feature is via changing the configuration like below: +``` +PUT /_cluster/settings +{ + "persistent" : { + "plugins.ml_commons.model_auto_redeploy.enable" : true + } +} +``` +One can also change other two configuration to get desire behavior like below: +``` +Changes the life-time retry times to 10: +PUT /_cluster/settings +{ + "persistent" : { + "plugins.ml_commons.model_auto_redeploy.lifetime_retry_times" : 10 + } +} + +Change the determination of success to 70% expected work nodes in the cluster: +PUT /_cluster/settings +{ + "persistent" : { + "plugins.ml_commons.model_auto_redeploy.lifetime_retry_times" : 10 + } +} +``` + From 86841eaeb803e229b943256a85b23c0664dd2f15 Mon Sep 17 00:00:00 2001 From: zane-neo Date: Tue, 1 Aug 2023 17:39:26 +0800 Subject: [PATCH 2/3] Fix minor issue Signed-off-by: zane-neo --- docs/tutorials/model_auto_redeploy.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tutorials/model_auto_redeploy.md b/docs/tutorials/model_auto_redeploy.md index 1cdafe7acb..92160ceff2 100644 --- a/docs/tutorials/model_auto_redeploy.md +++ b/docs/tutorials/model_auto_redeploy.md @@ -10,7 +10,7 @@ node is handling more traffic than it expected. To address this we introduced mo feature is pretty simple. # Enable model auto redeploy -There's several configurations to control the model auto redeploy feature behavior including: +There are several configurations to control the model auto redeploy feature behavior including: ### plugins.ml_commons.model_auto_redeploy.enable The default value of this configuration is false, once it's set to true, it means the model auto redeploy feature is enabled. ### plugins.ml_commons.model_auto_redeploy.lifetime_retry_times From 57ffa5df203ba2945a356a9d12fc5249dd778ed3 Mon Sep 17 00:00:00 2001 From: zane-neo Date: Wed, 2 Aug 2023 08:15:25 +0800 Subject: [PATCH 3/3] Change to recomanded statement Signed-off-by: zane-neo --- docs/tutorials/model_auto_redeploy.md | 38 ++++++++++++++++----------- 1 file changed, 22 insertions(+), 16 deletions(-) diff --git a/docs/tutorials/model_auto_redeploy.md b/docs/tutorials/model_auto_redeploy.md index 92160ceff2..f23124848c 100644 --- a/docs/tutorials/model_auto_redeploy.md +++ b/docs/tutorials/model_auto_redeploy.md @@ -12,28 +12,34 @@ feature is pretty simple. # Enable model auto redeploy There are several configurations to control the model auto redeploy feature behavior including: ### plugins.ml_commons.model_auto_redeploy.enable -The default value of this configuration is false, once it's set to true, it means the model auto redeploy feature is enabled. +The default value of this configuration is false, value range is: [true, false], once it's set to true, +it means the model auto redeploy feature is enabled. ### plugins.ml_commons.model_auto_redeploy.lifetime_retry_times -This configuration means how many auto redeploy failure we can tolerate, default value is 3 which means once 3 times of -`failure auto redeploy` reached, the system will not retry auto redeploy anymore. But once auto redeploy is successful, -the retry time will be reset to 0. +This configuration means how many auto redeploy failure we can tolerate, value range is: [0, Integer.MAX_VALUE], +default value is 3 which means once 3 times of `failure auto redeploy` reached, the system will not retry auto +redeploy anymore. But once auto redeploy is successful, the retry time will be reset to 0. ### plugins.ml_commons.model_auto_redeploy_success_ratio This configuration means how to determine if an auto redeployment is success or not. Since node failure is random, we can't make sure model auto redeploy can be successful at any time in a cluster, so if most of the expected working nodes have -been successfully redeployed the model, the retry is considered successful. The default value of this is 0.8 which means -if 80% greater or equals 80% nodes successfully redeployed a model, that model's auto redeployment is success. +been successfully redeployed the model, the retry is considered successful. The value range is: [0, 1], and the default +value of this is 0.8 which means if 80% greater or equals 80% nodes successfully redeployed a model, that model's auto +redeployment is success. # Limitation -Model auto redeploy is to handle all the failure node cases, but it still has limitation. Under the hood, ml-commons use -cron job to sync up all model's status in a cluster, the cron job checks all the failure nodes and remove the failure nodes -in the internal model routing table(this is a mapping between model id and working node ids) to make sure the request won't -be dispatched to crash nodes. -So one case model auto redeploy can't handle is the whole cluster restart, once all nodes are shut down, the last live -node's cron job will detect that all the models are not working correctly and update the model's status to `DEPLOY_FAILED`. -Model auto redeploy won't check this status since this is not a valid redeployment status. In this case, user has to invoke -the [model deploy/load API](https://opensearch.org/docs/latest/ml-commons-plugin/api/#deploying-a-model). -For partial nodes crash case, once the `plugins.ml_commons.model_auto_redeploy.enable` configuration is set to true, the -models will automatically redeploy on those crash nodes. +The auto redeployment of models is designed to handle all cases involving node failures, but it does have its limitations. +Under the hood, ml-commons uses a cron job to sync up the status of all models in a cluster. +This cron job checks all the failed nodes and removes them from the internal model routing +table (a mapping between model ID and operational node IDs) to ensure that requests won't be dispatched to +crashed nodes. + +However, there's one scenario that model auto redeployment cannot handle, which is a complete cluster restart. +In this situation, if all nodes are shut down, the last live node's cron job will detect that all the models +are not functioning correctly and update the models' status to DEPLOY_FAILED. +The model auto redeploy won't check this status because it's not a valid redeployment status. +In this case, the user will have to invoke the model deploy/load API. + +For cases where only some nodes crash, once the plugins.ml_commons.model_auto_redeploy.enable configuration +is set to true, the models will automatically redeploy on those crashed nodes. # Example An example to enable the model auto redeploy feature is via changing the configuration like below: