Add model auto redeploy tutorial #1175
Signed-off-by: zane-neo <[email protected]>
feature is pretty simple.

# Enable model auto redeploy
There's several configurations to control the model auto redeploy feature behavior including:
minor: There's -> There are
fixed.
Codecov Report
@@             Coverage Diff              @@
##                2.x    #1175      +/-   ##
============================================
- Coverage     78.94%   78.89%   -0.05%
+ Complexity     2118     2116       -2
============================================
  Files           167      167
  Lines          8633     8633
  Branches        869      869
============================================
- Hits           6815     6811       -4
- Misses         1425     1428       +3
- Partials        393      394       +1
Flags with carried forward coverage won't be shown.
Signed-off-by: zane-neo <[email protected]>
### plugins.ml_commons.model_auto_redeploy.enable
The default value of this configuration is false. Setting it to true enables the model auto redeploy feature.
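As a rough illustration of turning the setting on, the sketch below only builds the request that would be sent to the cluster settings API (`PUT /_cluster/settings`); it does not contact a cluster. The helper name `build_enable_request` is hypothetical, and the sketch assumes the setting is dynamic and accepted under `persistent`, which the tutorial text shown here does not confirm.

```python
import json

# Setting name taken from the tutorial under review.
AUTO_REDEPLOY_ENABLE = "plugins.ml_commons.model_auto_redeploy.enable"


def build_enable_request(enabled: bool = True) -> dict:
    """Describe a cluster settings update without sending it (hypothetical helper)."""
    return {
        "method": "PUT",
        "path": "/_cluster/settings",
        # Assumption: the setting is dynamic and can be stored as a persistent setting.
        "body": {"persistent": {AUTO_REDEPLOY_ENABLE: enabled}},
    }


if __name__ == "__main__":
    request = build_enable_request(True)
    print(request["method"], request["path"])
    print(json.dumps(request["body"], indent=2))
```

The resulting body can be passed to whichever HTTP client is already used to talk to the cluster.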
### plugins.ml_commons.model_auto_redeploy.lifetime_retry_times
This configuration sets how many auto redeploy failures can be tolerated. The default value is 3, which means once 3 times of
Can we also provide the value range for these variables?
if 80% or more of the nodes successfully redeployed a model, that model's auto redeployment is a success.
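The 80% rule can be sketched as a small check. This is only an illustration of the arithmetic, not the plugin's code: the function name and the `required_ratio` parameter are hypothetical, and treating an empty node list as a failure is an assumption.

```python
def is_redeploy_success(successful_nodes: int, total_nodes: int,
                        required_ratio: float = 0.8) -> bool:
    """Return True when enough nodes redeployed the model (hypothetical helper)."""
    if total_nodes <= 0:
        # Assumption: with no target nodes there is nothing to count as a success.
        return False
    # "Greater than or equal to 80%" from the tutorial text.
    return successful_nodes / total_nodes >= required_ratio


if __name__ == "__main__":
    print(is_redeploy_success(4, 5))  # 4 of 5 nodes is exactly 80%
    print(is_redeploy_success(3, 5))  # 3 of 5 nodes is 60%
```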
# Limitation
Model auto redeploy is to handle all the failure node cases, but it still has limitation. Under the hood, ml-commons use
How about:
The auto redeployment of models is designed to handle all cases involving node failures, but it does have its limitations. Under the hood, ml-commons uses a cron job to sync up the status of all models in a cluster. This cron job checks all the failed nodes and removes them from the internal model routing table (a mapping between model ID and operational node IDs) to ensure that requests won't be dispatched to crashed nodes.
However, there's one scenario that model auto redeployment cannot handle, which is a complete cluster restart. In this situation, if all nodes are shut down, the last live node's cron job will detect that all the models are not functioning correctly and update the models' status to DEPLOY_FAILED. The model auto redeploy won't check this status because it's not a valid redeployment status. In this case, the user will have to invoke the model deploy/load API.
For cases where only some nodes crash, once the plugins.ml_commons.model_auto_redeploy.enable configuration is set to true, the models will automatically redeploy on those crashed nodes.
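To make the sync behavior described in this suggestion concrete, here is a minimal sketch of pruning failed nodes from a routing table. It is not the ml-commons implementation: the function name, the data shapes, and the second return value are hypothetical, and only the `DEPLOY_FAILED` status name comes from the text above.

```python
from typing import Dict, Iterable, Set, Tuple

# Status name quoted in the review comment.
DEPLOY_FAILED = "DEPLOY_FAILED"


def sync_routing_table(
    routing_table: Dict[str, Set[str]],
    live_nodes: Iterable[str],
) -> Tuple[Dict[str, Set[str]], Set[str]]:
    """Drop failed nodes from a model-ID-to-node-IDs mapping (hypothetical helper).

    Returns the pruned table and the IDs of models left with no operational node.
    """
    live = set(live_nodes)
    pruned: Dict[str, Set[str]] = {}
    models_without_nodes: Set[str] = set()
    for model_id, node_ids in routing_table.items():
        remaining = set(node_ids) & live
        if remaining:
            pruned[model_id] = remaining
        else:
            models_without_nodes.add(model_id)
    return pruned, models_without_nodes


if __name__ == "__main__":
    table = {"model-a": {"node-1", "node-2"}, "model-b": {"node-2"}}
    # Full restart: no node is live, so every model loses all of its nodes.
    pruned, lost = sync_routing_table(table, live_nodes=[])
    for model_id in sorted(lost):
        print(model_id, "->", DEPLOY_FAILED, "(needs a manual deploy/load call)")
```

In the full-restart case every model ends up in the second set, which corresponds to the situation where the deploy/load API has to be invoked by the user.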
Signed-off-by: zane-neo <[email protected]>
* Add model auto redeploy tutorial
Signed-off-by: zane-neo <[email protected]>
* Fix minor issue
Signed-off-by: zane-neo <[email protected]>
* Change to recomanded statement
Signed-off-by: zane-neo <[email protected]>
---------
Signed-off-by: zane-neo <[email protected]>
Description
[Describe what this change achieves]
Issues Resolved
[List any issues this PR will resolve]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.