Cluster restart model auto redeploy #1627
Signed-off-by: zane-neo <[email protected]>
Codecov Report
Additional details and impacted files
@@             Coverage Diff              @@
##                2.x    #1627      +/-   ##
============================================
- Coverage     80.72%   80.69%   -0.03%
- Complexity     4181     4183       +2
============================================
  Files           399      399
  Lines         16816    16839      +23
  Branches       1815     1816       +1
============================================
+ Hits          13574    13589      +15
- Misses         2527     2536       +9
+ Partials        715      714       -1
Is the flag plugins.ml_commons.model_auto_redeploy.enable cluster-level or node-level after this change?
@@ -685,6 +686,7 @@ public List<Setting<?>> getSettings() {
                MLCommonsSettings.ML_COMMONS_ENABLE_INHOUSE_PYTHON_MODEL,
                MLCommonsSettings.ML_COMMONS_MODEL_AUTO_REDEPLOY_ENABLE,
                MLCommonsSettings.ML_COMMONS_MODEL_AUTO_REDEPLOY_LIFETIME_RETRY_TIMES,
                MLCommonsSettings.ML_COMMONS_MODEL_AUTO_REDEPLOY_SUCCESS_RATIO,
This setting is not used anywhere in this PR?
This setting was not exposed when the auto redeploy feature was first implemented, so this PR exposes it to let customers change the ratio value.
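To illustrate what a configurable success ratio could control, here is a minimal sketch, not the plugin's actual code. It assumes the ratio is compared against the fraction of target nodes on which the model deployed; the class name, method name, and the 0.8 value are hypothetical and chosen only for illustration.

```java
public class RedeploySuccessRatioSketch {

    /**
     * Hypothetical helper: reports whether a redeploy attempt counts as
     * successful, given how many target nodes ended up with the model deployed.
     */
    static boolean meetsSuccessRatio(int deployedNodes, int targetNodes, double requiredRatio) {
        if (targetNodes <= 0) {
            // No target nodes means there is nothing to measure against.
            return false;
        }
        return (double) deployedNodes / targetNodes >= requiredRatio;
    }

    public static void main(String[] args) {
        // Assumed ratio for illustration only; the real value comes from the cluster setting.
        double requiredRatio = 0.8;
        System.out.println(meetsSuccessRatio(4, 5, requiredRatio));
        System.out.println(meetsSuccessRatio(3, 5, requiredRatio));
    }
}
```

Exposing the value as a setting would let an operator relax or tighten this threshold without a code change.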
threadPool
    .schedule(
        () -> mlModelAutoReDeployer.buildAutoReloadArrangement(List.of(localNodeId), localNodeId),
        TimeValue.timeValueSeconds(jobInterval),
Can we start auto reload immediately?
No. The cluster manager being ready doesn't mean the cluster is ready. If we start auto reload immediately, we can get a "blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];" exception when querying for the models that need to be reloaded.
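The delayed start described in this reply can be sketched with the standard java.util.concurrent scheduler. This is only an illustration under stated assumptions: ScheduledExecutorService stands in for the OpenSearch ThreadPool used in the PR, and the class name, method name, and 200 ms delay are hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class DelayedRedeploySketch {

    /**
     * Hypothetical helper: runs the redeploy task once after a delay instead of
     * immediately, giving cluster state time to recover first.
     */
    static ScheduledFuture<?> scheduleAfterDelay(
        ScheduledExecutorService scheduler,
        Runnable redeployTask,
        long delayMillis
    ) {
        return scheduler.schedule(redeployTask, delayMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            // Placeholder task; in the PR the task builds the auto reload arrangement.
            Runnable redeployTask = () -> System.out.println("querying models that need reload");
            ScheduledFuture<?> pending = scheduleAfterDelay(scheduler, redeployTask, 200);
            pending.get(); // Block until the delayed task has run.
        } finally {
            scheduler.shutdown();
        }
    }
}
```

A fixed delay is the simplest option; it does not verify that cluster state has actually recovered when the task fires.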
Both levels (cluster and node). This change is more of an enhancement to model auto redeploy so that it also supports the cluster level.
Just curious whether this will also redeploy the hidden model (what we are building now). Currently only a domain manager with the admin SSL certificate can deploy the hidden model.
I think we have no way to support hidden model auto redeployment; the domain manager should build that logic outside of the cluster.
* Fix cluster level restart model not auto redeploy issue
Signed-off-by: zane-neo <[email protected]>
* Fix cluster level restart model not auto redeploy issuee
Signed-off-by: zane-neo <[email protected]>
* remove unuseful changes
Signed-off-by: zane-neo <[email protected]>
* format code
Signed-off-by: zane-neo <[email protected]>
* Add auto redeploy success ratio configuration
Signed-off-by: zane-neo <[email protected]>
* Add start cron job log
Signed-off-by: zane-neo <[email protected]>
---------
Signed-off-by: zane-neo <[email protected]>
(cherry picked from commit 034a212)
…arch-project#1827)
* Fix cluster level restart model not auto redeploy issue
Signed-off-by: zane-neo <[email protected]>
* Fix cluster level restart model not auto redeploy issuee
Signed-off-by: zane-neo <[email protected]>
* remove unuseful changes
Signed-off-by: zane-neo <[email protected]>
* format code
Signed-off-by: zane-neo <[email protected]>
* Add auto redeploy success ratio configuration
Signed-off-by: zane-neo <[email protected]>
* Add start cron job log
Signed-off-by: zane-neo <[email protected]>
---------
Signed-off-by: zane-neo <[email protected]>
(cherry picked from commit 034a212)
Co-authored-by: zane-neo <[email protected]>
Description
Support automatic model redeployment after a cluster-level restart.
Issues Resolved
#1641
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.