Cluster restart model auto redeploy #1627

zane-neo · 2023-11-14T11:45:54Z

Description

Support automatic model redeployment after a cluster-level restart.

Issues Resolved

Check List

- New functionality includes testing.
  - All tests pass
- New functionality has been documented.
  - New functionality has javadoc added
- Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: zane-neo <[email protected]>

codecov · 2023-11-15T02:46:58Z

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (6efb1eb) 80.72% compared to head (226fae3) 80.69%.
Report is 1 commits behind head on 2.x.

Files	Patch %	Lines
.../cluster/MLCommonsClusterManagerEventListener.java	73.33%	3 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##                2.x    #1627      +/-   ##
============================================
- Coverage     80.72%   80.69%   -0.03%     
- Complexity     4181     4183       +2     
============================================
  Files           399      399              
  Lines         16816    16839      +23     
  Branches       1815     1816       +1     
============================================
+ Hits          13574    13589      +15     
- Misses         2527     2536       +9     
+ Partials        715      714       -1

Flag	Coverage Δ
ml-commons	`80.69% <85.18%> (-0.03%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.


Zhangxunmt

Is the flag plugins.ml_commons.model_auto_redeploy.enable for cluster level or node level after this change?

Zhangxunmt · 2023-11-20T23:48:33Z

plugin/src/main/java/org/opensearch/ml/plugin/MachineLearningPlugin.java

@@ -685,6 +686,7 @@ public List<Setting<?>> getSettings() {
                MLCommonsSettings.ML_COMMONS_ENABLE_INHOUSE_PYTHON_MODEL,
                MLCommonsSettings.ML_COMMONS_MODEL_AUTO_REDEPLOY_ENABLE,
                MLCommonsSettings.ML_COMMONS_MODEL_AUTO_REDEPLOY_LIFETIME_RETRY_TIMES,
+                MLCommonsSettings.ML_COMMONS_MODEL_AUTO_REDEPLOY_SUCCESS_RATIO,


This setting doesn't seem to be used anywhere in this PR?

zane-neo · Nov 21, 2023

This setting wasn't exposed when the auto redeploy feature was implemented, so this PR exposes it now to let customers change the ratio value.
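
For readers unfamiliar with the idea of a success ratio, the following is a rough sketch of how such a threshold could be applied. The thread does not describe the actual check, so the semantics here (share of target nodes that redeployed successfully, compared against the configured ratio) are an assumption, and the class and method names are hypothetical rather than taken from ml-commons.

```java
// Hypothetical illustration only: not the ml-commons implementation.
class RedeploySuccessRatioSketch {

    /**
     * Returns true when the share of target nodes that redeployed the model
     * successfully is at least the configured success ratio (assumed semantics).
     */
    static boolean isRedeploySuccessful(int successfulNodes, int targetNodes, float successRatio) {
        if (targetNodes <= 0) {
            throw new IllegalArgumentException("targetNodes must be positive");
        }
        if (successfulNodes < 0 || successfulNodes > targetNodes) {
            throw new IllegalArgumentException("successfulNodes must be between 0 and targetNodes");
        }
        if (successRatio < 0f || successRatio > 1f) {
            throw new IllegalArgumentException("successRatio must be between 0 and 1");
        }
        return (float) successfulNodes / targetNodes >= successRatio;
    }

    public static void main(String[] args) {
        // 3 of 4 nodes succeeded, compared against an example ratio of 0.75.
        System.out.println(isRedeploySuccessful(3, 4, 0.75f));
        // 3 of 5 nodes succeeded, compared against an example ratio of 0.8.
        System.out.println(isRedeploySuccessful(3, 5, 0.8f));
    }
}
```

Under this reading, a lower ratio would tolerate more failed nodes before a redeploy attempt counts as failed.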

ylwu-amzn · 2023-11-21T00:38:30Z

plugin/src/main/java/org/opensearch/ml/cluster/MLCommonsClusterManagerEventListener.java

+        threadPool
+            .schedule(
+                () -> mlModelAutoReDeployer.buildAutoReloadArrangement(List.of(localNodeId), localNodeId),
+                TimeValue.timeValueSeconds(jobInterval),


Can we start auto reload immediately?

zane-neo · Nov 21, 2023

No. The cluster manager being ready doesn't mean the cluster is ready. If we start auto reload immediately, we can get a "blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];" exception when querying for the models that need to be reloaded.
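
As a minimal sketch of the delayed start described above, the snippet below schedules a one-shot task after a delay instead of running it right away. It uses the JDK's ScheduledExecutorService as a stand-in for the OpenSearch thread pool; the class name DelayedReloadSketch, the method scheduleReload, and the 100 ms delay are hypothetical and are not taken from this PR.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only: a JDK scheduler stands in for the plugin's thread pool.
class DelayedReloadSketch {

    /** Schedules the reload task to run once, after delayMillis, rather than immediately. */
    static ScheduledFuture<?> scheduleReload(ScheduledExecutorService scheduler, Runnable reloadTask, long delayMillis) {
        return scheduler.schedule(reloadTask, delayMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            ScheduledFuture<?> pending = scheduleReload(
                scheduler,
                () -> System.out.println("querying models that need reload"),
                100
            );
            // Block until the delayed task has run.
            pending.get();
        } finally {
            scheduler.shutdown();
        }
    }
}
```

A fixed delay only gives the cluster time to recover its state; it does not by itself confirm that recovery has finished.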

zane-neo · 2023-11-22T00:56:58Z

> Is the flag plugins.ml_commons.model_auto_redeploy.enable for cluster level or node level after this change?

Both levels. This change is more of an enhancement to model auto redeploy so that it also supports the cluster level.

dhrubo-os · 2023-11-22T19:37:04Z

Just curious whether this will also redeploy the hidden model (what we are building now)? Currently, only a domain manager with the admin SSL certificate can deploy the hidden model.

ylwu-amzn · 2023-11-22T23:44:34Z

> Just curious whether this will also redeploy the hidden model (what we are building now)? Currently, only a domain manager with the admin SSL certificate can deploy the hidden model.

I think we have no way to support hidden model auto redeployment; the domain manager should build some logic outside of the cluster.

* Fix cluster level restart model not auto redeploy issue
* Fix cluster level restart model not auto redeploy issuee
* remove unuseful changes
* format code
* Add auto redeploy success ratio configuration
* Add start cron job log

Signed-off-by: zane-neo <[email protected]>
(cherry picked from commit 034a212)


zane-neo added 2 commits November 14, 2023 11:24

Fix cluster level restart model not auto redeploy issue

bc39114

Signed-off-by: zane-neo <[email protected]>

Fix cluster level restart model not auto redeploy issuee

9a50c61

Signed-off-by: zane-neo <[email protected]>

zane-neo had a problem deploying to ml-commons-cicd-env November 14, 2023 11:46 — with GitHub Actions Error

zane-neo had a problem deploying to ml-commons-cicd-env November 14, 2023 11:46 — with GitHub Actions Failure

remove unuseful changes

a8c71f1

Signed-off-by: zane-neo <[email protected]>

zane-neo had a problem deploying to ml-commons-cicd-env November 15, 2023 02:27 — with GitHub Actions Failure

zane-neo had a problem deploying to ml-commons-cicd-env November 15, 2023 02:27 — with GitHub Actions Error

format code

1b8a75c

Signed-off-by: zane-neo <[email protected]>

zane-neo temporarily deployed to ml-commons-cicd-env November 15, 2023 02:28 — with GitHub Actions Inactive

zane-neo had a problem deploying to ml-commons-cicd-env November 15, 2023 02:28 — with GitHub Actions Failure

zane-neo requested review from ylwu-amzn and dhrubo-os November 15, 2023 02:40

zane-neo had a problem deploying to ml-commons-cicd-env November 15, 2023 02:49 — with GitHub Actions Error

zane-neo had a problem deploying to ml-commons-cicd-env November 15, 2023 02:49 — with GitHub Actions Failure

Add auto redeploy success ratio configuration

459c438

Signed-off-by: zane-neo <[email protected]>

zane-neo had a problem deploying to ml-commons-cicd-env November 15, 2023 06:26 — with GitHub Actions Failure

zane-neo had a problem deploying to ml-commons-cicd-env November 15, 2023 06:26 — with GitHub Actions Error

zane-neo temporarily deployed to ml-commons-cicd-env November 15, 2023 06:26 — with GitHub Actions Inactive

zane-neo had a problem deploying to ml-commons-cicd-env November 15, 2023 06:47 — with GitHub Actions Error

zane-neo had a problem deploying to ml-commons-cicd-env November 15, 2023 06:47 — with GitHub Actions Failure

Add start cron job log

226fae3

Signed-off-by: zane-neo <[email protected]>

zane-neo had a problem deploying to ml-commons-cicd-env November 16, 2023 06:20 — with GitHub Actions Failure

zane-neo had a problem deploying to ml-commons-cicd-env November 16, 2023 06:20 — with GitHub Actions Error

zane-neo temporarily deployed to ml-commons-cicd-env November 16, 2023 06:20 — with GitHub Actions Inactive

zane-neo had a problem deploying to ml-commons-cicd-env November 16, 2023 06:42 — with GitHub Actions Error

zane-neo had a problem deploying to ml-commons-cicd-env November 16, 2023 06:42 — with GitHub Actions Failure

Zhangxunmt reviewed Nov 20, 2023

ylwu-amzn reviewed Nov 21, 2023

ylwu-amzn approved these changes Nov 22, 2023

Zhangxunmt approved these changes Nov 22, 2023

Zhangxunmt merged commit 034a212 into opensearch-project:2.x Nov 22, 2023
8 of 13 checks passed

zane-neo added the backport main label Dec 30, 2023

opensearch-trigger-bot bot mentioned this pull request Dec 30, 2023

[Backport main] Cluster restart model auto redeploy #1827

Merged

ylwu-amzn mentioned this pull request Feb 8, 2024

[BUG] Intermittent Model Deployment Issues in AWS OpenSearch: "Model not ready yet" Error #2050

Closed
