Cluster restart model auto redeploy #1627

Merged

Conversation


@zane-neo zane-neo commented Nov 14, 2023

Description

Support automatic model redeployment after a cluster-level restart.

Issues Resolved

#1641

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: zane-neo <[email protected]>
Signed-off-by: zane-neo <[email protected]>
@zane-neo zane-neo temporarily deployed to ml-commons-cicd-env November 15, 2023 02:28 — with GitHub Actions Inactive

codecov bot commented Nov 15, 2023

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (6efb1eb) 80.72% compared to head (226fae3) 80.69%.
Report is 1 commits behind head on 2.x.

Files Patch % Lines
.../cluster/MLCommonsClusterManagerEventListener.java 73.33% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##                2.x    #1627      +/-   ##
============================================
- Coverage     80.72%   80.69%   -0.03%     
- Complexity     4181     4183       +2     
============================================
  Files           399      399              
  Lines         16816    16839      +23     
  Branches       1815     1816       +1     
============================================
+ Hits          13574    13589      +15     
- Misses         2527     2536       +9     
+ Partials        715      714       -1     
Flag Coverage Δ
ml-commons 80.69% <85.18%> (-0.03%) ⬇️


Signed-off-by: zane-neo <[email protected]>

@Zhangxunmt Zhangxunmt left a comment


Is the flag plugins.ml_commons.model_auto_redeploy.enable for cluster level or node level after this change?

@@ -685,6 +686,7 @@ public List<Setting<?>> getSettings() {
MLCommonsSettings.ML_COMMONS_ENABLE_INHOUSE_PYTHON_MODEL,
MLCommonsSettings.ML_COMMONS_MODEL_AUTO_REDEPLOY_ENABLE,
MLCommonsSettings.ML_COMMONS_MODEL_AUTO_REDEPLOY_LIFETIME_RETRY_TIMES,
MLCommonsSettings.ML_COMMONS_MODEL_AUTO_REDEPLOY_SUCCESS_RATIO,

This setting is not used anywhere in this PR?

Collaborator Author

This setting wasn't exposed when the auto redeploy feature was implemented, so this PR exposes it to let customers change the ratio value.
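
For illustration only, here is a minimal Java sketch of how a success ratio like this could be applied. It assumes the ratio is the minimum fraction of target nodes on which redeploy must succeed, which this thread does not confirm. The class and method names are hypothetical and are not taken from ml-commons.

```java
// Hypothetical helper, not part of ml-commons.
class RedeploySuccessRatio {

    /**
     * Returns true when the share of nodes that redeployed successfully
     * reaches the configured ratio.
     * ASSUMPTION: the ratio means "minimum fraction of target nodes".
     */
    static boolean meetsSuccessRatio(int succeededNodes, int targetNodes, float requiredRatio) {
        if (requiredRatio < 0f || requiredRatio > 1f) {
            throw new IllegalArgumentException("requiredRatio must be within [0, 1]");
        }
        if (targetNodes <= 0) {
            // Nothing to deploy to, so the redeploy cannot count as successful.
            return false;
        }
        return (float) succeededNodes / targetNodes >= requiredRatio;
    }

    public static void main(String[] args) {
        System.out.println(meetsSuccessRatio(4, 5, 0.8f));
        System.out.println(meetsSuccessRatio(3, 5, 0.8f));
    }
}
```

Under that assumption, lowering the ratio would let a redeploy count as successful even when some nodes fail to load the model.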

threadPool
.schedule(
() -> mlModelAutoReDeployer.buildAutoReloadArrangement(List.of(localNodeId), localNodeId),
TimeValue.timeValueSeconds(jobInterval),

Can we start auto reload immediately?

Collaborator Author

@zane-neo zane-neo Nov 21, 2023

No. The cluster manager being ready doesn't mean the cluster is ready. If we start auto reload immediately, the query for models that need reloading can fail with a blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized]; exception.
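
To make the reasoning concrete, here is a small standalone Java sketch of deferring a task until a readiness check passes. It is not what this PR does (the PR schedules the reload once after a fixed interval); the readiness check, the retry limit, and all names below are hypothetical, and it uses the JDK's ScheduledExecutorService rather than the OpenSearch thread pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical illustration, not ml-commons code.
class DelayedAutoReload {

    /**
     * Waits delayMillis, then runs reloadTask if the readiness check passes.
     * Otherwise it re-schedules itself until maxAttempts checks have been made.
     */
    static void scheduleWhenReady(ScheduledExecutorService scheduler, BooleanSupplier stateRecovered,
                                  Runnable reloadTask, long delayMillis, int maxAttempts) {
        scheduler.schedule(() -> {
            if (stateRecovered.getAsBoolean()) {
                reloadTask.run();
            } else if (maxAttempts > 1) {
                scheduleWhenReady(scheduler, stateRecovered, reloadTask, delayMillis, maxAttempts - 1);
            }
        }, delayMillis, TimeUnit.MILLISECONDS);
    }

    /** Simulates a cluster whose state is recovered on the third check; returns the number of checks made. */
    static int demo() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger checks = new AtomicInteger();
        CountDownLatch reloaded = new CountDownLatch(1);
        try {
            scheduleWhenReady(scheduler, () -> checks.incrementAndGet() >= 3, reloaded::countDown, 10, 5);
            if (!reloaded.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("reload task never ran");
            }
            return checks.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("readiness checks before reload: " + demo());
    }
}
```

The point is only that the reload query runs after the cluster state is usable, whether that is ensured by a fixed delay or by an explicit check.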

@zane-neo
Collaborator Author

Is the flag plugins.ml_commons.model_auto_redeploy.enable for cluster level or node level after this change?

Both levels. This change is more of an enhancement to model auto redeploy, extending it to support cluster-level restarts.

@dhrubo-os

Just curious whether this will also redeploy the hidden model (which we are building now). Currently only a domain manager with the admin SSL certificate can deploy the hidden model.

@Zhangxunmt Zhangxunmt merged commit 034a212 into opensearch-project:2.x Nov 22, 2023
8 of 13 checks passed
@ylwu-amzn

Just curious whether this will also redeploy the hidden model (which we are building now). Currently only a domain manager with the admin SSL certificate can deploy the hidden model.

I think we have no way to support hidden model auto redeployment; the domain manager should build some logic outside of the cluster.

opensearch-trigger-bot bot pushed a commit that referenced this pull request Dec 30, 2023
* Fix cluster level restart model not auto redeploy issue

Signed-off-by: zane-neo <[email protected]>

* Fix cluster level restart model not auto redeploy issuee

Signed-off-by: zane-neo <[email protected]>

* remove unuseful changes

Signed-off-by: zane-neo <[email protected]>

* format code

Signed-off-by: zane-neo <[email protected]>

* Add auto redeploy success ratio configuration

Signed-off-by: zane-neo <[email protected]>

* Add start cron job log

Signed-off-by: zane-neo <[email protected]>

---------

Signed-off-by: zane-neo <[email protected]>
(cherry picked from commit 034a212)
ylwu-amzn pushed a commit that referenced this pull request Dec 30, 2023
(cherry picked from commit 034a212)

Co-authored-by: zane-neo <[email protected]>
austintlee pushed a commit to austintlee/ml-commons that referenced this pull request Mar 19, 2024
…arch-project#1827)

(cherry picked from commit 034a212)

Co-authored-by: zane-neo <[email protected]>
4 participants