Add model auto redeploy tutorial #1175

Merged

@zane-neo zane-neo commented Aug 1, 2023

Description

[Describe what this change achieves]

Issues Resolved

[List any issues this PR will resolve]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@zane-neo zane-neo had a problem deploying to ml-commons-cicd-env August 1, 2023 06:28 — with GitHub Actions Failure
@zane-neo zane-neo temporarily deployed to ml-commons-cicd-env August 1, 2023 06:28 — with GitHub Actions Inactive
@zane-neo zane-neo requested a review from ylwu-amzn August 1, 2023 06:28
feature is pretty simple.

# Enable model auto redeploy
There's several configurations to control the model auto redeploy feature behavior including:

minor : There's -> There are

fixed.

codecov bot commented Aug 1, 2023

Codecov Report

Merging #1175 (86841ea) into 2.x (5b4bd51) will decrease coverage by 0.05%.
Report is 1 commits behind head on 2.x.
The diff coverage is n/a.

❗ Current head 86841ea differs from pull request most recent head 57ffa5d. Consider uploading reports for the commit 57ffa5d to get more accurate results

@@             Coverage Diff              @@
##                2.x    #1175      +/-   ##
============================================
- Coverage     78.94%   78.89%   -0.05%     
+ Complexity     2118     2116       -2     
============================================
  Files           167      167              
  Lines          8633     8633              
  Branches        869      869              
============================================
- Hits           6815     6811       -4     
- Misses         1425     1428       +3     
- Partials        393      394       +1     
Flag Coverage Δ
ml-commons 78.89% <ø> (-0.05%) ⬇️

see 1 file with indirect coverage changes

Signed-off-by: zane-neo <[email protected]>
@zane-neo zane-neo temporarily deployed to ml-commons-cicd-env August 1, 2023 09:39 — with GitHub Actions Inactive
ylwu-amzn previously approved these changes Aug 1, 2023
### plugins.ml_commons.model_auto_redeploy.enable
The default value of this configuration is false. Setting it to true enables the model auto redeploy feature.
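
As a minimal sketch of turning the feature on, assuming the setting can be updated dynamically through the standard cluster settings API (`PUT _cluster/settings`), the request body could be built as follows. The helper name `build_auto_redeploy_settings` is hypothetical, and the code only builds the JSON payload; it does not send a request.

```python
import json

# Setting name taken from the tutorial text.
ENABLE_KEY = "plugins.ml_commons.model_auto_redeploy.enable"


def build_auto_redeploy_settings(enabled: bool) -> dict:
    """Build a request body for PUT _cluster/settings (hypothetical helper).

    Assumption: the setting is dynamically updatable and is applied as a
    persistent cluster setting.
    """
    return {"persistent": {ENABLE_KEY: enabled}}


if __name__ == "__main__":
    # Print the payload that would be sent to the cluster settings endpoint.
    print(json.dumps(build_auto_redeploy_settings(True), indent=2))
```

The printed JSON can be pasted into any HTTP client pointed at the cluster; whether `persistent` or `transient` is the better scope depends on the deployment.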
### plugins.ml_commons.model_auto_redeploy.lifetime_retry_times
This configuration sets how many auto redeploy failures can be tolerated. The default value is 3, which means once 3 times of

Can we also provide the value range for these variables?

if greater than or equal to 80% of the nodes successfully redeploy a model, that model's auto redeployment is considered a success.
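
The 80% rule can be illustrated with a small sketch. The function name `redeploy_succeeded` and its parameters are hypothetical and not part of ml-commons. The example assumes the threshold is compared as "successful nodes divided by target nodes", and it uses integer arithmetic to avoid floating-point edge cases.

```python
def redeploy_succeeded(successful_nodes: int, total_nodes: int,
                       min_percent: int = 80) -> bool:
    """Return True when at least `min_percent` of nodes redeployed the model.

    Hypothetical helper: the 80 default mirrors the ratio described in the
    tutorial text; everything else is an assumption for illustration.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if not 0 <= successful_nodes <= total_nodes:
        raise ValueError("successful_nodes must be between 0 and total_nodes")
    # Equivalent to successful_nodes / total_nodes >= min_percent / 100,
    # written without division so the boundary case is exact.
    return successful_nodes * 100 >= min_percent * total_nodes
```

For example, 4 of 5 nodes meets the threshold exactly, while 3 of 5 does not.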

# Limitation
Model auto redeploy is to handle all the failure node cases, but it still has limitation. Under the hood, ml-commons use

How about:

The auto redeployment of models is designed to handle all cases involving node failures, but it does have its limitations. Under the hood, ml-commons uses a cron job to sync up the status of all models in a cluster. This cron job checks all the failed nodes and removes them from the internal model routing table (a mapping between model ID and operational node IDs) to ensure that requests won't be dispatched to crashed nodes.

However, there's one scenario that model auto redeployment cannot handle, which is a complete cluster restart. In this situation, if all nodes are shut down, the last live node's cron job will detect that all the models are not functioning correctly and update the models' status to DEPLOY_FAILED. The model auto redeploy won't check this status because it's not a valid redeployment status. In this case, the user will have to invoke the model deploy/load API.

For cases where only some nodes crash, once the plugins.ml_commons.model_auto_redeploy.enable configuration is set to true, the models will automatically redeploy on those crashed nodes.
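
To make the two scenarios concrete, here is a rough sketch of the sync behavior described above. It is not the ml-commons implementation: the function names are hypothetical, `DEPLOYED` is a placeholder for any healthy status, and the eligibility check is reduced to the single rule stated in this comment (a model in `DEPLOY_FAILED` is not picked up by auto redeploy).

```python
from typing import Dict, Set


def sync_routing_table(routing_table: Dict[str, Set[str]],
                       live_nodes: Set[str]) -> Dict[str, str]:
    """Remove failed nodes from each model's entry and report a status.

    routing_table maps model ID -> IDs of nodes serving that model.
    """
    statuses: Dict[str, str] = {}
    for model_id, nodes in routing_table.items():
        remaining = nodes & live_nodes  # keep only nodes that are still up
        routing_table[model_id] = remaining
        # With no serving node left, the model is marked DEPLOY_FAILED.
        statuses[model_id] = "DEPLOYED" if remaining else "DEPLOY_FAILED"
    return statuses


def eligible_for_auto_redeploy(status: str) -> bool:
    """Simplified assumption: only DEPLOY_FAILED is excluded."""
    return status != "DEPLOY_FAILED"
```

With a partial failure, a model keeps at least one serving node and stays eligible. After a full cluster restart, `live_nodes` is effectively empty from the last node's point of view, every model ends up `DEPLOY_FAILED`, and the deploy API has to be called manually.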

@zane-neo zane-neo merged commit 7bdc5e4 into opensearch-project:2.x Aug 2, 2023
zane-neo added a commit to zane-neo/ml-commons that referenced this pull request Sep 1, 2023
* Add model auto redeploy tutorial

Signed-off-by: zane-neo <[email protected]>

* Fix minor issue

Signed-off-by: zane-neo <[email protected]>

* Change to recomanded statement

Signed-off-by: zane-neo <[email protected]>

---------

Signed-off-by: zane-neo <[email protected]>
zane-neo added a commit that referenced this pull request Sep 1, 2023