avoid race condition in syncup model state refresh and handle NP of I… #2405

Zhangxunmt · 2024-05-03T18:34:52Z

Description

Avoid a race condition in the sync-up job between un-deploying expired models and refreshing model states. Without this change, the sync-up job sometimes sets the wrong model state after a model is un-deployed. Also default AutoDeploy to enabled when its value is null.

Verified and tested in a local 2.14 cluster.
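
For illustration, the null handling described above could look roughly like the sketch below. The class and method names are hypothetical and are not taken from this PR's diff; the sketch only assumes that the flag is stored as a nullable `Boolean`.

```java
// Hypothetical sketch, not the ml-commons implementation.
final class AutoDeployDefaults {
    private AutoDeployDefaults() {}

    // Treat a missing (null) flag as "enabled"; otherwise use the stored value.
    static boolean isAutoDeployEnabled(Boolean configuredValue) {
        return configuredValue == null || configuredValue;
    }

    public static void main(String[] args) {
        // A null flag no longer risks a NullPointerException on unboxing.
        System.out.println(isAutoDeployEnabled(null));
        System.out.println(isAutoDeployEnabled(Boolean.FALSE));
    }
}
```

The explicit null check runs before unboxing, so callers can pass the raw stored value without guarding it themselves.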

Issues Resolved

Related to #2376

Check List

- New functionality includes testing.
  - All tests pass
- New functionality has been documented.
  - New functionality has javadoc added
- Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…sAutoDeployEnabled Signed-off-by: Xun Zhang <[email protected]>

ylwu-amzn · 2024-05-03T19:32:48Z

plugin/src/main/java/org/opensearch/ml/cluster/MLSyncUpCron.java

@@ -97,6 +99,9 @@ public void run() {
        // gather running model/tasks on nodes
        client.execute(MLSyncUpAction.INSTANCE, gatherInfoRequest, ActionListener.wrap(r -> {
            List<MLSyncUpNodeResponse> responses = r.getNodes();
+            if (r.failures() != null && r.failures().size() != 0) {
+                log.debug("Received {} failures in the sync up response on nodes", r.failures().size());


This doesn't seem very helpful if we don't log the error message.

Updated in the refresh.

ylwu-amzn · 2024-05-03T19:36:41Z

plugin/src/main/java/org/opensearch/ml/cluster/MLSyncUpCron.java

+                if (!modelsToUndeploy.isEmpty()) {
+                    // Undeploy expired models
+                    undeployExpiredModels(modelsToUndeploy, modelWorkerNodes, deployingModels);
+                    return;


Why return here? If we have 10 models and modelsToUndeploy has 4 of them, we should also call refreshModelState for the other 6, right?

refreshModelState is called in the listener of unDeployModels, after the expired models are un-deployed; see undeployExpiredModels(). So the model state will be refreshed anyway.
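
A simplified sketch of the ordering described in this reply, assuming the undeploy step takes a completion callback. The class, the `events` list, and the method signatures are illustrative stand-ins, not the actual `MLSyncUpCron` code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the control flow, not the ml-commons implementation.
final class SyncUpOrderingSketch {
    // Records the order in which steps run, for demonstration only.
    final List<String> events = new ArrayList<>();

    // Stand-in for the asynchronous undeploy call: runs the callback once undeploy completes.
    void undeployExpiredModels(List<String> expired, Runnable onComplete) {
        events.add("undeploy " + expired);
        onComplete.run();
    }

    void refreshModelState() {
        events.add("refresh");
    }

    void runOnce(List<String> modelsToUndeploy) {
        if (!modelsToUndeploy.isEmpty()) {
            // Refresh only after undeploy finishes, then return so the
            // refresh below cannot run concurrently with the undeploy.
            undeployExpiredModels(modelsToUndeploy, this::refreshModelState);
            return;
        }
        refreshModelState();
    }
}
```

Either branch ends with exactly one refresh; the early return only prevents a second refresh from starting while the undeploy is still in flight.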

ylwu-amzn · 2024-05-04T03:50:01Z

plugin/src/main/java/org/opensearch/ml/cluster/MLSyncUpCron.java

@@ -97,6 +99,9 @@ public void run() {
        // gather running model/tasks on nodes
        client.execute(MLSyncUpAction.INSTANCE, gatherInfoRequest, ActionListener.wrap(r -> {
            List<MLSyncUpNodeResponse> responses = r.getNodes();
+            if (r.failures() != null && r.failures().size() != 0) {
+                log.error("Received {} failures in the sync up response on nodes", r.failures().size());


I think you misunderstood me. I mean we should log the error message in r.failures

Added the messages to the log output.
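
As a rough illustration of what was requested here, the failure messages can be joined into a single string before logging. The helper below is a hypothetical sketch over a plain list of exceptions; it does not reproduce the code in this PR, and the separator and message format are assumptions.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch, not the ml-commons implementation.
final class FailureSummary {
    private FailureSummary() {}

    // Joins each failure's type and message so the log line shows what went wrong,
    // not only how many failures occurred.
    static String summarize(List<? extends Exception> failures) {
        if (failures == null || failures.isEmpty()) {
            return "";
        }
        return failures.stream()
            .map(e -> e.getClass().getSimpleName() + ": " + e.getMessage())
            .collect(Collectors.joining("; "));
    }
}
```

The resulting string can be passed as an extra argument to the existing log statement alongside the failure count.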

Signed-off-by: Xun Zhang <[email protected]>

#2405)
* avoid race condition in syncup model state refresh and handle NP of IsAutoDeployEnabled
  Signed-off-by: Xun Zhang <[email protected]>
* log the error message from syncUp response
  Signed-off-by: Xun Zhang <[email protected]>
* include the syncup response error messages as a string to help debug
  Signed-off-by: Xun Zhang <[email protected]>
---------
Signed-off-by: Xun Zhang <[email protected]>
(cherry picked from commit 21bf079)

#2405) (#2408)
* avoid race condition in syncup model state refresh and handle NP of IsAutoDeployEnabled
  Signed-off-by: Xun Zhang <[email protected]>
* log the error message from syncUp response
  Signed-off-by: Xun Zhang <[email protected]>
* include the syncup response error messages as a string to help debug
  Signed-off-by: Xun Zhang <[email protected]>
---------
Signed-off-by: Xun Zhang <[email protected]>
(cherry picked from commit 21bf079)
Co-authored-by: Xun Zhang <[email protected]>

#2405) (#2409)
* avoid race condition in syncup model state refresh and handle NP of IsAutoDeployEnabled
  Signed-off-by: Xun Zhang <[email protected]>
* log the error message from syncUp response
  Signed-off-by: Xun Zhang <[email protected]>
* include the syncup response error messages as a string to help debug
  Signed-off-by: Xun Zhang <[email protected]>
---------
Signed-off-by: Xun Zhang <[email protected]>
(cherry picked from commit 21bf079)
Co-authored-by: Xun Zhang <[email protected]>

Zhangxunmt requested review from b4sjoo, dhrubo-os, jngz-es, model-collapse, rbhavna, ylwu-amzn, zane-neo, austintlee, HenryL27, sam-herman and xinyual as code owners May 3, 2024 18:34

Zhangxunmt had a problem deploying to ml-commons-cicd-env May 3, 2024 18:34 — with GitHub Actions Failure

Zhangxunmt had a problem deploying to ml-commons-cicd-env May 3, 2024 18:34 — with GitHub Actions Error

Zhangxunmt added backport 2.x backport 2.14 labels May 3, 2024

Zhangxunmt had a problem deploying to ml-commons-cicd-env May 3, 2024 18:35 — with GitHub Actions Error

Zhangxunmt had a problem deploying to ml-commons-cicd-env May 3, 2024 18:35 — with GitHub Actions Failure

avoid race condition in syncup model state refresh and handle NP of I…

b19c415

…sAutoDeployEnabled Signed-off-by: Xun Zhang <[email protected]>

Zhangxunmt force-pushed the main branch from db3eead to b19c415 Compare May 3, 2024 18:58

Zhangxunmt temporarily deployed to ml-commons-cicd-env May 3, 2024 18:58 — with GitHub Actions Inactive

ylwu-amzn reviewed May 3, 2024

Zhangxunmt temporarily deployed to ml-commons-cicd-env May 3, 2024 19:53 — with GitHub Actions Inactive

ylwu-amzn reviewed May 4, 2024

include the syncup response error messages as a string to help debug

7ae48a4

Signed-off-by: Xun Zhang <[email protected]>

Zhangxunmt had a problem deploying to ml-commons-cicd-env May 4, 2024 04:56 — with GitHub Actions Error

Zhangxunmt had a problem deploying to ml-commons-cicd-env May 4, 2024 04:56 — with GitHub Actions Failure

Zhangxunmt temporarily deployed to ml-commons-cicd-env May 4, 2024 04:56 — with GitHub Actions Inactive

Zhangxunmt temporarily deployed to ml-commons-cicd-env May 4, 2024 05:51 — with GitHub Actions Inactive

ylwu-amzn approved these changes May 6, 2024

Zhangxunmt temporarily deployed to ml-commons-cicd-env May 6, 2024 16:07 — with GitHub Actions Inactive

jngz-es approved these changes May 6, 2024

Zhangxunmt merged commit 21bf079 into opensearch-project:main May 6, 2024
12 checks passed

opensearch-trigger-bot bot mentioned this pull request May 6, 2024

[Backport 2.x] avoid race condition in syncup model state refresh and handle NP of I… #2408

Merged

opensearch-trigger-bot bot mentioned this pull request May 6, 2024

[Backport 2.14] avoid race condition in syncup model state refresh and handle NP of I… #2409

Merged
