avoid race condition in syncup model state refresh and handle NP of I… #2405

Merged
merged 3 commits into from
May 6, 2024

Collaborator

@Zhangxunmt Zhangxunmt commented May 3, 2024

Description

Avoid a race condition in the sync-up job between un-deploying expired models and refreshing model states. Without this change, the sync-up job sometimes sets the wrong model state after a model is un-deployed. Also default AutoDeploy to enabled when its value is null.
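
The null handling mentioned in the description can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not taken from the ml-commons code base; the sketch only shows one way to treat a missing (null) Boolean flag as "enabled".

```java
// Hypothetical helper, not the actual ml-commons implementation.
final class AutoDeployDefaults {
    private AutoDeployDefaults() {}

    // A null flag means the setting was never configured, so fall back to "enabled".
    // Only an explicit Boolean.FALSE disables auto deploy.
    static boolean isAutoDeployEnabled(Boolean configured) {
        return configured == null || configured;
    }
}
```

With this shape, callers never unbox a null Boolean, which is the usual source of a NullPointerException for optional flags.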

Verified and tested in local 2.14 cluster.

Issues Resolved

Related to #2376

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@Zhangxunmt Zhangxunmt temporarily deployed to ml-commons-cicd-env May 3, 2024 18:58 — with GitHub Actions Inactive
@@ -97,6 +99,9 @@ public void run() {
// gather running model/tasks on nodes
client.execute(MLSyncUpAction.INSTANCE, gatherInfoRequest, ActionListener.wrap(r -> {
List<MLSyncUpNodeResponse> responses = r.getNodes();
if (r.failures() != null && r.failures().size() != 0) {
log.debug("Received {} failures in the sync up response on nodes", r.failures().size());
Collaborator

Seems not very helpful if we don't log the error message

Collaborator Author

updated in the refresh.

if (!modelsToUndeploy.isEmpty()) {
// Undeploy expired models
undeployExpiredModels(modelsToUndeploy, modelWorkerNodes, deployingModels);
return;
Collaborator

Why return here? If we have 10 models and modelsToUndeploy has 4 of them, we should also call refreshModelState for the other 6, right?

Collaborator Author

refreshModelState is called in the listener of unDeployModels after the expired models are un-deployed; see undeployExpiredModels(). So the model state will be refreshed anyway.
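
The ordering described in this reply can be sketched as follows. This is a simplified, synchronous stand-in under stated assumptions: a plain Runnable callback replaces the real ActionListener, and the class name, method signatures, and event log are hypothetical. It only illustrates why the early return avoids refreshing model state concurrently with the un-deploy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the actual ml-commons sync-up job.
final class SyncUpSketch {
    // Records the order of operations so the sequencing is observable.
    final List<String> events = new ArrayList<>();

    // Stand-in for an asynchronous un-deploy: do the work, then invoke the callback.
    void undeployExpiredModels(List<String> expired, Runnable onDone) {
        events.add("undeploy " + expired);
        onDone.run();
    }

    void refreshModelState(List<String> allModels) {
        events.add("refresh " + allModels);
    }

    void run(List<String> allModels, List<String> expired) {
        if (!expired.isEmpty()) {
            // Refresh runs only once the un-deploy has completed.
            undeployExpiredModels(expired, () -> refreshModelState(allModels));
            // Returning here prevents a second refresh racing with the un-deploy.
            return;
        }
        refreshModelState(allModels);
    }
}
```

In this sketch the refresh still covers every model, including those that were not expired; it is simply deferred until the un-deploy callback fires.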

@Zhangxunmt Zhangxunmt temporarily deployed to ml-commons-cicd-env May 3, 2024 19:53 — with GitHub Actions Inactive
@@ -97,6 +99,9 @@ public void run() {
// gather running model/tasks on nodes
client.execute(MLSyncUpAction.INSTANCE, gatherInfoRequest, ActionListener.wrap(r -> {
List<MLSyncUpNodeResponse> responses = r.getNodes();
if (r.failures() != null && r.failures().size() != 0) {
log.error("Received {} failures in the sync up response on nodes", r.failures().size());
Collaborator

I think you misunderstood me. I mean we should log the error message in r.failures

Collaborator Author

Added the messages in the logging.
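
One way to include the failure messages themselves, rather than only their count, is to join them into a single string before logging. The sketch below is an assumption about the general approach, not the code merged in this PR: it uses plain Exception objects in place of the node failure type returned by r.failures(), and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; not the actual ml-commons implementation.
final class FailureMessages {
    private FailureMessages() {}

    // Joins the message of each failure into one string suitable for a log line.
    static String summarize(List<? extends Exception> failures) {
        if (failures == null || failures.isEmpty()) {
            return "";
        }
        return failures
            .stream()
            // String.valueOf guards against exceptions created without a message.
            .map(e -> String.valueOf(e.getMessage()))
            .collect(Collectors.joining("; "));
    }
}
```

The resulting string could then be passed as an extra argument to the existing log statement alongside the failure count.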

@Zhangxunmt Zhangxunmt had a problem deploying to ml-commons-cicd-env May 4, 2024 04:56 — with GitHub Actions Error
@Zhangxunmt Zhangxunmt had a problem deploying to ml-commons-cicd-env May 4, 2024 04:56 — with GitHub Actions Failure
@Zhangxunmt Zhangxunmt temporarily deployed to ml-commons-cicd-env May 4, 2024 04:56 — with GitHub Actions Inactive
@Zhangxunmt Zhangxunmt temporarily deployed to ml-commons-cicd-env May 4, 2024 05:51 — with GitHub Actions Inactive
@Zhangxunmt Zhangxunmt temporarily deployed to ml-commons-cicd-env May 6, 2024 16:07 — with GitHub Actions Inactive
@Zhangxunmt Zhangxunmt merged commit 21bf079 into opensearch-project:main May 6, 2024
12 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request May 6, 2024
#2405)

* avoid race condition in syncup model state refresh and handle NP of IsAutoDeployEnabled

Signed-off-by: Xun Zhang <[email protected]>

* log the error message from syncUp response

Signed-off-by: Xun Zhang <[email protected]>

* include the syncup response error messages as a string to help debug

Signed-off-by: Xun Zhang <[email protected]>

---------

Signed-off-by: Xun Zhang <[email protected]>
(cherry picked from commit 21bf079)
Zhangxunmt added a commit that referenced this pull request May 6, 2024
#2405) (#2408)

* avoid race condition in syncup model state refresh and handle NP of IsAutoDeployEnabled

Signed-off-by: Xun Zhang <[email protected]>

* log the error message from syncUp response

Signed-off-by: Xun Zhang <[email protected]>

* include the syncup response error messages as a string to help debug

Signed-off-by: Xun Zhang <[email protected]>

---------

Signed-off-by: Xun Zhang <[email protected]>
(cherry picked from commit 21bf079)

Co-authored-by: Xun Zhang <[email protected]>
Zhangxunmt added a commit that referenced this pull request May 6, 2024
#2405) (#2409)

* avoid race condition in syncup model state refresh and handle NP of IsAutoDeployEnabled

Signed-off-by: Xun Zhang <[email protected]>

* log the error message from syncUp response

Signed-off-by: Xun Zhang <[email protected]>

* include the syncup response error messages as a string to help debug

Signed-off-by: Xun Zhang <[email protected]>

---------

Signed-off-by: Xun Zhang <[email protected]>
(cherry picked from commit 21bf079)

Co-authored-by: Xun Zhang <[email protected]>
dhrubo-os pushed a commit to dhrubo-os/ml-commons that referenced this pull request May 17, 2024
opensearch-project#2405) (opensearch-project#2408)

* avoid race condition in syncup model state refresh and handle NP of IsAutoDeployEnabled

Signed-off-by: Xun Zhang <[email protected]>

* log the error message from syncUp response

Signed-off-by: Xun Zhang <[email protected]>

* include the syncup response error messages as a string to help debug

Signed-off-by: Xun Zhang <[email protected]>

---------

Signed-off-by: Xun Zhang <[email protected]>
(cherry picked from commit 21bf079)

Co-authored-by: Xun Zhang <[email protected]>
3 participants