avoid race condition in syncup model state refresh and handle NP of I… #2405
…sAutoDeployEnabled Signed-off-by: Xun Zhang <[email protected]>
@@ -97,6 +99,9 @@ public void run() {
// gather running model/tasks on nodes
client.execute(MLSyncUpAction.INSTANCE, gatherInfoRequest, ActionListener.wrap(r -> {
List<MLSyncUpNodeResponse> responses = r.getNodes();
if (r.failures() != null && r.failures().size() != 0) {
log.debug("Received {} failures in the sync up response on nodes", r.failures().size());
Seems not very helpful if we don't log the error message
updated in the refresh.
if (!modelsToUndeploy.isEmpty()) {
// Undeploy expired models
undeployExpiredModels(modelsToUndeploy, modelWorkerNodes, deployingModels);
return;
Why return here? If we have 10 models and modelsToUndeploy has 4 of them, we should also call refreshModelState for the other 6, right?
refreshModelState is called in the listener of unDeployModels, after the expired models are un-deployed (see undeployExpiredModels()), so the model state will be refreshed anyway.
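For readers following the thread, the ordering described in this reply can be sketched as below. This is a minimal illustration, not the ml-commons code: the class `SyncUpSketch`, its fields, the state strings, and the use of a plain `Runnable` in place of an `ActionListener` are all assumptions made so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the sync-up job; names and states are illustrative only.
class SyncUpSketch {
    final Set<String> runningModels = new LinkedHashSet<>();
    final Map<String, String> modelStates = new LinkedHashMap<>();
    // Records the order of operations so the sequencing can be inspected.
    final List<String> events = new ArrayList<>();

    SyncUpSketch(List<String> running) {
        runningModels.addAll(running);
    }

    // Simulated undeploy: the callback runs only after every expired model is removed.
    void undeployExpiredModels(List<String> expired, Runnable onComplete) {
        for (String id : expired) {
            runningModels.remove(id);
            modelStates.put(id, "UNDEPLOYED");
            events.add("undeploy:" + id);
        }
        onComplete.run();
    }

    // Refresh covers every model that is still running, including the ones that were not expired.
    void refreshModelState() {
        for (String id : runningModels) {
            modelStates.put(id, "DEPLOYED");
        }
        events.add("refresh");
    }

    void run(List<String> modelsToUndeploy) {
        if (!modelsToUndeploy.isEmpty()) {
            // Refresh is chained to the undeploy callback, so it never sees a stale view.
            undeployExpiredModels(modelsToUndeploy, this::refreshModelState);
            return;
        }
        refreshModelState();
    }
}
```

With three running models and one expired, the expired model ends as UNDEPLOYED and the other two are still refreshed, because the refresh runs once, after the undeploy completes, rather than alongside it.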
@@ -97,6 +99,9 @@ public void run() {
// gather running model/tasks on nodes
client.execute(MLSyncUpAction.INSTANCE, gatherInfoRequest, ActionListener.wrap(r -> {
List<MLSyncUpNodeResponse> responses = r.getNodes();
if (r.failures() != null && r.failures().size() != 0) {
log.error("Received {} failures in the sync up response on nodes", r.failures().size());
I think you misunderstood me. I mean we should log the error message in r.failures
Added the messages in the logging.
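One way to include the failure messages in the log line, as requested above, is to join them into a single string. The sketch below is an assumption about the shape of such a helper, not the code that was merged: `FailureLogSketch` and `describeFailures` are hypothetical names, and a plain `List<Exception>` stands in for whatever `r.failures()` returns.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; the actual change in the PR may format failures differently.
class FailureLogSketch {
    // Joins the message of each failure so it can be passed to a single log call.
    static String describeFailures(List<Exception> failures) {
        if (failures == null || failures.isEmpty()) {
            return "";
        }
        return failures.stream()
            .map(e -> String.valueOf(e.getMessage())) // a null message becomes "null" instead of throwing
            .collect(Collectors.joining("; "));
    }
}
```

The resulting string could then be passed as a second placeholder argument to the existing log call, so the count and the messages appear on the same line.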
Signed-off-by: Xun Zhang <[email protected]>
#2405) * avoid race condition in syncup model state refresh and handle NP of IsAutoDeployEnabled Signed-off-by: Xun Zhang <[email protected]> * log the error message from syncUp response Signed-off-by: Xun Zhang <[email protected]> * include the syncup response error messages as a string to help debug Signed-off-by: Xun Zhang <[email protected]> --------- Signed-off-by: Xun Zhang <[email protected]> (cherry picked from commit 21bf079)
#2405) (#2408) * avoid race condition in syncup model state refresh and handle NP of IsAutoDeployEnabled Signed-off-by: Xun Zhang <[email protected]> * log the error message from syncUp response Signed-off-by: Xun Zhang <[email protected]> * include the syncup response error messages as a string to help debug Signed-off-by: Xun Zhang <[email protected]> --------- Signed-off-by: Xun Zhang <[email protected]> (cherry picked from commit 21bf079) Co-authored-by: Xun Zhang <[email protected]>
#2405) (#2409) * avoid race condition in syncup model state refresh and handle NP of IsAutoDeployEnabled Signed-off-by: Xun Zhang <[email protected]> * log the error message from syncUp response Signed-off-by: Xun Zhang <[email protected]> * include the syncup response error messages as a string to help debug Signed-off-by: Xun Zhang <[email protected]> --------- Signed-off-by: Xun Zhang <[email protected]> (cherry picked from commit 21bf079) Co-authored-by: Xun Zhang <[email protected]>
opensearch-project#2405) (opensearch-project#2408) * avoid race condition in syncup model state refresh and handle NP of IsAutoDeployEnabled Signed-off-by: Xun Zhang <[email protected]> * log the error message from syncUp response Signed-off-by: Xun Zhang <[email protected]> * include the syncup response error messages as a string to help debug Signed-off-by: Xun Zhang <[email protected]> --------- Signed-off-by: Xun Zhang <[email protected]> (cherry picked from commit 21bf079) Co-authored-by: Xun Zhang <[email protected]>
Description
Avoid a race condition in the sync-up job between un-deploying expired models and refreshing model states. Without this change, the sync-up job sometimes sets the wrong model state after a model is un-deployed. Also default AutoDeploy to enabled when its value is null.
Verified and tested in a local 2.14 cluster.
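The null default mentioned in the description can be illustrated with a small sketch. It assumes the flag is held as a nullable `Boolean`; the class and method names here are hypothetical and do not mirror the ml-commons source.

```java
// Hypothetical illustration of defaulting a nullable flag to enabled.
class AutoDeploySketch {
    // A missing (null) value is treated as enabled; unboxing happens only when the value is non-null.
    static boolean isAutoDeployEnabled(Boolean configured) {
        return configured == null || configured;
    }
}
```

Checking for null before unboxing is what avoids the null pointer: `null` and `Boolean.TRUE` both yield `true`, and only an explicit `Boolean.FALSE` disables auto deploy.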
Issues Resolved
Related to #2376
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.