add ML task timeout setting and clean up expired tasks from cache #662
Merged
Signed-off-by: Yaliang Wu <[email protected]>
Codecov Report
@@ Coverage Diff @@
## 2.x #662 +/- ##
============================================
+ Coverage 83.90% 83.96% +0.05%
- Complexity 987 1008 +21
============================================
Files 93 93
Lines 3597 3660 +63
Branches 327 342 +15
============================================
+ Hits 3018 3073 +55
- Misses 440 443 +3
- Partials 139 144 +5
rbhavna
previously approved these changes
Jan 6, 2023
jngz-es
reviewed
Jan 6, 2023
plugin/src/main/java/org/opensearch/ml/action/syncup/TransportSyncUpOnNodeAction.java
Signed-off-by: Yaliang Wu <[email protected]>
rbhavna
previously approved these changes
Jan 6, 2023
jngz-es
previously approved these changes
Jan 6, 2023
Signed-off-by: Yaliang Wu <[email protected]>
rbhavna
approved these changes
Jan 6, 2023
jngz-es
approved these changes
Jan 6, 2023
ylwu-amzn
added a commit
to ylwu-amzn/ml-commons
that referenced
this pull request
Feb 17, 2023
…ensearch-project#662) * add ML task timeout setting and clean up expired tasks from cache Signed-off-by: Yaliang Wu <[email protected]> * add log for corner case Signed-off-by: Yaliang Wu <[email protected]> * rollback setting name change to avoid breaking bwc Signed-off-by: Yaliang Wu <[email protected]> Signed-off-by: Yaliang Wu <[email protected]>
ylwu-amzn
added a commit
to ylwu-amzn/ml-commons
that referenced
this pull request
Mar 2, 2023
…ensearch-project#662) * add ML task timeout setting and clean up expired tasks from cache Signed-off-by: Yaliang Wu <[email protected]> * add log for corner case Signed-off-by: Yaliang Wu <[email protected]> * rollback setting name change to avoid breaking bwc Signed-off-by: Yaliang Wu <[email protected]> Signed-off-by: Yaliang Wu <[email protected]>
Merged
5 tasks
ylwu-amzn
added a commit
that referenced
this pull request
Mar 2, 2023
…asks from cache (#662) (#770) * add ML task timeout setting and clean up expired tasks from cache (#662) * add ML task timeout setting and clean up expired tasks from cache Signed-off-by: Yaliang Wu <[email protected]> * add log for corner case Signed-off-by: Yaliang Wu <[email protected]> * rollback setting name change to avoid breaking bwc Signed-off-by: Yaliang Wu <[email protected]> Signed-off-by: Yaliang Wu <[email protected]> * fix code format Signed-off-by: Yaliang Wu <[email protected]> --------- Signed-off-by: Yaliang Wu <[email protected]>
Signed-off-by: Yaliang Wu [email protected]
Description
In some edge cases, such as a worker node crashing, an ML task may stay in the cache forever. This PR fixes that by adding a task timeout setting (default: 10 minutes). The sync-up job checks whether each ML task in the cache has expired. If a task has expired, the job resets the task and model status and removes the task from the cache to avoid a memory leak.
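The expiry check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class TaskCacheSketch and its methods are hypothetical names, and the real plugin tracks more state per task. It only shows the idea of comparing a task's age against a timeout and evicting it during a periodic sync-up.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; names do not come from ml-commons.
class TaskCacheSketch {
    // taskId -> time the task was put into the cache
    private final Map<String, Instant> startTimes = new ConcurrentHashMap<>();
    private final Duration timeout;

    TaskCacheSketch(Duration timeout) {
        this.timeout = timeout;
    }

    void add(String taskId, Instant startedAt) {
        startTimes.put(taskId, startedAt);
    }

    boolean contains(String taskId) {
        return startTimes.containsKey(taskId);
    }

    // Intended to be called by a periodic sync-up job. Removes every task
    // whose age exceeds the timeout and returns the removed ids so the
    // caller can reset the matching task and model status.
    List<String> removeExpired(Instant now) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Instant>> it = startTimes.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Instant> entry = it.next();
            Duration age = Duration.between(entry.getValue(), now);
            if (age.compareTo(timeout) > 0) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the method, keeps the sketch easy to test. With a 10 minute timeout, a task cached 11 minutes before the check is evicted and a task cached 2 minutes before it is kept.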
This PR also fixes an issue with reloading a model that is already loaded on some nodes. For example, a user loads a model to node1, so the target worker nodes are [ node1 ]. After the model is loaded, the user may load it again on node2, and the target worker nodes then become [ node2 ], which is not expected. This PR checks whether the new target worker nodes include all previously loaded nodes ([ node1 ] in this example) and throws an exception if they do not. So when loading the model again, the user can pass [ node1, node2 ] to the load model API, and the target worker nodes become [ node1, node2 ]. If the user passes [ node2 ], the API throws an exception because node1 is not included.
Issues Resolved
[List any issues this PR will resolve]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
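As a rough illustration of the reload check from the Description, the sketch below rejects a load request whose target nodes leave out a node that already hosts the model. The class TargetNodeCheckSketch, the method resolveTargetNodes, and the choice of IllegalArgumentException are all assumptions made for this example; the Description only says that an exception is thrown.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch; names and exception type are assumptions.
class TargetNodeCheckSketch {
    // Returns the new target worker nodes, or throws if the request drops
    // any node on which the model is already loaded.
    static Set<String> resolveTargetNodes(Set<String> loadedNodes, Set<String> requestedNodes) {
        Set<String> missing = new LinkedHashSet<>(loadedNodes);
        missing.removeAll(requestedNodes);
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException(
                "Request must include nodes that already host the model: " + missing);
        }
        return new LinkedHashSet<>(requestedNodes);
    }
}
```

With node1 already loaded, a request for [ node1, node2 ] is accepted and becomes the new target set, while a request for [ node2 ] alone is rejected because node1 is missing.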