Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add ML task timeout setting and clean up expired tasks from cache #662

Merged
merged 3 commits into from
Jan 6, 2023

Conversation

ylwu-amzn
Copy link
Collaborator

@ylwu-amzn ylwu-amzn commented Jan 6, 2023

Signed-off-by: Yaliang Wu [email protected]

Description

For some edge case like worker node crashed, the ML task may stay in the cache forever. This PR fixed that issue by adding timeout setting (default 10 minutes). Sync up job will check ML task in cache expired or not. If expired, will reset task and model status and remove task from cache to avoid memory leak.

Also fixed one issue for reloading a model which already loaded on some node. For example, user can load model to node1, then target worker nodes is [ node1 ], after model loaded, user may load model again on node2. Then target worker node becomes [ node2 ]. This is not expected. This PR checks if the new target worker nodes include all old loaded nodes (for this example, it's [ node 1]). If not, will throw exception. So if user load model again, they can input [ node1, node2 ] in load model API, and the target worker nodes will be [ node1, node2 ]. If user input [ node2 ] in load model API, will throw exception as node1 not included.

Issues Resolved

[List any issues this PR will resolve]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@ylwu-amzn ylwu-amzn requested a review from a team January 6, 2023 00:52
@codecov-commenter
Copy link

codecov-commenter commented Jan 6, 2023

Codecov Report

Merging #662 (d895574) into 2.x (d18fd43) will increase coverage by 0.05%.
The diff coverage is 60.60%.

@@             Coverage Diff              @@
##                2.x     #662      +/-   ##
============================================
+ Coverage     83.90%   83.96%   +0.05%     
- Complexity      987     1008      +21     
============================================
  Files            93       93              
  Lines          3597     3660      +63     
  Branches        327      342      +15     
============================================
+ Hits           3018     3073      +55     
- Misses          440      443       +3     
- Partials        139      144       +5     
Flag Coverage Δ
ml-commons 83.96% <60.60%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...va/org/opensearch/ml/model/MLModelCacheHelper.java 92.06% <0.00%> (-3.02%) ⬇️
...rg/opensearch/ml/plugin/MachineLearningPlugin.java 100.00% <ø> (ø)
...earch/ml/action/load/TransportLoadModelAction.java 83.78% <11.11%> (-6.32%) ⬇️
...ain/java/org/opensearch/ml/task/MLTaskManager.java 61.92% <33.33%> (+1.71%) ⬆️
...n/java/org/opensearch/ml/model/MLModelManager.java 78.20% <60.00%> (-0.57%) ⬇️
.../ml/action/syncup/TransportSyncUpOnNodeAction.java 89.79% <81.25%> (+1.91%) ⬆️
.../org/opensearch/ml/settings/MLCommonsSettings.java 100.00% <100.00%> (ø)
... and 2 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

rbhavna
rbhavna previously approved these changes Jan 6, 2023
Signed-off-by: Yaliang Wu <[email protected]>
rbhavna
rbhavna previously approved these changes Jan 6, 2023
jngz-es
jngz-es previously approved these changes Jan 6, 2023
@ylwu-amzn ylwu-amzn dismissed stale reviews from jngz-es and rbhavna via d895574 January 6, 2023 07:14
@ylwu-amzn ylwu-amzn merged commit c220af3 into opensearch-project:2.x Jan 6, 2023
@b4sjoo b4sjoo added the enhancement New feature or request label Jan 10, 2023
ylwu-amzn added a commit to ylwu-amzn/ml-commons that referenced this pull request Feb 17, 2023
…ensearch-project#662)

* add ML task timeout setting and clean up expired tasks from cache

Signed-off-by: Yaliang Wu <[email protected]>

* add log for corner case

Signed-off-by: Yaliang Wu <[email protected]>

* rollback setting name change to avoid breaking bwc

Signed-off-by: Yaliang Wu <[email protected]>

Signed-off-by: Yaliang Wu <[email protected]>
ylwu-amzn added a commit to ylwu-amzn/ml-commons that referenced this pull request Mar 2, 2023
…ensearch-project#662)

* add ML task timeout setting and clean up expired tasks from cache

Signed-off-by: Yaliang Wu <[email protected]>

* add log for corner case

Signed-off-by: Yaliang Wu <[email protected]>

* rollback setting name change to avoid breaking bwc

Signed-off-by: Yaliang Wu <[email protected]>

Signed-off-by: Yaliang Wu <[email protected]>
ylwu-amzn added a commit that referenced this pull request Mar 2, 2023
…asks from cache (#662) (#770)

* add ML task timeout setting and clean up expired tasks from cache (#662)

* add ML task timeout setting and clean up expired tasks from cache

Signed-off-by: Yaliang Wu <[email protected]>

* add log for corner case

Signed-off-by: Yaliang Wu <[email protected]>

* rollback setting name change to avoid breaking bwc

Signed-off-by: Yaliang Wu <[email protected]>

Signed-off-by: Yaliang Wu <[email protected]>

* fix code format

Signed-off-by: Yaliang Wu <[email protected]>

---------

Signed-off-by: Yaliang Wu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants