Add TTL to un-deploy model automatically #2365

Zhangxunmt · 2024-04-26T15:50:46Z

Description

Reuse the existing cron job and transport APIs to automatically undeploy ML models based on a TTL once a model has been unused for a certain time. This is meant to pair with automatic remote model deployment and to free memory held by long-idle local models. Verified on a single-host cluster; verification on a larger cluster is still pending.

Issues Resolved

[https://github.com//issues/1343] , #1148

#2376

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

ylwu-amzn · 2024-04-27T17:41:32Z

plugin/src/main/java/org/opensearch/ml/cluster/MLSyncUpCron.java

@@ -154,6 +168,8 @@ public void run() {
                        log.error("Failed to sync model routing", ex);
                    })
                );
+            // Undeploy expired models
+            undeployExpiredModels(expiredModels.keySet(), modelWorkerNodes);


The current logic is very aggressive: if a model has expired on any node, we undeploy it. A model may have expired on one node but not yet on the other 99.

Suggestion: gather each model's last access time and TTL from every node, then add a method that checks whether the model has expired on all nodes, and undeploy it only in that case. This also leaves room for future changes. For example, if we later want to undeploy a model once it has expired on more than half of the nodes, we only need to change that method.

Returning lastAccessTime + TTL for each model and processing all of that information in the sync-up job costs too much, because we would need to construct a Map<modelId, Array of (lastAccessTime, TTL) for all nodes> and compute/compare for each entry. Without adding any new information, we can instead compare the number of nodes on which a model has expired with the number of its worker nodes. If the model has expired on all deployed nodes, we undeploy it.
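A minimal sketch of how the count-based check could still stay flexible, as the reviewer asked. The class name, the `Policy` enum, and the method signature are hypothetical and not part of this PR; only the idea of comparing the expired-node count with the worker-node count comes from the discussion above.

```java
// Hypothetical helper: decides whether a model should be undeployed based on
// how many of its worker nodes report it as expired.
final class ExpiryPolicySketch {

    // Assumed policies: the PR uses the "all nodes" rule; MAJORITY illustrates
    // the "more than half of the nodes" variant mentioned in the review.
    enum Policy { ALL_NODES, MAJORITY }

    static boolean shouldUndeploy(int expiredNodeCount, int workerNodeCount, Policy policy) {
        if (workerNodeCount <= 0) {
            // Nothing is deployed, so there is nothing to undeploy.
            return false;
        }
        switch (policy) {
            case ALL_NODES:
                return expiredNodeCount == workerNodeCount;
            case MAJORITY:
                return expiredNodeCount * 2 > workerNodeCount;
            default:
                throw new IllegalArgumentException("Unknown policy: " + policy);
        }
    }
}
```

With this shape, switching from "expired everywhere" to "expired on most nodes" is a one-argument change at the call site, and no per-node timestamps have to be shipped to the sync-up job.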

ylwu-amzn · 2024-04-27T20:33:22Z

common/src/main/java/org/opensearch/ml/common/transport/sync/MLSyncUpNodeResponse.java

@@ -38,6 +40,7 @@ public MLSyncUpNodeResponse(StreamInput in) throws IOException {
        this.deployedModelIds = in.readOptionalStringArray();
        this.runningDeployModelIds = in.readOptionalStringArray();
        this.runningDeployModelTaskIds = in.readOptionalStringArray();
+        this.expiredModelIds = in.readOptionalStringArray();


Add a BWC version check here. Ask @b4sjoo, who knows this best.
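To illustrate why the new field needs a version gate, here is a simplified, self-contained sketch. It uses plain `java.io` streams and an integer version as stand-ins for the OpenSearch stream and version classes; `BwcSketch`, `MIN_VERSION_FOR_MODEL_TTL`, and the numeric values are assumptions for illustration, not the code in this PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class BwcSketch {
    // Hypothetical wire version at which the expired-model field was added.
    static final int MIN_VERSION_FOR_MODEL_TTL = 21400;

    // Serializes the response for a peer speaking peerVersion.
    static byte[] write(int peerVersion, String[] deployedModelIds, String[] expiredModelIds) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            writeArray(out, deployedModelIds);
            // Only send the new field to peers that know how to read it.
            if (peerVersion >= MIN_VERSION_FOR_MODEL_TTL) {
                writeArray(out, expiredModelIds);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns { deployedModelIds, expiredModelIds }; the second array is empty for old peers.
    static String[][] read(int peerVersion, byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String[] deployed = readArray(in);
            // Only read the new field when the sender could have written it.
            String[] expired = peerVersion >= MIN_VERSION_FOR_MODEL_TTL ? readArray(in) : new String[0];
            return new String[][] { deployed, expired };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void writeArray(DataOutputStream out, String[] values) throws IOException {
        out.writeInt(values.length);
        for (String value : values) {
            out.writeUTF(value);
        }
    }

    private static String[] readArray(DataInputStream in) throws IOException {
        String[] values = new String[in.readInt()];
        for (int i = 0; i < values.length; i++) {
            values[i] = in.readUTF();
        }
        return values;
    }
}
```

Without the gate, a node on an older version would receive bytes it does not expect (or a newer node would try to read bytes that were never sent), which is the mixed-version failure the BWC check guards against.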

Signed-off-by: Xun Zhang <[email protected]>

* add ttl to un-deploy model automatically
* undeploy only for models that expired in all nodes
* add bwc version for model ttl
* only use minutes in ttl
* move MINIMAL_SUPPORTED_VERSION_FOR_MODEL_TTL to MLDeploySetting

Signed-off-by: Xun Zhang <[email protected]>
(cherry picked from commit e380395)

ylwu-amzn · 2024-04-30T22:40:52Z

plugin/src/main/java/org/opensearch/ml/cluster/MLSyncUpCron.java

+                if (modelWorkerNodes.containsKey(modelId)
+                    && expiredModelToNodes.get(modelId).size() == modelWorkerNodes.get(modelId).size()) {
+                    // this model has expired in all the nodes
+                    modelWorkerNodes.remove(modelId);


The cron job uses modelWorkerNodes to check whether a model has been deployed to any node (see line 349 of this class). Removing the model from the map here will cause the sync-up cron job to set the model state to deploy_failed.
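One way to avoid that side effect is to collect the models to undeploy into a separate set and leave `modelWorkerNodes` untouched. The sketch below assumes both maps have the shape `Map<String, Set<String>>` (model ID to node IDs); the class and method names are hypothetical, and this is not necessarily how the PR resolved the comment.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class UndeploySelectionSketch {

    // Returns the IDs of models that have expired on every worker node.
    // Neither input map is modified, so later checks that rely on
    // modelWorkerNodes still see the model as deployed.
    static Set<String> selectModelsToUndeploy(
        Map<String, Set<String>> expiredModelToNodes,
        Map<String, Set<String>> modelWorkerNodes
    ) {
        Set<String> modelsToUndeploy = new HashSet<>();
        for (Map.Entry<String, Set<String>> entry : expiredModelToNodes.entrySet()) {
            Set<String> workerNodes = modelWorkerNodes.get(entry.getKey());
            if (workerNodes != null && entry.getValue().size() == workerNodes.size()) {
                modelsToUndeploy.add(entry.getKey());
            }
        }
        return modelsToUndeploy;
    }
}
```

The returned set can then be handed to the undeploy call, while the deployment-state check continues to read the original map.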

ylwu-amzn · 2024-04-30T23:23:46Z

plugin/src/main/java/org/opensearch/ml/model/MLModelCacheHelper.java

+     */
+    public String[] getExpiredModels() {
+        return modelCaches.entrySet().stream().filter(entry -> {
+            MLModel mlModel = entry.getValue().getCachedModelInfo();


The cached model info is only set when a predict request comes in. See https://github.com/opensearch-project/ml-commons/pull/1472/files

For a deployed model that has not yet received a predict request, the cached model info is null. mlModel will then be null, line 356 will throw a NullPointerException, and the sync-up cron job will get no response.

Calling modelCacheHelper.setModelInfo(modelId, mlModel); when the model is deployed should fix the issue.
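In addition to setting the model info at deploy time, the filter itself can be made null-safe. The following is a minimal sketch with hypothetical stand-in types (`ModelInfo`, `CacheEntry`) rather than the real ml-commons classes; it assumes the TTL is expressed in minutes, as the commit list for this PR indicates, and that a model with no cached info or no TTL never expires.

```java
import java.util.Map;
import java.util.concurrent.TimeUnit;

final class ExpiredModelsSketch {

    // Hypothetical stand-in for the cached model metadata.
    static final class ModelInfo {
        final Long ttlMinutes; // null means no TTL configured
        ModelInfo(Long ttlMinutes) { this.ttlMinutes = ttlMinutes; }
    }

    // Hypothetical stand-in for one model cache entry.
    static final class CacheEntry {
        final ModelInfo info; // may be null if never populated
        final long lastAccessMillis;
        CacheEntry(ModelInfo info, long lastAccessMillis) {
            this.info = info;
            this.lastAccessMillis = lastAccessMillis;
        }
    }

    static String[] getExpiredModels(Map<String, CacheEntry> modelCaches, long nowMillis) {
        return modelCaches.entrySet().stream().filter(entry -> {
            ModelInfo info = entry.getValue().info;
            // Skip entries without cached info or TTL instead of dereferencing null.
            if (info == null || info.ttlMinutes == null) {
                return false;
            }
            long ttlMillis = TimeUnit.MINUTES.toMillis(info.ttlMinutes);
            return nowMillis - entry.getValue().lastAccessMillis > ttlMillis;
        }).map(Map.Entry::getKey).toArray(String[]::new);
    }
}
```

Passing the current time in as a parameter keeps the check deterministic, which makes it easier to unit test than a version that reads the clock internally.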

Zhangxunmt requested review from b4sjoo, dhrubo-os, jngz-es, model-collapse, rbhavna, ylwu-amzn, zane-neo, austintlee, HenryL27, sam-herman and xinyual as code owners April 26, 2024 15:50

Zhangxunmt force-pushed the main branch from b8777f7 to 160e24a Compare April 26, 2024 16:10

ylwu-amzn reviewed Apr 27, 2024

Zhangxunmt force-pushed the main branch from 160e24a to 3b682c9 Compare April 29, 2024 18:23

move MINIMAL_SUPPORTED_VERSION_FOR_MODEL_TTL to MLDeploySetting

8f7f6a9

Signed-off-by: Xun Zhang <[email protected]>

Zhangxunmt force-pushed the main branch from 7c1dc81 to 8f7f6a9 Compare April 30, 2024 01:36

ylwu-amzn approved these changes Apr 30, 2024

b4sjoo approved these changes Apr 30, 2024

Zhangxunmt merged commit e380395 into opensearch-project:main Apr 30, 2024
7 of 10 checks passed

opensearch-trigger-bot bot mentioned this pull request Apr 30, 2024

[Backport 2.x] Add TTL to un-deploy model automatically #2374

Merged

mingshl added v2.14.0 feature labels Apr 30, 2024

ylwu-amzn reviewed Apr 30, 2024
