Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TTL to un-deploy model automatically #2365

Merged
merged 5 commits into from
Apr 30, 2024

Conversation

Zhangxunmt
Copy link
Collaborator

@Zhangxunmt Zhangxunmt commented Apr 26, 2024

Description

Reuse the current cron job and transport APIs to implement auto-undeploy ML_Models based on TTL after any model is un-used for a certain time. This is to mate the automatic remote model deployment and save memory for long idling local models. Verified in a single host cluster. Will verify in bigger cluster.

Issues Resolved

[https://github.com//issues/1343] , #1148

#2376

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@Zhangxunmt Zhangxunmt had a problem deploying to ml-commons-cicd-env April 26, 2024 15:50 — with GitHub Actions Failure
@Zhangxunmt Zhangxunmt had a problem deploying to ml-commons-cicd-env April 26, 2024 15:51 — with GitHub Actions Failure
@Zhangxunmt Zhangxunmt temporarily deployed to ml-commons-cicd-env April 26, 2024 16:10 — with GitHub Actions Inactive
@Zhangxunmt Zhangxunmt temporarily deployed to ml-commons-cicd-env April 26, 2024 16:10 — with GitHub Actions Inactive
@Zhangxunmt Zhangxunmt temporarily deployed to ml-commons-cicd-env April 26, 2024 16:10 — with GitHub Actions Inactive
@Zhangxunmt Zhangxunmt had a problem deploying to ml-commons-cicd-env April 26, 2024 16:10 — with GitHub Actions Failure
@@ -154,6 +168,8 @@ public void run() {
log.error("Failed to sync model routing", ex);
})
);
// Undeploy expired models
undeployExpiredModels(expiredModels.keySet(), modelWorkerNodes);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current logic is very aggressive: if model expired on any node, we will undeploy it. It's possible that model expires on one node , but the other 99 nodes not yet.

Suggestion: Let's gather model last access time and TTL from each node. Then create a method to check if model expired on all nodes, we will undeploy model. This will also be flexible for future change. For example if we want to undeploy model when expired on more than half nodes, we can just change the method.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It costs too much to return lastAccessTime + TTL for each model and process all these info in the syncup Job, because we would need to construct a Map<modelId, Array of (lastAccessTime, TTL) for all nodes> and compute/compare for each entry. On the other hand, without adding new info, we can use the size of work nodes for expired model to do the same thing. If the model is expired in all deployed nodes, we will un-deploy.

@@ -38,6 +40,7 @@ public MLSyncUpNodeResponse(StreamInput in) throws IOException {
this.deployedModelIds = in.readOptionalStringArray();
this.runningDeployModelIds = in.readOptionalStringArray();
this.runningDeployModelTaskIds = in.readOptionalStringArray();
this.expiredModelIds = in.readOptionalStringArray();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add BWC version, ask @b4sjoo who knows the best

Copy link
Collaborator Author

@Zhangxunmt Zhangxunmt Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@Zhangxunmt Zhangxunmt merged commit e380395 into opensearch-project:main Apr 30, 2024
7 of 10 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Apr 30, 2024
* add ttl to un-deploy model automatically

Signed-off-by: Xun Zhang <[email protected]>

* undeploy only for models that expired in all nodes

Signed-off-by: Xun Zhang <[email protected]>

* add bwc version for model ttl

Signed-off-by: Xun Zhang <[email protected]>

* only use minutes in ttl

Signed-off-by: Xun Zhang <[email protected]>

* move MINIMAL_SUPPORTED_VERSION_FOR_MODEL_TTL to MLDeploySetting

Signed-off-by: Xun Zhang <[email protected]>

---------

Signed-off-by: Xun Zhang <[email protected]>
(cherry picked from commit e380395)
Zhangxunmt added a commit that referenced this pull request Apr 30, 2024
* add ttl to un-deploy model automatically

Signed-off-by: Xun Zhang <[email protected]>

* undeploy only for models that expired in all nodes

Signed-off-by: Xun Zhang <[email protected]>

* add bwc version for model ttl

Signed-off-by: Xun Zhang <[email protected]>

* only use minutes in ttl

Signed-off-by: Xun Zhang <[email protected]>

* move MINIMAL_SUPPORTED_VERSION_FOR_MODEL_TTL to MLDeploySetting

Signed-off-by: Xun Zhang <[email protected]>

---------

Signed-off-by: Xun Zhang <[email protected]>
(cherry picked from commit e380395)

Co-authored-by: Xun Zhang <[email protected]>
if (modelWorkerNodes.containsKey(modelId)
&& expiredModelToNodes.get(modelId).size() == modelWorkerNodes.get(modelId).size()) {
// this model has expired in all the nodes
modelWorkerNodes.remove(modelId);
Copy link
Collaborator

@ylwu-amzn ylwu-amzn Apr 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cron job use modelWorkerNodes to check if model has been deployed to any nodes (check line 349 of this class). This will cause sync up cron job set model as deploy_failed.

*/
public String[] getExpiredModels() {
return modelCaches.entrySet().stream().filter(entry -> {
MLModel mlModel = entry.getValue().getCachedModelInfo();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cached model info only set when predict request comes. Check https://github.com/opensearch-project/ml-commons/pull/1472/files

For a deployed model, if no predict request comes, the cached model info is null. Then the mlModel will be null and line 356 will throw null pointer exception, then sync up cron job will get no response.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add this line modelCacheHelper.setModelInfo(modelId, mlModel); when deploy model should fix the issue

dhrubo-os pushed a commit to dhrubo-os/ml-commons that referenced this pull request May 17, 2024
…pensearch-project#2374)

* add ttl to un-deploy model automatically

Signed-off-by: Xun Zhang <[email protected]>

* undeploy only for models that expired in all nodes

Signed-off-by: Xun Zhang <[email protected]>

* add bwc version for model ttl

Signed-off-by: Xun Zhang <[email protected]>

* only use minutes in ttl

Signed-off-by: Xun Zhang <[email protected]>

* move MINIMAL_SUPPORTED_VERSION_FOR_MODEL_TTL to MLDeploySetting

Signed-off-by: Xun Zhang <[email protected]>

---------

Signed-off-by: Xun Zhang <[email protected]>
(cherry picked from commit e380395)

Co-authored-by: Xun Zhang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants