-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add TTL to un-deploy model automatically #2365
Conversation
@@ -154,6 +168,8 @@ public void run() { | |||
log.error("Failed to sync model routing", ex); | |||
}) | |||
); | |||
// Undeploy expired models | |||
undeployExpiredModels(expiredModels.keySet(), modelWorkerNodes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current logic is very aggressive: if model expired on any node, we will undeploy it. It's possible that model expires on one node , but the other 99 nodes not yet.
Suggestion: Let's gather model last access time and TTL from each node. Then create a method to check if model expired on all nodes, we will undeploy model. This will also be flexible for future change. For example if we want to undeploy model when expired on more than half nodes, we can just change the method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It costs too much to return lastAccessTime + TTL for each model and process all these info in the syncup Job, because we would need to construct a Map<modelId, Array of (lastAccessTime, TTL) for all nodes> and compute/compare for each entry. On the other hand, without adding new info, we can use the size of work nodes for expired model to do the same thing. If the model is expired in all deployed nodes, we will un-deploy.
@@ -38,6 +40,7 @@ public MLSyncUpNodeResponse(StreamInput in) throws IOException { | |||
this.deployedModelIds = in.readOptionalStringArray(); | |||
this.runningDeployModelIds = in.readOptionalStringArray(); | |||
this.runningDeployModelTaskIds = in.readOptionalStringArray(); | |||
this.expiredModelIds = in.readOptionalStringArray(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add BWC version, ask @b4sjoo who knows the best
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
Signed-off-by: Xun Zhang <[email protected]>
* add ttl to un-deploy model automatically Signed-off-by: Xun Zhang <[email protected]> * undeploy only for models that expired in all nodes Signed-off-by: Xun Zhang <[email protected]> * add bwc version for model ttl Signed-off-by: Xun Zhang <[email protected]> * only use minutes in ttl Signed-off-by: Xun Zhang <[email protected]> * move MINIMAL_SUPPORTED_VERSION_FOR_MODEL_TTL to MLDeploySetting Signed-off-by: Xun Zhang <[email protected]> --------- Signed-off-by: Xun Zhang <[email protected]> (cherry picked from commit e380395)
* add ttl to un-deploy model automatically Signed-off-by: Xun Zhang <[email protected]> * undeploy only for models that expired in all nodes Signed-off-by: Xun Zhang <[email protected]> * add bwc version for model ttl Signed-off-by: Xun Zhang <[email protected]> * only use minutes in ttl Signed-off-by: Xun Zhang <[email protected]> * move MINIMAL_SUPPORTED_VERSION_FOR_MODEL_TTL to MLDeploySetting Signed-off-by: Xun Zhang <[email protected]> --------- Signed-off-by: Xun Zhang <[email protected]> (cherry picked from commit e380395) Co-authored-by: Xun Zhang <[email protected]>
if (modelWorkerNodes.containsKey(modelId) | ||
&& expiredModelToNodes.get(modelId).size() == modelWorkerNodes.get(modelId).size()) { | ||
// this model has expired in all the nodes | ||
modelWorkerNodes.remove(modelId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cron job use modelWorkerNodes to check if model has been deployed to any nodes (check line 349 of this class). This will cause sync up cron job set model as deploy_failed.
*/ | ||
public String[] getExpiredModels() { | ||
return modelCaches.entrySet().stream().filter(entry -> { | ||
MLModel mlModel = entry.getValue().getCachedModelInfo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cached model info only set when predict request comes. Check https://github.com/opensearch-project/ml-commons/pull/1472/files
For a deployed model, if no predict request comes, the cached model info is null. Then the mlModel
will be null and line 356 will throw null pointer exception, then sync up cron job will get no response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add this line modelCacheHelper.setModelInfo(modelId, mlModel);
when deploy model should fix the issue
…pensearch-project#2374) * add ttl to un-deploy model automatically Signed-off-by: Xun Zhang <[email protected]> * undeploy only for models that expired in all nodes Signed-off-by: Xun Zhang <[email protected]> * add bwc version for model ttl Signed-off-by: Xun Zhang <[email protected]> * only use minutes in ttl Signed-off-by: Xun Zhang <[email protected]> * move MINIMAL_SUPPORTED_VERSION_FOR_MODEL_TTL to MLDeploySetting Signed-off-by: Xun Zhang <[email protected]> --------- Signed-off-by: Xun Zhang <[email protected]> (cherry picked from commit e380395) Co-authored-by: Xun Zhang <[email protected]>
Description
Reuse the current cron job and transport APIs to implement auto-undeploy ML_Models based on TTL after any model is un-used for a certain time. This is to mate the automatic remote model deployment and save memory for long idling local models. Verified in a single host cluster. Will verify in bigger cluster.
Issues Resolved
[https://github.com//issues/1343] , #1148
#2376
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.