[BUG] Intermittent Model Deployment Issues in AWS OpenSearch: "Model not ready yet" Error #2050
Comments
I'll move this to the ml plugin. Did you get a chance to open a ticket with AWS yet?
@dblock , thanks.
@ulan-yisaev , thanks for reporting this issue.
Have you enabled the auto redeploy setting? https://opensearch.org/docs/latest/ml-commons-plugin/cluster-settings/#enable-auto-redeploy There is one known issue: model auto redeploy does not work after a cluster restart. That is fixed in PR #1627, and the fix will be in 2.12.
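For reference, the setting mentioned above can be turned on through the cluster settings API. A minimal sketch that only builds the request: the key `plugins.ml_commons.model_auto_redeploy.enable` is the one listed on the linked cluster-settings page, while the helper name is hypothetical. Sending the request and authenticating against an AWS domain are left out.

```python
import json


def auto_redeploy_settings_request(enable=True):
    """Build (method, path, body) for toggling ML Commons model auto redeploy.

    Hypothetical helper: it only constructs the request, it does not send it.
    """
    body = {
        "persistent": {
            # Setting key as listed on the ML Commons cluster-settings page.
            "plugins.ml_commons.model_auto_redeploy.enable": enable
        }
    }
    return "PUT", "/_cluster/settings", json.dumps(body)


method, path, body = auto_redeploy_settings_request()
print(method, path)
print(body)
```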
Hello @ylwu-amzn! Thank you for your response and the information provided. I have indeed enabled the auto redeploy setting. However, after some time, I can confirm that this setting does not seem to resolve the issue. The models again disappeared from the deployment, and I had to redeploy them manually. It appears that the auto redeploy feature is not functioning as expected, at least in my setup. I appreciate the reference to PR #1627 and look forward to the fix in version 2.12. In the meantime, if there are any other suggestions or workarounds to mitigate this issue, I would be grateful to hear them.
One workaround is to use a Lambda function that regularly checks the model status and redeploys the model if it has been undeployed. The Deploy API is idempotent, so you can also just call it on a schedule. We built a sample Lambda function that you can try. Follow the readme in the zip.
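The check-and-redeploy idea could look roughly like the sketch below. It is not the sample from the zip: the function names are hypothetical, authentication is omitted (an AWS domain would need SigV4 signing or basic auth), and the profile response shape (`nodes` → node id → `models` → model id → `model_state`) is an assumption that should be checked against your cluster's actual output.

```python
import json
import urllib.request


def http_json(base_url):
    """Return a request(method, path) callable that sends a request and parses JSON.

    Assumption: no authentication is needed; real AWS domains usually require it.
    """
    def request(method, path):
        req = urllib.request.Request(base_url + path, method=method)
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))
    return request


def is_deployed(profile, model_id):
    """True if any node in the profile response reports the model as DEPLOYED.

    Assumed shape: {"nodes": {node_id: {"models": {model_id: {"model_state": ...}}}}}
    """
    for node in profile.get("nodes", {}).values():
        state = node.get("models", {}).get(model_id, {}).get("model_state")
        if state == "DEPLOYED":
            return True
    return False


def ensure_deployed(request, model_id):
    """Send a deploy request if the profile API does not list the model as DEPLOYED.

    Returns True if a deploy request was sent, False if nothing was needed.
    """
    profile = request("GET", "/_plugins/_ml/profile/models")
    if is_deployed(profile, model_id):
        return False
    request("POST", f"/_plugins/_ml/models/{model_id}/_deploy")
    return True
```

A scheduled job (a Lambda handler, cron, etc.) would call something like `ensure_deployed(http_json(host), "WgtRZY0BjrbwPvlSkd2A")` at a fixed interval. Passing `request` in as a parameter keeps the logic testable without a live cluster.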
Thank you for the workaround suggestion and for providing the sample Lambda function. I appreciate your support.
@Zhangxunmt has a proposal, #1148, to automatically deploy a model when a predict request arrives and the model is not deployed. He has a PoC ready. I think that approach could solve your problem, but feel free to come up with different options and discuss.
@ulan-yisaev I see you used a remote model in this case, so I think the auto-deploy with TTL strategy will work for you. Auto-deploy of local/custom models is not in scope yet, since it would block the "Predict" process and add more confusion. In short, manual deployment of local/custom models is still required, but remote models will be included in the auto-deploy feature with TTL.
@ulan-yisaev Remote model auto deploy was released in 2.13, and auto model undeploy with TTL was released in 2.14. The problem described in this issue should no longer occur. Closing now. Feel free to reopen.
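For clusters without that feature, the same deploy-on-predict idea can be approximated on the client side. The sketch below is not the plugin's implementation: `predict` and `deploy` are hypothetical callables supplied by the caller, and the check relies only on the "Model not ready yet" text from the error reported in this issue. Deployment may complete asynchronously, so a real version would probably wait for the model to reach DEPLOYED before retrying.

```python
def predict_with_deploy_retry(predict, deploy, retries=1):
    """Call predict(); on a 'Model not ready yet' error, call deploy() and retry.

    predict and deploy are caller-supplied callables (hypothetical names).
    Any other error, or a 'not ready' error after the last retry, is re-raised.
    """
    for attempt in range(retries + 1):
        try:
            return predict()
        except Exception as exc:
            not_ready = "Model not ready yet" in str(exc)
            if not not_ready or attempt == retries:
                raise
            # Assumption: the model is usable by the time deploy() returns.
            deploy()
```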
Describe the bug
Issue Description:
After successfully deploying a machine learning model in AWS OpenSearch using a POST request to /_plugins/_ml/models/WgtRZY0BjrbwPvlSkd2A/_deploy, the model initially shows as deployed when checked with GET {{opensearch_host}}/_plugins/_ml/profile/models. However, after a certain period, possibly following cluster restarts, the model becomes unavailable, leading to an error:
RequestError(400, 'illegal_argument_exception', 'Model not ready yet. Please run this first: POST /_plugins/_ml/models/WgtRZY0BjrbwPvlSkd2A/_deploy')
Additionally, the model profile then shows empty results, indicating the model is no longer deployed.
Related component
Plugins
To Reproduce
1. Deploy the model with POST {{opensearch_host}}/_plugins/_ml/models/WgtRZY0BjrbwPvlSkd2A/_deploy.
2. Check GET {{opensearch_host}}/_plugins/_ml/profile/models, which shows the model state as "DEPLOYED".
3. After some time, get RequestError(400, 'illegal_argument_exception', 'Model not ready yet...') when attempting to use the model.
4. Check GET {{opensearch_host}}/_plugins/_ml/profile/models again, which now returns empty results for the model.
Expected behavior
Once a model is deployed, it should remain available for use without needing re-deployment, especially after cluster restarts. The model's status should consistently reflect its actual state, and any errors related to deployment should be clearly explained or resolved automatically by the system.
Additional context
This issue seems to occur intermittently and might be linked to cluster restarts or other changes in the AWS OpenSearch environment. It impacts the reliability of machine learning model deployments in production settings.