
[BUG] Intermittent Model Deployment Issues in AWS OpenSearch: "Model not ready yet" Error #2050

Closed
ulan-yisaev opened this issue Feb 7, 2024 · 9 comments

@ulan-yisaev (Contributor)

Describe the bug

After successfully deploying a machine learning model in AWS OpenSearch with POST {{opensearch_host}}/_plugins/_ml/models/WgtRZY0BjrbwPvlSkd2A/_deploy, the model initially shows as deployed when checked with GET {{opensearch_host}}/_plugins/_ml/profile/models:

{
    "nodes": {
        "eU2R8fKxSMazHB6HlFbpNg": {
            "models": {
                "WgtRZY0BjrbwPvlSkd2A": {
                    "model_state": "DEPLOYED",
                    "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@405c0c42",
                    "target_worker_nodes": [
                        "GlBdA3joT4ebA2Wj5yd_8Q",
                        "eU2R8fKxSMazHB6HlFbpNg"
                    ],
                    "worker_nodes": [
                        "GlBdA3joT4ebA2Wj5yd_8Q",
                        "eU2R8fKxSMazHB6HlFbpNg"
                    ]
                }
            }
        },
        "GlBdA3joT4ebA2Wj5yd_8Q": {
            "models": {
                "WgtRZY0BjrbwPvlSkd2A": {
                    "model_state": "DEPLOYED",
                    "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@85b1387",
                    "target_worker_nodes": [
                        "GlBdA3joT4ebA2Wj5yd_8Q",
                        "eU2R8fKxSMazHB6HlFbpNg"
                    ],
                    "worker_nodes": [
                        "GlBdA3joT4ebA2Wj5yd_8Q",
                        "eU2R8fKxSMazHB6HlFbpNg"
                    ]
                }
            }
        }
    }
}

However, after a certain period, possibly following cluster restarts, the model becomes unavailable, leading to an error: RequestError(400, 'illegal_argument_exception', 'Model not ready yet. Please run this first: POST /_plugins/_ml/models/WgtRZY0BjrbwPvlSkd2A/_deploy'). Additionally, the model profile then shows empty results, indicating the model is no longer deployed.

Related component

Plugins

To Reproduce

  1. Deploy the model using the POST request: POST {{opensearch_host}}/_plugins/_ml/models/WgtRZY0BjrbwPvlSkd2A/_deploy.
  2. Verify successful deployment with GET {{opensearch_host}}/_plugins/_ml/profile/models, which shows the model state as "DEPLOYED".
  3. Wait for some time or until after a potential cluster restart.
  4. Observe the error RequestError(400, 'illegal_argument_exception', 'Model not ready yet...') when attempting to use the model.
  5. Check the model profile again with GET {{opensearch_host}}/_plugins/_ml/profile/models, which now returns empty results for the model (a scripted version of these steps is sketched below).
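
For reference, a rough Python sketch of steps 1–2 and 4–5 (host, credentials, and auth scheme are placeholders, not our actual setup; the model ID is the one from this report):

import requests

HOST = "https://localhost:9200"    # placeholder; adjust for your cluster
AUTH = ("admin", "admin")          # placeholder credentials
MODEL_ID = "WgtRZY0BjrbwPvlSkd2A"

# Step 1: deploy the model.
r = requests.post(f"{HOST}/_plugins/_ml/models/{MODEL_ID}/_deploy", auth=AUTH, verify=False)
print("deploy:", r.status_code, r.json())

# Steps 2 and 5: check the model profile; after a restart this may come back empty.
r = requests.get(f"{HOST}/_plugins/_ml/profile/models", auth=AUTH, verify=False)
print("profile:", r.json())

# Step 4: a predict call (request body depends on the connector) then fails with
# "Model not ready yet" once the model has silently become undeployed.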

Expected behavior

Once a model is deployed, it should remain available for use without needing re-deployment, especially after cluster restarts. The model's status should consistently reflect its actual state, and any errors related to deployment should be clearly explained or resolved automatically by the system.

Additional Details

Plugins

"plugins": {
            "ml_commons": {
                "only_run_on_ml_node": "false",
                "trusted_connector_endpoints_regex": [
                    "^https://.*\\.openai\\.azure\\.com/.*$"
                ],
                "model_access_control_enabled": "true",
                "native_memory_threshold": "99"
            },
            "index_state_management": {
                "template_migration": {
                    "control": "-1"
                }
            },
            "query": {
                "executionengine": {}
            }
        },


Host/Environment:

  • OS: Ubuntu Linux
  • Version: latest

Additional context
This issue seems to occur intermittently and might be linked to cluster restarts or other changes in the AWS OpenSearch environment. It impacts the reliability of machine learning model deployments in production settings.

@ulan-yisaev added the bug (Something isn't working) and untriaged labels on Feb 7, 2024
@dblock (Member) commented Feb 7, 2024

I'll move this to the ml plugin. Did you get a chance to open a ticket with AWS yet?

@dblock transferred this issue from opensearch-project/OpenSearch on Feb 7, 2024
@ulan-yisaev (Contributor, Author) commented Feb 8, 2024

@dblock, thanks.
Regarding your question about opening a ticket with AWS: I haven't done so yet. Before I proceed with that, I wanted to mention that this issue is not exclusive to AWS OpenSearch. I have encountered the same problem on my local OpenSearch instance running in Docker, especially after restarting the container. This similarity suggests that the issue might be inherent to the OpenSearch ML plugin rather than specific to the AWS environment.

@ylwu-amzn (Collaborator)

@ulan-yisaev, thanks for reporting this issue.

However, after a certain period, possibly following cluster restarts, the model becomes unavailable

Have you enabled the auto-redeploy setting? https://opensearch.org/docs/latest/ml-commons-plugin/cluster-settings/#enable-auto-redeploy
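
In case it helps, enabling it with the Python requests library looks roughly like this (a sketch; the setting names are taken from the linked docs page, and host/credentials are placeholders):

import requests

r = requests.put(
    "https://localhost:9200/_cluster/settings",
    auth=("admin", "admin"),
    verify=False,
    json={
        "persistent": {
            "plugins.ml_commons.model_auto_redeploy.enable": True,
            # optional: how many lifetime retries auto-redeploy is allowed
            "plugins.ml_commons.model_auto_redeploy.lifetime_retry_times": 3
        }
    },
)
print(r.status_code, r.json())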

There is a known issue where model auto-redeploy does not work after a cluster restart; it was fixed in PR #1627 and the fix will be included in 2.12.

@ulan-yisaev (Contributor, Author)

Hello @ylwu-amzn!

Thank you for your response and the information provided.

I have indeed enabled the auto-redeploy setting, but I can confirm that it does not resolve the issue: after some time the models disappeared from deployment again and I had to redeploy them manually. It appears that the auto-redeploy feature is not functioning as expected, at least in my setup.

I appreciate the reference to PR #1627 and look forward to the fix in version 2.12. In the meantime, if there are any other suggestions or workarounds to mitigate this issue, I would be grateful to hear them.

@ylwu-amzn (Collaborator) commented Feb 9, 2024

One workaround is to use a Lambda function that regularly checks the model status and redeploys the model if it is undeployed. The deploy API is idempotent, so you can also just run it on a schedule.

We built a sample Lambda function that you can try; follow the README in the zip.
auto-redeploy-model-lambda-function.zip
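
For anyone who prefers to roll their own, the core check-and-redeploy logic is roughly the following (a sketch, not the attached implementation; host, credentials, and the model ID list are placeholders):

import requests

HOST = "https://localhost:9200"        # placeholder endpoint
AUTH = ("admin", "admin")              # placeholder credentials
MODEL_IDS = ["WgtRZY0BjrbwPvlSkd2A"]   # models that should stay deployed

def handler(event=None, context=None):
    """Run on a schedule (e.g. EventBridge) and redeploy any model that is not deployed."""
    profile = requests.get(f"{HOST}/_plugins/_ml/profile/models", auth=AUTH, verify=False).json()
    deployed = {
        model_id
        for node in profile.get("nodes", {}).values()
        for model_id, info in node.get("models", {}).items()
        if info.get("model_state") == "DEPLOYED"
    }
    for model_id in MODEL_IDS:
        if model_id not in deployed:
            # The deploy API is idempotent, so calling it again is safe.
            r = requests.post(f"{HOST}/_plugins/_ml/models/{model_id}/_deploy", auth=AUTH, verify=False)
            print(f"redeploying {model_id}: {r.status_code}")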

@ulan-yisaev (Contributor, Author)

Thank you for the workaround suggestion and for providing the sample Lambda function. I appreciate your support.
I will try implementing this logic in my environment. Additionally, I'm considering adding exception handling that automatically redeploys the model when the above exception occurs.
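
Roughly what I have in mind, as a sketch with the opensearch-py client (client construction, credentials, and the single naive retry are assumptions, not a final implementation):

import time
from opensearchpy import OpenSearch
from opensearchpy.exceptions import RequestError

client = OpenSearch(hosts=["https://localhost:9200"], http_auth=("admin", "admin"), verify_certs=False)
MODEL_ID = "WgtRZY0BjrbwPvlSkd2A"

def predict(body):
    try:
        return client.transport.perform_request(
            "POST", f"/_plugins/_ml/models/{MODEL_ID}/_predict", body=body)
    except RequestError as e:
        if "Model not ready yet" not in str(e):
            raise
        # Redeploy and retry once; the deploy API is idempotent.
        client.transport.perform_request("POST", f"/_plugins/_ml/models/{MODEL_ID}/_deploy")
        time.sleep(5)  # crude wait for the deployment to finish
        return client.transport.perform_request(
            "POST", f"/_plugins/_ml/models/{MODEL_ID}/_predict", body=body)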

@ylwu-amzn (Collaborator)

@Zhangxunmt has a proposal (#1148) to automatically deploy a model when a predict request arrives and the model is not deployed. He has a PoC ready.

I think that approach could solve your problem, but feel free to come up with different options and discuss them.

@b4sjoo removed the untriaged label on Feb 13, 2024
@b4sjoo moved this to In Progress in ml-commons projects on Feb 13, 2024
@Zhangxunmt (Collaborator)

@ulan-yisaev I see you are using a remote model in this case, so I think the auto-deploy-with-TTL strategy will work for you. Auto-deploy of local/custom models is not in scope yet, since that would block the Predict process and add more confusion. In short, manual deployment of local/custom models is still required, but remote models will be included in the auto-deploy feature with TTL.

@Zhangxunmt (Collaborator)

@ulan-yisaev Remote model auto-deploy was released in 2.13, and auto model undeploy with TTL was released in 2.14. This issue should no longer occur, so I'm closing it now. Feel free to reopen.
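
For readers on 2.13 or later, the related cluster setting can be checked like this (a sketch with the Python requests library; the setting name is my reading of the ML Commons cluster settings docs, so please verify it for your version, and host/credentials are placeholders):

import requests

r = requests.get(
    "https://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true",
    auth=("admin", "admin"),
    verify=False,
)
settings = r.json()
value = (settings.get("persistent", {}).get("plugins.ml_commons.model_auto_deploy.enable")
         or settings.get("defaults", {}).get("plugins.ml_commons.model_auto_deploy.enable"))
print("model_auto_deploy.enable:", value)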
