
[BUG] Intermittent Model Deployment Issues in AWS OpenSearch: "Model not ready yet" Error #2050

Closed
ulan-yisaev opened this issue Feb 7, 2024 · 9 comments

@ulan-yisaev (Contributor)

Describe the bug

After successfully deploying a machine learning model in AWS OpenSearch with POST {{opensearch_host}}/_plugins/_ml/models/WgtRZY0BjrbwPvlSkd2A/_deploy, the model initially shows as deployed when checked with GET {{opensearch_host}}/_plugins/_ml/profile/models:

{
    "nodes": {
        "eU2R8fKxSMazHB6HlFbpNg": {
            "models": {
                "WgtRZY0BjrbwPvlSkd2A": {
                    "model_state": "DEPLOYED",
                    "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@405c0c42",
                    "target_worker_nodes": [
                        "GlBdA3joT4ebA2Wj5yd_8Q",
                        "eU2R8fKxSMazHB6HlFbpNg"
                    ],
                    "worker_nodes": [
                        "GlBdA3joT4ebA2Wj5yd_8Q",
                        "eU2R8fKxSMazHB6HlFbpNg"
                    ]
                }
            }
        },
        "GlBdA3joT4ebA2Wj5yd_8Q": {
            "models": {
                "WgtRZY0BjrbwPvlSkd2A": {
                    "model_state": "DEPLOYED",
                    "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@85b1387",
                    "target_worker_nodes": [
                        "GlBdA3joT4ebA2Wj5yd_8Q",
                        "eU2R8fKxSMazHB6HlFbpNg"
                    ],
                    "worker_nodes": [
                        "GlBdA3joT4ebA2Wj5yd_8Q",
                        "eU2R8fKxSMazHB6HlFbpNg"
                    ]
                }
            }
        }
    }
}

However, after a certain period, possibly following cluster restarts, the model becomes unavailable, leading to an error: RequestError(400, 'illegal_argument_exception', 'Model not ready yet. Please run this first: POST /_plugins/_ml/models/WgtRZY0BjrbwPvlSkd2A/_deploy'). Additionally, the model profile then shows empty results, indicating the model is no longer deployed.

Related component

Plugins

To Reproduce

  1. Deploy the model using the POST request: POST {{opensearch_host}}/_plugins/_ml/models/WgtRZY0BjrbwPvlSkd2A/_deploy.
  2. Verify successful deployment with GET {{opensearch_host}}/_plugins/_ml/profile/models, which shows the model state as "DEPLOYED".
  3. Wait for some time or until after a potential cluster restart.
  4. Observe the error RequestError(400, 'illegal_argument_exception', 'Model not ready yet...') when attempting to use the model.
  5. Check the model profile again with GET {{opensearch_host}}/_plugins/_ml/profile/models, which now returns empty results for the model (a scripted version of these steps is sketched below).
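
For reference, a rough Python sketch of steps 1–2 and 4–5 (host, credentials, and auth scheme are placeholders, not our actual setup; the model ID is the one from this report):

import requests

HOST = "https://localhost:9200"    # placeholder; adjust for your cluster
AUTH = ("admin", "admin")          # placeholder credentials
MODEL_ID = "WgtRZY0BjrbwPvlSkd2A"

# Step 1: deploy the model.
r = requests.post(f"{HOST}/_plugins/_ml/models/{MODEL_ID}/_deploy", auth=AUTH, verify=False)
print("deploy:", r.status_code, r.json())

# Steps 2 and 5: check the model profile; after a restart this may come back empty.
r = requests.get(f"{HOST}/_plugins/_ml/profile/models", auth=AUTH, verify=False)
print("profile:", r.json())

# Step 4: a predict call (request body depends on the connector) then fails with
# "Model not ready yet" once the model has silently become undeployed.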

Expected behavior

Once a model is deployed, it should remain available for use without needing re-deployment, especially after cluster restarts. The model's status should consistently reflect its actual state, and any errors related to deployment should be clearly explained or resolved automatically by the system.

Additional Details

Plugins

"plugins": {
            "ml_commons": {
                "only_run_on_ml_node": "false",
                "trusted_connector_endpoints_regex": [
                    "^https://.*\\.openai\\.azure\\.com/.*$"
                ],
                "model_access_control_enabled": "true",
                "native_memory_threshold": "99"
            },
            "index_state_management": {
                "template_migration": {
                    "control": "-1"
                }
            },
            "query": {
                "executionengine": {}
            }
        },


Host/Environment:

  • OS: Ubuntu Linux
  • Version: latest

Additional context
This issue seems to occur intermittently and might be linked to cluster restarts or other changes in the AWS OpenSearch environment. It impacts the reliability of machine learning model deployments in production settings.

@ulan-yisaev added the bug (Something isn't working) and untriaged labels on Feb 7, 2024
@dblock (Member) commented Feb 7, 2024

I'll move this to the ml plugin. Did you get a chance to open a ticket with AWS yet?

@dblock transferred this issue from opensearch-project/OpenSearch on Feb 7, 2024
@ulan-yisaev (Contributor, Author) commented Feb 8, 2024

@dblock, thanks.
Regarding your question about opening a ticket with AWS: I haven't done so yet. Before I proceed with that, I wanted to mention that this issue is not exclusive to AWS OpenSearch. I have encountered the same problem on my local OpenSearch instance running in Docker, especially after restarting the container. This similarity suggests that the issue might be inherent to the OpenSearch ML plugin rather than specific to the AWS environment.

@ylwu-amzn (Collaborator)

@ulan-yisaev, thanks for reporting this issue.

However, after a certain period, possibly following cluster restarts, the model becomes unavailable

Have you enabled the auto-redeploy setting? https://opensearch.org/docs/latest/ml-commons-plugin/cluster-settings/#enable-auto-redeploy
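
In case it helps, enabling it with the Python requests library looks roughly like this (a sketch; the setting names are taken from the linked docs page, and host/credentials are placeholders):

import requests

r = requests.put(
    "https://localhost:9200/_cluster/settings",
    auth=("admin", "admin"),
    verify=False,
    json={
        "persistent": {
            "plugins.ml_commons.model_auto_redeploy.enable": True,
            # optional: how many lifetime retries auto-redeploy is allowed
            "plugins.ml_commons.model_auto_redeploy.lifetime_retry_times": 3
        }
    },
)
print(r.status_code, r.json())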

There is a known issue where model auto-redeploy does not work after a cluster restart; it was fixed in PR #1627 and the fix will be included in 2.12.

@ulan-yisaev (Contributor, Author)

Hello @ylwu-amzn!

Thank you for your response and the information provided.

I have indeed enabled the auto-redeploy setting, but I can confirm that it does not resolve the issue: after some time the models disappeared from deployment again and I had to redeploy them manually. It appears that the auto-redeploy feature is not functioning as expected, at least in my setup.

I appreciate the reference to PR #1627 and look forward to the fix in version 2.12. In the meantime, if there are any other suggestions or workarounds to mitigate this issue, I would be grateful to hear them.

@ylwu-amzn (Collaborator) commented Feb 9, 2024

One workaround is to use a Lambda function that regularly checks the model status and redeploys the model if it is undeployed. The deploy API is idempotent, so you can also just run it on a schedule.

We built a sample Lambda function that you can try; follow the README in the zip.
auto-redeploy-model-lambda-function.zip
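
For anyone who prefers to roll their own, the core check-and-redeploy logic is roughly the following (a sketch, not the attached implementation; host, credentials, and the model ID list are placeholders):

import requests

HOST = "https://localhost:9200"        # placeholder endpoint
AUTH = ("admin", "admin")              # placeholder credentials
MODEL_IDS = ["WgtRZY0BjrbwPvlSkd2A"]   # models that should stay deployed

def handler(event=None, context=None):
    """Run on a schedule (e.g. EventBridge) and redeploy any model that is not deployed."""
    profile = requests.get(f"{HOST}/_plugins/_ml/profile/models", auth=AUTH, verify=False).json()
    deployed = {
        model_id
        for node in profile.get("nodes", {}).values()
        for model_id, info in node.get("models", {}).items()
        if info.get("model_state") == "DEPLOYED"
    }
    for model_id in MODEL_IDS:
        if model_id not in deployed:
            # The deploy API is idempotent, so calling it again is safe.
            r = requests.post(f"{HOST}/_plugins/_ml/models/{model_id}/_deploy", auth=AUTH, verify=False)
            print(f"redeploying {model_id}: {r.status_code}")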

@ulan-yisaev (Contributor, Author)

Thank you for the workaround suggestion and for providing the sample Lambda function. I appreciate your support.
I will try implementing this logic in my environment. Additionally, I'm considering adding exception handling that automatically redeploys the model when the above exception occurs.
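
Roughly what I have in mind, as a sketch with the opensearch-py client (client construction, credentials, and the single naive retry are assumptions, not a final implementation):

import time
from opensearchpy import OpenSearch
from opensearchpy.exceptions import RequestError

client = OpenSearch(hosts=["https://localhost:9200"], http_auth=("admin", "admin"), verify_certs=False)
MODEL_ID = "WgtRZY0BjrbwPvlSkd2A"

def predict(body):
    try:
        return client.transport.perform_request(
            "POST", f"/_plugins/_ml/models/{MODEL_ID}/_predict", body=body)
    except RequestError as e:
        if "Model not ready yet" not in str(e):
            raise
        # Redeploy and retry once; the deploy API is idempotent.
        client.transport.perform_request("POST", f"/_plugins/_ml/models/{MODEL_ID}/_deploy")
        time.sleep(5)  # crude wait for the deployment to finish
        return client.transport.perform_request(
            "POST", f"/_plugins/_ml/models/{MODEL_ID}/_predict", body=body)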

@ylwu-amzn (Collaborator)

@Zhangxunmt has a proposal (#1148) to automatically deploy a model when a predict request arrives and the model is not deployed. He has a PoC ready.

I think that approach could solve your problem, but feel free to come up with different options and discuss them.

@b4sjoo removed the untriaged label on Feb 13, 2024
@b4sjoo moved this to In Progress in ml-commons projects on Feb 13, 2024
@Zhangxunmt (Collaborator)

@ulan-yisaev I see you are using a remote model in this case, so I think the auto-deploy-with-TTL strategy will work for you. Auto-deploy of local/custom models is not in scope yet, since that would block the Predict process and add more confusion. In short, manual deployment of local/custom models is still required, but remote models will be included in the auto-deploy feature with TTL.

@Zhangxunmt (Collaborator)

@ulan-yisaev Remote model auto-deploy was released in 2.13, and auto model undeploy with TTL was released in 2.14. This issue should no longer occur, so I'm closing it now. Feel free to reopen.
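
For readers on 2.13 or later, the related cluster setting can be checked like this (a sketch with the Python requests library; the setting name is my reading of the ML Commons cluster settings docs, so please verify it for your version, and host/credentials are placeholders):

import requests

r = requests.get(
    "https://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true",
    auth=("admin", "admin"),
    verify=False,
)
settings = r.json()
value = (settings.get("persistent", {}).get("plugins.ml_commons.model_auto_deploy.enable")
         or settings.get("defaults", {}).get("plugins.ml_commons.model_auto_deploy.enable"))
print("model_auto_deploy.enable:", value)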
