
[BUG] Model content hash can't match original hash value #844

Closed
dhrubo-os opened this issue Apr 4, 2023 · 23 comments
Labels
bug Something isn't working

Comments

@dhrubo-os
Collaborator

What is the bug?
Model content hash can't match original hash value

How can one reproduce the bug?

I tried with the current code base. I executed the command ./gradlew run to test.

  1. First I upload the model
POST /_plugins/_ml/models/_upload
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}
  2. Then I load the model in memory. This time it works fine: http://localhost:9200/_plugins/_ml/models/l-bnSYcBQg5VYC5uIxWy/_load

  3. I also generated embeddings with the loaded model, and that works fine.

  4. Now I unload the model: http://localhost:9200/_plugins/_ml/models/l-bnSYcBQg5VYC5uIxWy/_unload

  5. And now I try to deploy again. This time I face the error:

[2023-04-03T18:42:42,801][ERROR][o.o.m.m.MLModelManager   ] [integTest-0] Model content hash can't match original hash value
[2023-04-03T18:42:42,804][ERROR][o.o.m.a.f.TransportForwardAction] [integTest-0] deploy model failed on all nodes, model id: l-bnSYcBQg5VYC5uIxWy
[2023-04-03T18:42:42,804][INFO ][o.o.m.a.f.TransportForwardAction] [integTest-0] deploy model done with state: DEPLOY_FAILED, model id: l-bnSYcBQg5VYC5uIxWy
[2023-04-03T18:42:42,804][INFO ][o.o.m.a.d.TransportDeployModelOnNodeAction] [integTest-0] deploy model task done m-buSYcBQg5VYC5uMhUu

What is the expected behavior?
Model should load again.

What is your host/environment?

  • OS: [e.g. iOS]
  • Version [e.g. 2.7]
  • Plugins

Do you have any screenshots?
If applicable, add screenshots to help explain your problem.

Do you have any additional context?
Add any other context about the problem.

@dhrubo-os dhrubo-os added bug Something isn't working untriaged labels Apr 4, 2023
@saratvemulapalli
Member

@dhrubo-os I'm trying to understand how ml-commons works, and this issue seemed like a good one to pick up.

I am trying to reproduce the problem, but I am unable to.
Here are the steps:

  1. ./gradlew run
  2. Toggle ml to run on data nodes
PUT  {{Host}}/_cluster/settings
{
    "persistent": {
        "plugins.ml_commons.only_run_on_ml_node": false
    }
}
  3. Upload a model
POST {{Host}}/_plugins/_ml/models/_upload
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}

Response

{
    "task_id": "pHIb2ogBrrwPAYySCj7w",
    "status": "CREATED"
}
  4. Getting the task status shows the task failed
GET {{Host}}/_plugins/_ml/tasks/_search

Response

{
    "took": 78,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": ".plugins-ml-task",
                "_id": "pHIb2ogBrrwPAYySCj7w",
                "_version": 2,
                "_seq_no": 1,
                "_primary_term": 1,
                "_score": 1.0,
                "_source": {
                    "last_update_time": 1687286385645,
                    "create_time": 1687286384263,
                    "is_async": true,
                    "function_name": "TEXT_EMBEDDING",
                    "worker_node": [
                        "bBpL_UF-SkaO6MjgEhr33w"
                    ],
                    "state": "FAILED",
                    "task_type": "DEPLOY_MODEL",
                    "error": "Native Memory Circuit Breaker is open, please check your resources!"
                }
            }
        ]
    }
}

I see "Native Memory Circuit Breaker is open, please check your resources!". What am I missing?

@dhrubo-os
Collaborator Author

{
  "persistent" : {
    "plugins.ml_commons.only_run_on_ml_node" : false,
    "plugins.ml_commons.native_memory_threshold" : 100, 
    "plugins.ml_commons.max_model_on_node": 20
  }
}

Can you please try to set all these values?
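For reference, these are applied the same way as your earlier cluster settings change (a sketch using the same {{Host}} placeholder):

PUT {{Host}}/_cluster/settings
{
    "persistent": {
        "plugins.ml_commons.only_run_on_ml_node": false,
        "plugins.ml_commons.native_memory_threshold": 100,
        "plugins.ml_commons.max_model_on_node": 20
    }
}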

@saratvemulapalli
Member

Thanks @dhrubo-os.

I was able to reproduce the problem.

[2023-06-20T13:01:03,385][ERROR][o.o.m.m.MLModelManager   ] [integTest-0] Model content hash can't match original hash value

{
                "_index": ".plugins-ml-task",
                "_id": "qXJl2ogBrrwPAYySaz6W",
                "_version": 3,
                "_seq_no": 12,
                "_primary_term": 1,
                "_score": 1.0,
                "_source": {
                    "last_update_time": 1687291263388,
                    "create_time": 1687291259798,
                    "is_async": true,
                    "function_name": "TEXT_EMBEDDING",
                    "worker_node": [
                        "bBpL_UF-SkaO6MjgEhr33w"
                    ],
                    "model_id": "pnIv2ogBrrwPAYySjz7n",
                    "state": "FAILED",
                    "task_type": "DEPLOY_MODEL",
                    "error": "{\"bBpL_UF-SkaO6MjgEhr33w\":\"model content changed\"}"
                }
            }

@saratvemulapalli
Member

saratvemulapalli commented Jun 20, 2023

@dhrubo-os added a failing IntegTest #999.
I'll go through the workflow to fix it.

@saratvemulapalli
Member

Finally, after all the breaking changes were merged, I've learned this bug only happens on macOS.
Redeploying a model on Linux/Windows works as usual; see test: #1016

@nateynateynate
Member

I'm suffering from this as well, but interestingly enough I'm only using macOS in the client environment. I'm port-forwarding HTTPS into an EC2 instance on port 5601 running the raw tarballs for 2.9. Should I still be encountering this one, or am I bumping into something else?

@saratvemulapalli
Member

@nateynateynate from what I've tested (and the integration test I added in #1016), you shouldn't see the problem when OpenSearch is running on a Linux/Windows host.

Should I still be encountering this one or am I bumping into something else?

Ideally you shouldn't; the client really shouldn't matter. Can you post your stack trace, setup, and steps to reproduce the problem?

@Saikumar282

@saratvemulapalli This is happening with Docker running on macOS as well. Any alternatives?

@saratvemulapalli
Member

@saratvemulapalli This is happening with Docker running on macOS as well. Any alternatives?

I had a chat with @Saikumar282 on the OpenSearch public Slack; this is an expected problem on Darwin, which ML-Commons doesn't officially support yet.

@nateynateynate
Member

I think we might be conflating two issues here. We're going to end up with a lot of people trying to register a model via a URL without knowing to include the model content hash value, as a result of the poor example here: https://opensearch.org/docs/latest/ml-commons-plugin/ml-framework/ . We have no official instructions on how to generate this hash value, nor do we mention it anywhere except in our call examples. This happens regardless of what OS you're on.

@dhrubo-os
Collaborator Author

@nateynateynate

https://opensearch.org/docs/latest/ml-commons-plugin/api/#request-fields

We mention it in this request-fields table.

@juntezhang

juntezhang commented Oct 11, 2023

I am having the same problem when uploading my custom model; registering the model then fails with:

{
    "task_type": "REGISTER_MODEL",
    "function_name": "TEXT_EMBEDDING",
    "state": "FAILED",
    "worker_node": [
        "Etp4k_gISGipfSbPFIZIUg"
    ],
    "create_time": 1697027593551,
    "last_update_time": 1697027809366,
    "error": "model content changed",
    "is_async": true
}

I am also running OpenSearch in Docker on macOS. I have no problems with the pre-loaded models, even when uploading them from a URL. Any idea how to fix this?

Edit: Found the solution. You need to generate the checksum of the zip file containing your custom model and pass it in the model_content_hash_value field when uploading the model. You can generate the checksum like this: shasum -a 256 paraphrase-multilingual-mpnet-base-v2.zip
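For illustration, a registration request carrying that field might look like this (a sketch; the URL, version, and model_config values here are placeholders, and the exact required fields are in the request-fields documentation linked above):

POST /_plugins/_ml/models/_register
{
  "name": "paraphrase-multilingual-mpnet-base-v2",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT",
  "model_content_hash_value": "<output of shasum -a 256 on the zip>",
  "url": "https://example.com/paraphrase-multilingual-mpnet-base-v2.zip",
  "model_config": {
    "model_type": "mpnet",
    "embedding_dimension": 768,
    "framework_type": "sentence_transformers"
  }
}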

@nateynateynate
Member

I am having the same problem when uploading my custom model; registering the model then fails with:

{
    "task_type": "REGISTER_MODEL",
    "function_name": "TEXT_EMBEDDING",
    "state": "FAILED",
    "worker_node": [
        "Etp4k_gISGipfSbPFIZIUg"
    ],
    "create_time": 1697027593551,
    "last_update_time": 1697027809366,
    "error": "model content changed",
    "is_async": true
}

I am also running OpenSearch in Docker on macOS. I have no problems with the pre-loaded models, even when uploading them from a URL. Any idea how to fix this?

Edit: Found the solution. You need to generate the checksum of the zip file containing your custom model and pass it in the model_content_hash_value field when uploading the model. You can generate the checksum like this: shasum -a 256 paraphrase-multilingual-mpnet-base-v2.zip

This is perhaps what I was doing a poor job of articulating. I don't think this issue is specific to Darwin / OSX; it happens when uploading models without a model_content_hash_value field. The field is in the list of accepted parameters, but I don't think we do a very good job explaining that the user can calculate it on their own.

Can we perhaps change the error message to give its recipients a lead? Something like: "model content changed - please use the shasum utility to re-register the model with the proper model_content_hash_value provided."

@austintlee
Collaborator

@dhrubo-os In Step 4 of the sequence you describe, you unloaded the model and in Step 5, you said you deployed it, but did you run _load followed by _deploy?

@austintlee
Collaborator

@juntezhang Can you create a separate issue for doing input validation on custom model registration?

@dhrubo-os
Collaborator Author

@dhrubo-os In Step 4 of the sequence you describe, you unloaded the model and in Step 5, you said you deployed it, but did you run _load followed by _deploy?

_load and _deploy are actually the same functionality; we are just deprecating the _load endpoint in favor of _deploy.
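For example, both of these hit the same functionality (illustrative, reusing the model id from the original report):

POST /_plugins/_ml/models/l-bnSYcBQg5VYC5uIxWy/_load
POST /_plugins/_ml/models/l-bnSYcBQg5VYC5uIxWy/_deploy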

@austintlee
Collaborator

The model hash after _unload/_load differed because the zip file in data/ml_cache/models_cache/deploy/<model_id> never got deleted, and as a result content kept getting added to the same zip file upon re-load.

FileUtils.mergeFiles(chunkFiles, modelZipFile);

I noticed that the zip file stayed and its size kept growing.

I found two issues related to this.

The first is that the security manager was disallowing setReadOnly (commons-io tries to change the file permissions to be able to delete model files and directories).

[2023-11-04T15:01:28,207][ERROR][o.o.m.e.a.DLModel        ] [integTest-0] Failed to deploy model eRPym4sBExl3TU0X9MGS
 java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "accessUserInformation")
     at java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:485) ~[?:?]
     at java.base/java.security.AccessController.checkPermission(AccessController.java:1068) ~[?:?]
     at java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:416) ~[?:?]
     at java.base/sun.nio.fs.UnixFileAttributeViews$Posix.checkReadExtended(UnixFileAttributeViews.java:186) ~[?:?]
     at java.base/sun.nio.fs.UnixFileAttributeViews$Posix.readAttributes(UnixFileAttributeViews.java:253) ~[?:?]
     at java.base/sun.nio.fs.UnixFileAttributeViews$Posix.readAttributes(UnixFileAttributeViews.java:168) ~[?:?]
     at org.apache.commons.io.file.PathUtils.setReadOnly(PathUtils.java:927) ~[commons-io-2.11.0.jar:2.11.0]
     at org.apache.commons.io.file.PathUtils.deleteFile(PathUtils.java:485) ~[commons-io-2.11.0.jar:2.11.0]
     at org.apache.commons.io.file.PathUtils.delete(PathUtils.java:392) ~[commons-io-2.11.0.jar:2.11.0]
     at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1341) ~[commons-io-2.11.0.jar:2.11.0]
     at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:324) ~[commons-io-2.11.0.jar:2.11.0]
     at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1192) ~[commons-io-2.11.0.jar:2.11.0]
     at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:256) [opensearch-ml-algorithms-3.0.0.0-SNAPSHOT.jar:?]
     at java.base/java.security.AccessController.doPrivileged(AccessController.java:569) [?:?]
     at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:242) [opensearch-ml-algorithms-3.0.0.0-SNAPSHOT.jar:?]
     at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:138) [opensearch-ml-algorithms-3.0.0.0-SNAPSHOT.jar:?]
     at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:125) [opensearch-ml-algorithms-3.0.0.0-SNAPSHOT.jar:?]
     at org.opensearch.ml.model.MLModelManager.lambda$deployModel$49(MLModelManager.java:960) [opensearch-ml-3.0.0.0-SNAPSHOT.jar:3.0.0.0-SNAPSHOT]
     at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
     at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$55(MLModelManager.java:1080) [opensearch-ml-3.0.0.0-SNAPSHOT.jar:3.0.0.0-SNAPSHOT]
     at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
     at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
     at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
     at java.base/java.lang.Thread.run(Thread.java:833) [?:?]

After I added the necessary permission to the plugin-security.policy, _unload/_load started to work.
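For reference, the grant covering the denied permission from the stack trace above looks roughly like this in plugin-security.policy (a sketch; the actual entry in the fix may differ):

grant {
  // lets commons-io read/update POSIX file attributes while deleting the model cache
  permission java.lang.RuntimePermission "accessUserInformation";
};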

While I was looking into why the delete was failing on Mac, I came across this issue:

https://issues.apache.org/jira/browse/IO-787

It says there was a regression in commons-io 2.11 (the version ml-commons is currently using), specifically related to the forceDelete method that we're using. Strangely, I don't see what code change fixed that in commons-io 2.12, but since we are currently on 2.11 and it has a known issue on Mac, I am upgrading commons-io to the latest, 2.15.
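For anyone tracking the dependency bump, it amounts to something like this in the Gradle build (illustrative; the exact file and version used in the PR may differ):

// build.gradle (sketch)
implementation group: 'commons-io', name: 'commons-io', version: '2.15.0'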

@ylwu-amzn
Collaborator

@austintlee Thanks, Austin, for the deep dive into this issue. After upgrading commons-io, is the issue gone?

@austintlee
Collaborator

That, and the plugin-security.policy change. I am not able to reproduce it after my fix.

@manzke

manzke commented Feb 26, 2024

Any update on this one? I can reproduce it with Docker on Linux (AWS).

@manzke

manzke commented Feb 26, 2024

Can it be that this was fixed with 2.12 and the commons-io upgrade? I can't test it right now, because we have a dependency on 2.11.

  1. What was the fix? Just the upgrade? Would it be possible for us to backport it?
  2. Was it fixed in the last release?

@dhrubo-os
Collaborator Author

Closing this issue; I tested on my end and I don't see this issue anymore.

@github-project-automation github-project-automation bot moved this from In Progress to Done in ml-commons projects Apr 9, 2024
@jlibx

jlibx commented Aug 9, 2024

opensearch-ml-gpu | [2024-08-09T07:17:54,585][DEBUG][o.o.m.e.u.FileUtils ] [opensearch-ml-gpu] merge 61 files into /usr/share/opensearch/data/ml_cache/models_cache/deploy/HdPgNZEBkGu7typLkQJX/cre_pt_v0_2_0_test2.zip
opensearch-ml-gpu | [2024-08-09T07:17:54,782][DEBUG][o.o.t.TransportService ] [opensearch-ml-gpu] Action: internal:coordination/fault_detection/leader_check
opensearch-ml-gpu | [2024-08-09T07:17:54,961][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 25%
opensearch-ml-gpu | [2024-08-09T07:17:54,990][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 0%
opensearch-ml-gpu | [2024-08-09T07:17:55,461][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 26%
opensearch-ml-gpu | [2024-08-09T07:17:55,491][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 6%
opensearch-ml-gpu | [2024-08-09T07:17:55,898][DEBUG][o.o.t.TransportService ] [opensearch-ml-gpu] Action: internal:coordination/fault_detection/leader_check
opensearch-ml-gpu | [2024-08-09T07:17:55,962][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 26%
opensearch-ml-gpu | [2024-08-09T07:17:55,991][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 0%
opensearch-ml-gpu | [2024-08-09T07:17:56,055][ERROR][o.o.m.m.MLModelManager ] [opensearch-ml-gpu] Model content hash can't match original hash value
opensearch-ml-gpu | [2024-08-09T07:17:56,055][DEBUG][o.o.m.m.MLModelCacheHelper] [opensearch-ml-gpu] removing model HdPgNZEBkGu7typLkQJX from cache
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.m.m.MLModelCacheHelper] [opensearch-ml-gpu] Setting the auto deploying flag for Model HdPgNZEBkGu7typLkQJX
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.m.t.MLTaskManager ] [opensearch-ml-gpu] remove ML task from cache aYjwNZEBtLXkNmkPz_gQ
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.t.TransportService ] [opensearch-ml-gpu] Action: cluster:admin/opensearch/mlinternal/forward
opensearch-ml-gpu | [2024-08-09T07:17:56,279][INFO ][o.o.m.a.d.TransportDeployModelOnNodeAction] [opensearch-ml-gpu] deploy model task done aYjwNZEBtLXkNmkPz_gQ
