
[BUG] Model content hash can't match original hash value #844

Closed
dhrubo-os opened this issue Apr 4, 2023 · 23 comments
Labels
bug Something isn't working

Comments

@dhrubo-os
Collaborator

What is the bug?
Model content hash can't match original hash value

How can one reproduce the bug?

I tried with the current code base. I executed the command ./gradlew run to test.

  1. First I upload the model
POST /_plugins/_ml/models/_upload
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}
  2. Then I load the model in memory. This time it works fine: http://localhost:9200/_plugins/_ml/models/l-bnSYcBQg5VYC5uIxWy/_load

  3. I also generated embeddings with the loaded model, and that works fine.

  4. Now I unload the model: http://localhost:9200/_plugins/_ml/models/l-bnSYcBQg5VYC5uIxWy/_unload

  5. And now I try to deploy again. This time I face the error:

[2023-04-03T18:42:42,801][ERROR][o.o.m.m.MLModelManager   ] [integTest-0] Model content hash can't match original hash value
[2023-04-03T18:42:42,804][ERROR][o.o.m.a.f.TransportForwardAction] [integTest-0] deploy model failed on all nodes, model id: l-bnSYcBQg5VYC5uIxWy
[2023-04-03T18:42:42,804][INFO ][o.o.m.a.f.TransportForwardAction] [integTest-0] deploy model done with state: DEPLOY_FAILED, model id: l-bnSYcBQg5VYC5uIxWy
[2023-04-03T18:42:42,804][INFO ][o.o.m.a.d.TransportDeployModelOnNodeAction] [integTest-0] deploy model task done m-buSYcBQg5VYC5uMhUu

What is the expected behavior?
Model should load again.

What is your host/environment?

  • OS: [e.g. iOS]
  • Version [e.g. 2.7]
  • Plugins

Do you have any screenshots?
If applicable, add screenshots to help explain your problem.

Do you have any additional context?
Add any other context about the problem.

@dhrubo-os dhrubo-os added bug Something isn't working untriaged labels Apr 4, 2023
@saratvemulapalli
Member

@dhrubo-os I'm trying to understand how ml-commons works, and this issue seemed like a good one to pick up.

I am trying to reproduce the problem, but I am unable to.
Here are the steps:

  1. ./gradlew run
  2. Toggle ml to run on data nodes
PUT  {{Host}}/_cluster/settings
{
    "persistent": {
        "plugins.ml_commons.only_run_on_ml_node": false
    }
}
  3. Upload a model
POST {{Host}}/_plugins/_ml/models/_upload
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}

Response

{
    "task_id": "pHIb2ogBrrwPAYySCj7w",
    "status": "CREATED"
}
  4. Getting the task status shows the task failed
GET {{Host}}/_plugins/_ml/tasks/_search

Response

{
    "took": 78,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": ".plugins-ml-task",
                "_id": "pHIb2ogBrrwPAYySCj7w",
                "_version": 2,
                "_seq_no": 1,
                "_primary_term": 1,
                "_score": 1.0,
                "_source": {
                    "last_update_time": 1687286385645,
                    "create_time": 1687286384263,
                    "is_async": true,
                    "function_name": "TEXT_EMBEDDING",
                    "worker_node": [
                        "bBpL_UF-SkaO6MjgEhr33w"
                    ],
                    "state": "FAILED",
                    "task_type": "DEPLOY_MODEL",
                    "error": "Native Memory Circuit Breaker is open, please check your resources!"
                }
            }
        ]
    }
}

I see "Native Memory Circuit Breaker is open, please check your resources!". What am I missing?

@dhrubo-os
Collaborator Author

{
  "persistent" : {
    "plugins.ml_commons.only_run_on_ml_node" : false,
    "plugins.ml_commons.native_memory_threshold" : 100, 
    "plugins.ml_commons.max_model_on_node": 20
  }
}

Can you please try to set all these values?
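For reference, these are applied the same way as your earlier cluster settings change (a sketch using the same {{Host}} placeholder):

PUT {{Host}}/_cluster/settings
{
    "persistent": {
        "plugins.ml_commons.only_run_on_ml_node": false,
        "plugins.ml_commons.native_memory_threshold": 100,
        "plugins.ml_commons.max_model_on_node": 20
    }
}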

@saratvemulapalli
Member

Thanks @dhrubo-os.

I was able to reproduce the problem.

[2023-06-20T13:01:03,385][ERROR][o.o.m.m.MLModelManager   ] [integTest-0] Model content hash can't match original hash value

{
                "_index": ".plugins-ml-task",
                "_id": "qXJl2ogBrrwPAYySaz6W",
                "_version": 3,
                "_seq_no": 12,
                "_primary_term": 1,
                "_score": 1.0,
                "_source": {
                    "last_update_time": 1687291263388,
                    "create_time": 1687291259798,
                    "is_async": true,
                    "function_name": "TEXT_EMBEDDING",
                    "worker_node": [
                        "bBpL_UF-SkaO6MjgEhr33w"
                    ],
                    "model_id": "pnIv2ogBrrwPAYySjz7n",
                    "state": "FAILED",
                    "task_type": "DEPLOY_MODEL",
                    "error": "{\"bBpL_UF-SkaO6MjgEhr33w\":\"model content changed\"}"
                }
            }

@saratvemulapalli
Member

saratvemulapalli commented Jun 20, 2023

@dhrubo-os added a failing IntegTest #999.
I'll go through the workflow to fix it.

@saratvemulapalli
Member

Finally, after all the breaking changes were merged, I've learned this bug only happens on macOS.
Redeploying a model on Linux/Windows works as usual; see test: #1016

@nateynateynate
Member

I'm suffering from this as well, but interestingly enough I'm only using macOS in the client environment. I'm port-forwarding HTTPS into an EC2 instance on port 5601 running the raw tarballs for 2.9. Should I still be encountering this one, or am I bumping into something else?

@saratvemulapalli
Member

@nateynateynate from what I've tested (and the integration test I added in #1016), you shouldn't see the problem when OpenSearch is running on a Linux/Windows host.

Should I still be encountering this one or am I bumping into something else?

Ideally you shouldn't; the client really shouldn't matter. Can you post your stack trace, setup, and steps to reproduce the problem?

@Saikumar282

@saratvemulapalli This is happening with Docker running on macOS as well. Any alternatives?

@saratvemulapalli
Member

@saratvemulapalli This is happening with Docker running on macOS as well. Any alternatives?

I had a chat with @Saikumar282 on the OpenSearch public Slack; this is an expected problem on Darwin, which ML-Commons doesn't officially support yet.

@nateynateynate
Member

I think we might be conflating two issues here. We're going to end up with a lot of people trying to register a model via a URL without knowing to include the model content hash value, as a result of the poor example here: https://opensearch.org/docs/latest/ml-commons-plugin/ml-framework/ . We have no official instructions on how to generate this hash value, nor do we mention it anywhere except in our call examples. This happens regardless of what OS you're on.

@dhrubo-os
Collaborator Author

@nateynateynate

https://opensearch.org/docs/latest/ml-commons-plugin/api/#request-fields

We mention it in this request-fields table.

@juntezhang

juntezhang commented Oct 11, 2023

I am having the same problem when uploading my custom model; registering the model then fails with:

{
    "task_type": "REGISTER_MODEL",
    "function_name": "TEXT_EMBEDDING",
    "state": "FAILED",
    "worker_node": [
        "Etp4k_gISGipfSbPFIZIUg"
    ],
    "create_time": 1697027593551,
    "last_update_time": 1697027809366,
    "error": "model content changed",
    "is_async": true
}

I am also running OpenSearch in Docker on macOS. I have no problems with the pre-loaded models, even when uploading them from a URL. Any idea how to fix this?

Edit: Found the solution. You need to generate the checksum of the zip file containing your custom model and pass it in the model_content_hash_value field when uploading the model. You can generate the checksum like this: shasum -a 256 paraphrase-multilingual-mpnet-base-v2.zip
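For illustration, a registration request carrying that field might look like this (a sketch; the URL, version, and model_config values here are placeholders, and the exact required fields are in the request-fields documentation linked above):

POST /_plugins/_ml/models/_register
{
  "name": "paraphrase-multilingual-mpnet-base-v2",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT",
  "model_content_hash_value": "<output of shasum -a 256 on the zip>",
  "url": "https://example.com/paraphrase-multilingual-mpnet-base-v2.zip",
  "model_config": {
    "model_type": "mpnet",
    "embedding_dimension": 768,
    "framework_type": "sentence_transformers"
  }
}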

@nateynateynate
Member

I am having the same problem when uploading my custom model; registering the model then fails with:

{
    "task_type": "REGISTER_MODEL",
    "function_name": "TEXT_EMBEDDING",
    "state": "FAILED",
    "worker_node": [
        "Etp4k_gISGipfSbPFIZIUg"
    ],
    "create_time": 1697027593551,
    "last_update_time": 1697027809366,
    "error": "model content changed",
    "is_async": true
}

I am also running OpenSearch in Docker on macOS. I have no problems with the pre-loaded models, even when uploading them from a URL. Any idea how to fix this?

Edit: Found the solution. You need to generate the checksum of the zip file containing your custom model and pass it in the model_content_hash_value field when uploading the model. You can generate the checksum like this: shasum -a 256 paraphrase-multilingual-mpnet-base-v2.zip

This is perhaps what I was doing a poor job of articulating. I don't think this issue is specific to Darwin / OSX; it happens when uploading models without a model_content_hash_value field. The field is in the list of accepted parameters, but I don't think we do a very good job explaining that the user can calculate it on their own.

Can we perhaps change the error message to give its recipients a lead? Something like: "model content changed - please use the shasum utility to re-register the model with the proper model_content_hash_value provided."

@austintlee
Collaborator

@dhrubo-os In Step 4 of the sequence you describe, you unloaded the model and in Step 5, you said you deployed it, but did you run _load followed by _deploy?

@austintlee
Collaborator

@juntezhang Can you create a separate issue for doing input validation on custom model registration?

@dhrubo-os
Collaborator Author

@dhrubo-os In Step 4 of the sequence you describe, you unloaded the model and in Step 5, you said you deployed it, but did you run _load followed by _deploy?

_load and _deploy are actually the same functionality; we are just deprecating the _load endpoint in favor of _deploy.
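For example, both of these hit the same functionality (illustrative, reusing the model id from the original report):

POST /_plugins/_ml/models/l-bnSYcBQg5VYC5uIxWy/_load
POST /_plugins/_ml/models/l-bnSYcBQg5VYC5uIxWy/_deploy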

@austintlee
Collaborator

The model hash after _unload/_load differed because the zip file in data/ml_cache/models_cache/deploy/<model_id> never got deleted, and as a result content kept getting added to the same zip file upon re-load.

FileUtils.mergeFiles(chunkFiles, modelZipFile);

I noticed that the zip file stayed and its size kept growing.

I found two issues related to this.

The first is that the security manager was disallowing setReadOnly (commons-io tries to change the file permissions to be able to delete model files and directories).

[2023-11-04T15:01:28,207][ERROR][o.o.m.e.a.DLModel        ] [integTest-0] Failed to deploy model eRPym4sBExl3TU0X9MGS
 java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "accessUserInformation")
     at java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:485) ~[?:?]
     at java.base/java.security.AccessController.checkPermission(AccessController.java:1068) ~[?:?]
     at java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:416) ~[?:?]
     at java.base/sun.nio.fs.UnixFileAttributeViews$Posix.checkReadExtended(UnixFileAttributeViews.java:186) ~[?:?]
     at java.base/sun.nio.fs.UnixFileAttributeViews$Posix.readAttributes(UnixFileAttributeViews.java:253) ~[?:?]
     at java.base/sun.nio.fs.UnixFileAttributeViews$Posix.readAttributes(UnixFileAttributeViews.java:168) ~[?:?]
     at org.apache.commons.io.file.PathUtils.setReadOnly(PathUtils.java:927) ~[commons-io-2.11.0.jar:2.11.0]
     at org.apache.commons.io.file.PathUtils.deleteFile(PathUtils.java:485) ~[commons-io-2.11.0.jar:2.11.0]
     at org.apache.commons.io.file.PathUtils.delete(PathUtils.java:392) ~[commons-io-2.11.0.jar:2.11.0]
     at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1341) ~[commons-io-2.11.0.jar:2.11.0]
     at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:324) ~[commons-io-2.11.0.jar:2.11.0]
     at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1192) ~[commons-io-2.11.0.jar:2.11.0]
     at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:256) [opensearch-ml-algorithms-3.0.0.0-SNAPSHOT.jar:?]
     at java.base/java.security.AccessController.doPrivileged(AccessController.java:569) [?:?]
     at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:242) [opensearch-ml-algorithms-3.0.0.0-SNAPSHOT.jar:?]
     at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:138) [opensearch-ml-algorithms-3.0.0.0-SNAPSHOT.jar:?]
     at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:125) [opensearch-ml-algorithms-3.0.0.0-SNAPSHOT.jar:?]
     at org.opensearch.ml.model.MLModelManager.lambda$deployModel$49(MLModelManager.java:960) [opensearch-ml-3.0.0.0-SNAPSHOT.jar:3.0.0.0-SNAPSHOT]
     at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
     at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$55(MLModelManager.java:1080) [opensearch-ml-3.0.0.0-SNAPSHOT.jar:3.0.0.0-SNAPSHOT]
     at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
     at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
     at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
     at java.base/java.lang.Thread.run(Thread.java:833) [?:?]

After I added the necessary permission to the plugin-security.policy, _unload/_load started to work.
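For reference, the grant covering the denied permission from the stack trace above looks roughly like this in plugin-security.policy (a sketch; the actual entry in the fix may differ):

grant {
  // lets commons-io read/update POSIX file attributes while deleting the model cache
  permission java.lang.RuntimePermission "accessUserInformation";
};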

While I was looking into why the delete was failing on Mac, I came across this issue:

https://issues.apache.org/jira/browse/IO-787

It says there was a regression in commons-io 2.11 (the version ml-commons is currently using), specifically related to the forceDelete method that we're using. Strangely, I don't see what code change fixed that in commons-io 2.12, but since we are currently on 2.11 and it has a known issue on Mac, I am upgrading commons-io to the latest, 2.15.
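For anyone tracking the dependency bump, it amounts to something like this in the Gradle build (illustrative; the exact file and version used in the PR may differ):

// build.gradle (sketch)
implementation group: 'commons-io', name: 'commons-io', version: '2.15.0'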

@ylwu-amzn
Collaborator

@austintlee Thanks, Austin, for the deep dive into this issue. After upgrading commons-io, is the issue gone?

@austintlee
Collaborator

That, and the plugin-security.policy change. I am not able to reproduce it after my fix.

@manzke

manzke commented Feb 26, 2024

Any update on this one? I can reproduce it with Docker on Linux (AWS).

@manzke

manzke commented Feb 26, 2024

Can it be that this was fixed with 2.12 and the commons-io upgrade? I can't test it right now, because we have a dependency on 2.11.

  1. What was the fix? Just the upgrade? Would it be possible for us to backport it?
  2. Was it fixed in the last release?

@dhrubo-os
Collaborator Author

Closing this issue; I tested on my end and I don't see this issue anymore.

@github-project-automation github-project-automation bot moved this from In Progress to Done in ml-commons projects Apr 9, 2024
@jlibx

jlibx commented Aug 9, 2024

opensearch-ml-gpu | [2024-08-09T07:17:54,585][DEBUG][o.o.m.e.u.FileUtils ] [opensearch-ml-gpu] merge 61 files into /usr/share/opensearch/data/ml_cache/models_cache/deploy/HdPgNZEBkGu7typLkQJX/cre_pt_v0_2_0_test2.zip
opensearch-ml-gpu | [2024-08-09T07:17:54,782][DEBUG][o.o.t.TransportService ] [opensearch-ml-gpu] Action: internal:coordination/fault_detection/leader_check
opensearch-ml-gpu | [2024-08-09T07:17:54,961][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 25%
opensearch-ml-gpu | [2024-08-09T07:17:54,990][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 0%
opensearch-ml-gpu | [2024-08-09T07:17:55,461][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 26%
opensearch-ml-gpu | [2024-08-09T07:17:55,491][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 6%
opensearch-ml-gpu | [2024-08-09T07:17:55,898][DEBUG][o.o.t.TransportService ] [opensearch-ml-gpu] Action: internal:coordination/fault_detection/leader_check
opensearch-ml-gpu | [2024-08-09T07:17:55,962][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 26%
opensearch-ml-gpu | [2024-08-09T07:17:55,991][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 0%
opensearch-ml-gpu | [2024-08-09T07:17:56,055][ERROR][o.o.m.m.MLModelManager ] [opensearch-ml-gpu] Model content hash can't match original hash value
opensearch-ml-gpu | [2024-08-09T07:17:56,055][DEBUG][o.o.m.m.MLModelCacheHelper] [opensearch-ml-gpu] removing model HdPgNZEBkGu7typLkQJX from cache
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.m.m.MLModelCacheHelper] [opensearch-ml-gpu] Setting the auto deploying flag for Model HdPgNZEBkGu7typLkQJX
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.m.t.MLTaskManager ] [opensearch-ml-gpu] remove ML task from cache aYjwNZEBtLXkNmkPz_gQ
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.t.TransportService ] [opensearch-ml-gpu] Action: cluster:admin/opensearch/mlinternal/forward
opensearch-ml-gpu | [2024-08-09T07:17:56,279][INFO ][o.o.m.a.d.TransportDeployModelOnNodeAction] [opensearch-ml-gpu] deploy model task done aYjwNZEBtLXkNmkPz_gQ
