Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Orphaned ML Models #1179

Closed
dtaivpp opened this issue Aug 2, 2023 · 6 comments
Closed

[BUG] Orphaned ML Models #1179

dtaivpp opened this issue Aug 2, 2023 · 6 comments
Labels
documentation Improvements or additions to documentation

Comments

@dtaivpp
Copy link
Contributor

dtaivpp commented Aug 2, 2023

What is the bug?
When a model upload fails there is an issue where the model still exists but it cannot be removed/unloaded.

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Upload a model on (what I guess is an unsupported platform?) M1 macbook.
POST /_plugins/_ml/models/_upload 
{
  "name": "TEST_MODEL",
  "version": "1.0.0",
  "description": "MiniLM",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "sentence_transformers"
  },
  "url": "https://github.com/opensearch-project/ml-commons/raw/2.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip?raw=true"
}

This creates a Task ID that can be tracked. Viewing the task id it has clearly failed.
Screenshot 2023-08-02 at 4 31 27 PM

The above outputs indicates the TaskID we had been given is actually the ModelID. Attempting to _unload the model with what I believe is really the TaskID yields the following:
Screenshot 2023-08-02 at 4 36 02 PM

We cannot load a new model with that name however as it is still in the system:
Screenshot 2023-08-02 at 4 33 01 PM

What is the expected behavior?
If a model creation fails I expect to either be able to delete the failed upload or to upload a new model with the same name without issues.

What is your host/environment?

  • OS: MacOS (M1)
  • Version 2.9
@dhrubo-os
Copy link
Collaborator

There are few confusions. I'm going to clarify all those one by one.

  1. Model content changed issue:
    The post request object you used is not complete. To maintain security, we introduced model_content_hash_value which is needed to provide in the post request. During model registration, internally we calculate the hash value of the model file and then compare with the given hash value to make sure there wasn't any kind of security breach happened during the uploading. For the right post request, please check this example

Also if you want to upload any pre-trained model, you can check this page

  1. About We cannot load a new model with that name however as it is still in the system: :

In 2.8, we introduced model group.

Ideal scenario is: We create a model group and then provide the model group id during the model registration.

To make it flexible for customer, we kept the model group id optional during the model registration. So in model upload/registration request, if there's no model group id is provided, then internally we create a model group with the same name of the model and put that model under that group. We expect model name to be unique.

So in this case what happened is, first time when you uploaded a model it created a model group with the same model name and then when you tried to upload the same model again, it tried to create another model group with the same name and threw that error because model group name is unique.

Model group related documentation:

  1. https://opensearch.org/docs/latest/ml-commons-plugin/model-access-control/#registering-a-public-model-group
  2. https://opensearch-project.github.io/opensearch-py-ml/examples/demo_ml_commons_integration.html#Step-1:-Upload-NLP-model-from-local-file-to-Opensearch-cluster

I hope this clarifies the confusion.

@saratvemulapalli
Copy link
Member

@dhrubo-os it makes sense. But from the error root cause, it really doesn't explain the problem. We should update our error message to say "conflict with model group ID, please retry with a different model group". This would be helpful.

Couple of questions:

  • If we implicitly create a model group, is there anyway we could notify the user that we created a model group with name xyz (assuming the API is async). I believe it would be a better customer experience.
  • When the model _upload failed, shouldn't we clean up the model group we created ? Ideally that would be the cleanest experience. For the user it feels like there is lingering context which they dont know :/.

@dhrubo-os
Copy link
Collaborator

@saratvemulapalli Yeah I completely agree with this. We tried to improve the error message in this PR

When the model _upload failed, shouldn't we clean up the model group we created ? Ideally that would be the cleanest experience. For the user it feels like there is lingering context which they dont know :/.

Yes, I agree we should clean up the model group is the model is not registered successfully. @rbhavna we don't do this now, right?

@dtaivpp
Copy link
Contributor Author

dtaivpp commented Aug 9, 2023

Okay, afk but I will give this thread a thorough read to see. In addition if model groups are getting implicitly created we should update the documentation on this page:

https://opensearch.org/docs/latest/ml-commons-plugin/ml-framework/

I had been following that thinking it was a complete set of documentation which is a bit misleading. Perhaps we should rethink how this documentation is broken up. Happy to put a proposal out there since I am working through this at the moment.

@ylwu-amzn
Copy link
Collaborator

@dtaivpp I think it can help a lot if you can help improving the document. You can cut issue here https://github.com/opensearch-project/documentation-website/issues

@ylwu-amzn ylwu-amzn added documentation Improvements or additions to documentation and removed bug Something isn't working untriaged labels Aug 14, 2023
@dtaivpp
Copy link
Contributor Author

dtaivpp commented Jun 5, 2024

Closing this as a majority of the issues were tied to the error message.

@dtaivpp dtaivpp closed this as completed Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

No branches or pull requests

4 participants