[BUG] Orphaned ML Models #1179
Comments
There are a few points of confusion here. I'll clarify them one by one.
Also, if you want to upload a pre-trained model, you can check this page
In 2.8, we introduced model groups. The ideal scenario is to create a model group first and then provide its model group ID during model registration. To keep this flexible for customers, we made the model group ID optional during registration. If the model upload/registration request does not include a model group ID, we internally create a model group with the same name as the model and put the model under that group. We expect model names to be unique. So what happened in this case is: the first time you uploaded the model, it created a model group with the model's name. When you tried to upload the same model again, it tried to create another model group with the same name and threw that error, because model group names must be unique. Model group related documentation:
I hope this clarifies the confusion.
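For readers following along, the behavior described above can be sketched as a small in-memory toy model. This is not the ML Commons implementation: the `ToyRegistry` class, the `ConflictError` exception, and the `group-N` ID format are all hypothetical names made up for illustration. It only mirrors the rule stated in the comment: registering without a model group ID implicitly creates a group named after the model, and group names must be unique.

```python
class ConflictError(Exception):
    """Raised when a model group name is already taken (hypothetical)."""


class ToyRegistry:
    """Toy stand-in for the model group / model registration logic."""

    def __init__(self):
        self.groups = {}   # group name -> group id
        self.models = []   # list of (model name, group id)
        self._counter = 0

    def create_group(self, name):
        # Model group names are unique, so a second group with the
        # same name is rejected.
        if name in self.groups:
            raise ConflictError(f"model group name '{name}' already exists")
        self._counter += 1
        group_id = f"group-{self._counter}"
        self.groups[name] = group_id
        return group_id

    def register_model(self, name, model_group_id=None):
        if model_group_id is None:
            # No group ID supplied: implicitly create a group that
            # carries the same name as the model.
            model_group_id = self.create_group(name)
        elif model_group_id not in self.groups.values():
            raise KeyError(f"unknown model group id '{model_group_id}'")
        self.models.append((name, model_group_id))
        return model_group_id


registry = ToyRegistry()
first_group = registry.register_model("my-model")   # implicit group created

try:
    registry.register_model("my-model")             # implicit group again
except ConflictError as err:
    print("second registration failed:", err)

# Passing the existing group ID avoids the implicit group creation.
registry.register_model("my-model", model_group_id=first_group)
```

In this sketch the second call fails for the same reason described in the comment, and the third call succeeds because an explicit group ID skips the implicit group creation.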
@dhrubo-os That makes sense. But the root cause given in the error really doesn't explain the problem. We should update our error message to say "conflict with model group ID, please retry with a different model group". That would be helpful. A couple of questions:
@saratvemulapalli Yeah I completely agree with this. We tried to improve the error message in this PR
Yes, I agree we should clean up the model group if the model is not registered successfully. @rbhavna we don't do this now, right?
Okay, I'm afk, but I will give this thread a thorough read. In addition, if model groups are getting implicitly created, we should update the documentation on this page: https://opensearch.org/docs/latest/ml-commons-plugin/ml-framework/ I had been following that page thinking it was a complete set of documentation, which is a bit misleading. Perhaps we should rethink how this documentation is broken up. Happy to put a proposal out there since I am working through this at the moment.
@dtaivpp I think it would help a lot if you could help improve the documentation. You can open an issue here: https://github.com/opensearch-project/documentation-website/issues
Closing this as a majority of the issues were tied to the error message.
What is the bug?
When a model upload fails there is an issue where the model still exists but it cannot be removed/unloaded.
How can one reproduce the bug?
Steps to reproduce the behavior:
This creates a Task ID that can be tracked. Viewing the Task ID shows that it has clearly failed.
The above output indicates that the TaskID we had been given is actually the ModelID. Attempting to _unload the model with what I believe is really the TaskID yields the following:
We cannot load a new model with that name, however, as it is still in the system:
What is the expected behavior?
If a model creation fails I expect to either be able to delete the failed upload or to upload a new model with the same name without issues.
What is your host/environment?