[PROPOSAL] Uninstall / Rollback / Overwrite capability for resources #32
Comments
One of the things that really frustrated me when working with models in ml-commons is partial failure. When you deploy a model (in this case, a pre-trained text embedding model), deployment can "succeed" with partial deployment, meaning it succeeded on some of the ML-eligible nodes but not on all of them. I can still go ahead and use the model, but I can see this being problematic if you provisioned N nodes to handle a certain amount of workload and fewer than N nodes are actually available to do the work, while the workflow reported success on all tasks. I suspect a lot of users will not want to pay for GPU instances just for neural search and text embeddings, so when data nodes are also assigned to run ML workloads, a partial failure that is not surfaced appropriately can cause performance issues.

At the moment, I think such partial failure is not reported as failure at the workflow level? I had a quick look at the code and it looks like it only accepts a task status of SUCCESS, so maybe SUCCESS_WITH_ERROR from model deployment does fail the workflow execution. Perhaps the default should be to fail, with an option for people to override it and allow partial success at the task level.
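To make the suggested failure-handling behavior concrete, here is a minimal sketch of a deploy step that treats partial deployment as a workflow failure by default, with a per-step opt-in for partial success. All names here (`DeployStatus`, `DeployResult`, `allowPartialSuccess`) are illustrative assumptions, not actual ml-commons or flow-framework APIs:

```java
// Illustrative sketch only: these are hypothetical types, not real plugin APIs.
enum DeployStatus { SUCCESS, SUCCESS_WITH_ERROR, FAILED }

final class DeployResult {
    final DeployStatus status;
    final int deployedNodes;
    final int eligibleNodes;

    DeployResult(DeployStatus status, int deployedNodes, int eligibleNodes) {
        this.status = status;
        this.deployedNodes = deployedNodes;
        this.eligibleNodes = eligibleNodes;
    }
}

final class DeployStepPolicy {
    private final boolean allowPartialSuccess; // per-step override, default false

    DeployStepPolicy(boolean allowPartialSuccess) {
        this.allowPartialSuccess = allowPartialSuccess;
    }

    /** Fails the workflow on partial deployment unless the step opted in. */
    void check(DeployResult result) {
        if (result.status == DeployStatus.SUCCESS) {
            return; // deployed on all ML-eligible nodes
        }
        if (result.status == DeployStatus.SUCCESS_WITH_ERROR && allowPartialSuccess) {
            // surface the partial failure instead of silently reporting success
            System.err.printf("model deployed on %d of %d eligible nodes%n",
                result.deployedNodes, result.eligibleNodes);
            return;
        }
        throw new IllegalStateException("model deployment incomplete: " + result.status);
    }
}
```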
I don't know if you really mean that the AI Workflows plugin will only work with security enabled? At the moment, the conversational memory feature we put in ml-commons does not support role-based access control. I think for things like workflows, people should really use roles and not have any of the resources created by a workflow tied to a single user. But this would also mean that all resources created by AI Workflows would need to support role-based access control. Does it make sense to require a minimum level of access control on all resources? I think this would also simplify rollback.

Ideally, you want all the resources (in scope for AI Workflows) to be taggable, so that either the plugin or a human operator can easily identify things to clean up. Is materializing the DAG of a workflow idempotent? Tags and idempotency should help prevent the creation of unnecessary resources and clean-up tasks.
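Here is a rough sketch of how tagging and idempotency could fit together: resources are recorded with tags (e.g. a workflow ID), a step that re-runs with the same tags reuses the existing resource, and cleanup can query by tag. `Resource`, `ResourceStore`, and the tag keys are assumptions for illustration, not proposed APIs:

```java
// Hypothetical sketch: tag workflow-created resources for idempotent
// provisioning and easy cleanup by the plugin or a human operator.
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

final class Resource {
    final String id;
    final String type;              // e.g. "model", "index"
    final Map<String, String> tags; // e.g. workflow_id, step_id

    Resource(String id, String type, Map<String, String> tags) {
        this.id = id;
        this.type = type;
        this.tags = tags;
    }
}

final class ResourceStore {
    private final Map<String, Resource> byId = new ConcurrentHashMap<>();

    /** Idempotent: re-running a step with the same tags reuses the resource. */
    Resource getOrCreate(String type, Map<String, String> tags, Supplier<String> creator) {
        Optional<Resource> existing = byId.values().stream()
            .filter(r -> r.type.equals(type) && r.tags.equals(tags))
            .findFirst();
        if (existing.isPresent()) {
            return existing.get();
        }
        Resource created = new Resource(creator.get(), type, tags);
        byId.put(created.id, created);
        return created;
    }

    /** Everything a given workflow created, for cleanup. */
    List<Resource> findByWorkflow(String workflowId) {
        return byId.values().stream()
            .filter(r -> workflowId.equals(r.tags.get("workflow_id")))
            .toList();
    }
}
```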
Great suggestion.
It's more that if security is enabled and there are access controls on the data, then a user can't get around those controls by using a workflow; the workflow would run with the user's permissions. The idea here is that we know up front which index they'd be trying to access, and we should fast-fail if we know in advance that the user won't have access to one of the intermediate steps.
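A minimal sketch of that fast-fail idea, assuming a parsed workflow exposes the indices it will touch; `AccessChecker`, `WorkflowPlan`, and `PreflightValidator` are hypothetical names, not real plugin or security-plugin APIs:

```java
// Hypothetical preflight check: reject a workflow before any step runs if the
// submitting user lacks access to an index an intermediate step will touch.
import java.util.List;
import java.util.Set;

interface AccessChecker {
    boolean canAccess(String user, String index);
}

record WorkflowPlan(String owner, Set<String> indicesTouched) {}

final class PreflightValidator {
    private final AccessChecker checker;

    PreflightValidator(AccessChecker checker) {
        this.checker = checker;
    }

    /** Throws before execution starts instead of failing mid-run. */
    void validate(WorkflowPlan plan) {
        List<String> denied = plan.indicesTouched().stream()
            .filter(index -> !checker.canAccess(plan.owner(), index))
            .toList();
        if (!denied.isEmpty()) {
            throw new SecurityException(
                "user " + plan.owner() + " lacks access to indices: " + denied);
        }
    }
}
```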
I don't think a majority of users use security, so I don't think requiring this will work.
+1!
Closing this proposal, as most of the features proposed have been implemented in support of the 2.12.0 release: users know what was provisioned and have an easy way to deprovision it. I'll be creating a new proposal for more fine-grained deprovisioning / update, which depends on more update APIs becoming available, targeting 2.13.0.
What/Why
What are you proposing?
During execution of a workflow, this framework will create resources on the OpenSearch cluster. Users need the ability to quickly "clean up" these resources in multiple scenarios.
Such resources can include, for example, deployed ML models and indices created by workflow steps (this is a partial list).
We need a capability to uninstall, roll back, or overwrite these resources, as sketched below.
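As a rough illustration of how the three capabilities in the title differ, a deprovisioning interface might distinguish them like this (hypothetical method names, not a proposed API):

```java
// Hypothetical interface separating the three lifecycle operations.
interface ResourceLifecycle {
    /** Remove every resource a completed workflow created. */
    void uninstall(String workflowId);

    /** Undo the resources a partially failed provisioning run created. */
    void rollback(String workflowId);

    /** Re-provision, replacing resources an earlier run of this workflow created. */
    void overwrite(String workflowId);
}
```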
What problems are you trying to solve?
What is the developer experience going to be?
Are there any security considerations?
What is the user experience going to be?
Seeking community feedback! How do you want this feature to be used?
Why should it be built? Any reason not to?
If this feature is not built, developers will need to expend time and effort identifying resources to remove themselves, and/or pay for more storage to maintain unused data.
What will it take to execute?
In the near term, we should keep track of the resources created in order to enable manual removal, regardless of any more automated behavior we add later.
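One simple shape that tracking could take is a creation log that records each resource as it is provisioned and deletes in reverse order on deprovision, so dependents (e.g. a deployed model) are removed before their prerequisites. `CreatedResource`, `ProvisionLog`, and the delete hooks are assumptions for illustration only:

```java
// Hypothetical sketch: record created resources, remove in reverse order.
import java.util.ArrayDeque;
import java.util.Deque;

final class CreatedResource {
    final String type;
    final String id;
    final Runnable delete; // how to remove this resource

    CreatedResource(String type, String id, Runnable delete) {
        this.type = type;
        this.id = id;
        this.delete = delete;
    }
}

final class ProvisionLog {
    private final Deque<CreatedResource> created = new ArrayDeque<>();

    void record(CreatedResource resource) {
        created.push(resource); // most recent first
    }

    /** Best-effort deprovision in reverse creation order. */
    void deprovisionAll() {
        while (!created.isEmpty()) {
            CreatedResource r = created.pop();
            try {
                r.delete.run();
            } catch (RuntimeException e) {
                System.err.printf("failed to delete %s %s: %s%n", r.type, r.id, e);
            }
        }
    }
}
```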
We should research best practices for installation scripts, as this is not a new problem, just a new problem space.
Actual execution will require a detailed design that will be created based on feedback to this proposal.
Any remaining open questions?