[PROPOSAL] Uninstall / Rollback / Overwrite capability for resources #32

Closed
dbwiddis opened this issue Sep 13, 2023 · 3 comments · Fixed by #271
Labels: discuss, enhancement (New feature or request), v2.12.0

Comments

@dbwiddis (Member)

What/Why

What are you proposing?

During execution of a workflow, this framework will create resources on the OpenSearch cluster. Users need the ability to quickly "clean up" these resources for multiple scenarios:

  • After experimenting/prototyping with different approaches or models
  • On partial failure of a workflow

Such resources can include (this is a partial list):

  • Indices
  • ML Models
  • Search and ingest pipelines
  • Workflow templates and sub-templates

We need a capability to:

  1. Track what resources have been created in a workflow execution (a minimal sketch of such tracking follows this list)
  2. Automate the removal of these resources (if possible)
  3. Provide users with a specific list of the resources created, to enable manual removal in case automated removal fails or they want more control over the process.
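
A minimal sketch, in Java, of what the tracking in item 1 could look like. The class and field names here are hypothetical illustrations, not the plugin's actual state schema:

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public final class ResourceTracker {

    /** One resource created by a workflow step, e.g. an index, model, or pipeline. */
    public record CreatedResource(String workflowStepId, String resourceType, String resourceId, Instant createdAt) {}

    private final List<CreatedResource> created = new ArrayList<>();

    /** Record a resource immediately after the step that created it succeeds. */
    public void track(String stepId, String type, String id) {
        created.add(new CreatedResource(stepId, type, id, Instant.now()));
    }

    /** The full list, newest first: the usual order for automated rollback or manual removal. */
    public List<CreatedResource> inRollbackOrder() {
        List<CreatedResource> reversed = new ArrayList<>(created);
        Collections.reverse(reversed);
        return List.copyOf(reversed);
    }
}
```

Persisting something like this per workflow execution covers items 2 and 3 as well: the same list can drive automated removal and can be returned to the user for manual cleanup.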

What problems are you trying to solve?

  • When a user no longer needs the resources supporting a workflow, we should free up the storage allocated to them.
  • When a user iterates on a workflow design, we want to selectively replace only the changed portions.
  • When a user experiences an exception executing a workflow, we want them to have the ability to roll back the cluster to its state prior to the failed execution.

What is the developer experience going to be?

  • When executing a workflow, the developer should have options to automatically roll back on failure.
  • When no longer using a workflow, the developer should have an API to "uninstall" it completely (see the sketch after this list).
  • The developer should have access to a complete list of resources created to support a workflow so they can roll back manually.
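
For illustration, a minimal sketch of that "uninstall" sequence from a client, using only the JDK HTTP client. The endpoint paths mirror the deprovision capability referenced in the closing comment below, but the exact paths, host, port, and workflow id are placeholder assumptions, not a spec:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class UninstallWorkflow {
    public static void main(String[] args) throws Exception {
        String base = "http://localhost:9200/_plugins/_flow_framework/workflow/"; // placeholder host/port
        String workflowId = "my-workflow-id";                                      // placeholder id
        HttpClient client = HttpClient.newHttpClient();

        // Step 1: remove the resources the workflow provisioned (indices, models, pipelines, ...).
        HttpRequest deprovision = HttpRequest.newBuilder(URI.create(base + workflowId + "/_deprovision"))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
        System.out.println(client.send(deprovision, HttpResponse.BodyHandlers.ofString()).body());

        // Step 2: delete the stored workflow template itself.
        HttpRequest delete = HttpRequest.newBuilder(URI.create(base + workflowId))
            .DELETE()
            .build();
        System.out.println(client.send(delete, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```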

Are there any security considerations?

  • Workflows will require user identity to access some data associated with their execution. While automated rollback on failure would still have the user identity, a later "uninstall" may not have permission to remove all resources.

What is the user experience going to be?

Seeking community feedback! How do you want this feature to be used?

Why should it be built? Any reason not to?

If this feature is not built, developers will need to expend time and effort identifying resources to remove themselves, and/or pay for more storage to maintain unused data.

What will it take to execute?

In the near term we should keep track of resources created in order to enable manual removal, regardless of any more automatic behavior.

We should research best practices for installation scripts, as this is not a new problem, just a new problem space.

Actual execution will require a detailed design that will be created based on feedback to this proposal.

Any remaining open questions?

  • Could such a feature be useful as an OpenSearch core feature? For example, if we had a global parameter (similar to the "pretty" and "verbose" parameters on URIs) for a "resource group", we could apply that to every request, and that capability would be useful even outside of an automation framework.
  • Any "automated deletion" functionality carries significant risk of inadvertent data loss. Can/should we integrate this functionality with snapshots / other backup, to enable recovery of an accidental deletion?
@austintlee

One of the things that really frustrated me when working with models in ml-commons is partial failure. When you deploy a model (in this case, a pre-trained text embedding model), deployment can "succeed" partially, meaning it succeeded on some of the ML-eligible nodes but not on all of them. I can still go ahead and use the model, but I can see this being problematic if you set up N nodes to handle a certain amount of workload and fewer than N nodes are actually available, even though the workflow reported success on all tasks.

I suspect a lot of users will not want to pay for GPU instances just for neural search and text embeddings, so when data nodes are also assigned to run ML workloads, partial failure not being surfaced appropriately can cause performance issues. At the moment, I think such partial failure is not reported as a failure at the workflow level? I had a quick look at the code and it looks like it only accepts a task status of SUCCESS, so maybe SUCCESS_WITH_ERROR from model deployment does fail the workflow execution. Maybe the default is to fail, but people can override it with an option to allow partial success at the task level.
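
A minimal sketch of that last suggestion: fail the workflow on partial model deployment by default, with a per-task opt-in for partial success. The status strings come from the comment above, and the opt-in flag is hypothetical rather than an actual ml-commons or Flow Framework identifier:

```java
public final class TaskStatusPolicy {

    /** Decide whether a task's reported status counts as success for the workflow. */
    public static boolean isTaskSuccessful(String taskStatus, boolean allowPartialSuccess) {
        return switch (taskStatus) {
            case "SUCCESS" -> true;
            // Model deployed on some, but not all, ML-eligible nodes.
            case "SUCCESS_WITH_ERROR" -> allowPartialSuccess;
            default -> false;
        };
    }

    public static void main(String[] args) {
        // Default: partial deployment fails the workflow step.
        System.out.println(isTaskSuccessful("SUCCESS_WITH_ERROR", false)); // false
        // Opt-in (hypothetical per-task option): partial deployment is accepted.
        System.out.println(isTaskSuccessful("SUCCESS_WITH_ERROR", true));  // true
    }
}
```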

Workflows will require user identity to access some data associated with their execution.

I don't know if you really mean that the AI Workflow plugin will only work with security enabled?

At the moment, the conversational memory feature we put in ml-commons does not support role-based access control. I think for things like workflows, people should really use roles and not have any of the resources created by a workflow tied to a single user. But this would also mean that all resources created by AI Workflows would need to support role-based access control. Does this make sense? To require a minimum level of access control mechanism on all resources? I think this would also simplify rollback.

Ideally, you want all the resources (in scope for AI Workflows) taggable so it is easy for both the plugin or some human operator to easily identify things to clean up. Is materializing the DAG of a workflow idempotent? Tags and idempotency should help to prevent creation of unnecessary resources and clean-up tasks.
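
A minimal sketch of the tag-plus-idempotency idea, using a plain in-memory set as a stand-in for the cluster and hypothetical helper names: every resource name carries a workflow tag, provisioning skips anything that already exists, and cleanup keys off the tag.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

public final class TaggedProvisioner {

    /** Deterministic, tag-bearing name, so re-running the same workflow resolves to the same resource. */
    static String taggedName(String workflowTag, String resourceName) {
        return workflowTag + "-" + resourceName;
    }

    /** Create the resource only if it does not already exist (idempotent provisioning). */
    static boolean provisionIfAbsent(Set<String> existingResources, String workflowTag, String resourceName) {
        return existingResources.add(taggedName(workflowTag, resourceName)); // false if already provisioned
    }

    /** Everything the plugin or a human operator should clean up for one workflow. */
    static Set<String> resourcesToCleanUp(Set<String> existingResources, String workflowTag) {
        return existingResources.stream()
                .filter(name -> name.startsWith(workflowTag + "-"))
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Set<String> cluster = new HashSet<>();
        provisionIfAbsent(cluster, "wf-demo", "knn-index");       // created
        provisionIfAbsent(cluster, "wf-demo", "knn-index");       // no-op on re-run
        provisionIfAbsent(cluster, "wf-demo", "ingest-pipeline"); // created
        System.out.println(resourcesToCleanUp(cluster, "wf-demo")); // both wf-demo resources
    }
}
```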

@dbwiddis (Member, Author)

I had a quick look at the code and it looks like it only accepts task status of SUCCESS so maybe SUCCESS_WITH_ERROR from model deployment does fail the workflow execution.

Great suggestion.

I don't know if you really mean that the AI Workflow plugin will only work with security enabled?

More that if security is enabled and there are access controls on the data, then the user can't get around those controls by using a workflow; the workflow would run with the user's permissions. The idea here is that we know up front which index they'd be trying to access, so we should fast-fail if we know in advance that the user won't have access to one of the intermediate steps.
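
A minimal sketch of that fast-fail check, with a hypothetical AccessChecker interface standing in for whatever the security layer actually exposes (hasIndexAccess is not a real API):

```java
import java.util.List;

public final class PreflightAccessCheck {

    interface AccessChecker {
        boolean hasIndexAccess(String user, String index); // hypothetical stand-in
    }

    /** Throws before any resource is created if the user lacks access to a required index. */
    static void validate(AccessChecker checker, String user, List<String> requiredIndices) {
        for (String index : requiredIndices) {
            if (!checker.hasIndexAccess(user, index)) {
                throw new IllegalStateException(
                    "Fast-fail: user '" + user + "' cannot access required index '" + index + "'");
            }
        }
    }
}
```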

I think for things like workflows, people should really use roles and not have any of the resources created by a workflow tied to a single user. But this would also mean that all resources created by AI Workflows would need to support role-based access control. Does this make sense? To require a minimum level of access control mechanism on all resources? I think this would also simplify rollback.

I don't think a majority of users use security, so I don't think requiring this will work.

Ideally, you want all the resources (in scope for AI Workflows) taggable so it is easy for both the plugin or some human operator to easily identify things to clean up.

+1!

@dbwiddis (Member, Author)

Closing this proposal, as most of the features proposed have been implemented in support of the 2.12.0 release: users know what was provisioned and have an easy way to deprovision. I'll be creating a new proposal for more fine-grained deprovisioning/updating, which depends on more update APIs becoming available, targeting 2.13.0.
