[PROPOSAL] Uninstall / Rollback / Overwrite capability for resources #32

Closed
dbwiddis opened this issue Sep 13, 2023 · 3 comments · Fixed by #271
Labels: discuss, enhancement (New feature or request), v2.12.0

Comments

@dbwiddis (Member)

What/Why

What are you proposing?

During execution of a workflow, this framework will create resources on the OpenSearch cluster. Users need the ability to quickly "clean up" these resources for multiple scenarios:

  • After experimenting/prototyping with different approaches or models
  • On partial failure of a workflow

Such resources can include (this is a partial list):

  • Indices
  • ML Models
  • Search and ingest pipelines
  • Workflow templates and sub-templates

We need a capability to:

  1. Track what resources have been created in a workflow execution (a minimal sketch of such tracking follows this list)
  2. Automate the removal of these resources (if possible)
  3. Provide users with a specific list of the resources created, to enable manual removal in case automated removal fails or they want more control over the process.
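
A minimal sketch, in Java, of what the tracking in item 1 could look like. The class and field names here are hypothetical illustrations, not the plugin's actual state schema:

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public final class ResourceTracker {

    /** One resource created by a workflow step, e.g. an index, model, or pipeline. */
    public record CreatedResource(String workflowStepId, String resourceType, String resourceId, Instant createdAt) {}

    private final List<CreatedResource> created = new ArrayList<>();

    /** Record a resource immediately after the step that created it succeeds. */
    public void track(String stepId, String type, String id) {
        created.add(new CreatedResource(stepId, type, id, Instant.now()));
    }

    /** The full list, newest first: the usual order for automated rollback or manual removal. */
    public List<CreatedResource> inRollbackOrder() {
        List<CreatedResource> reversed = new ArrayList<>(created);
        Collections.reverse(reversed);
        return List.copyOf(reversed);
    }
}
```

Persisting something like this per workflow execution covers items 2 and 3 as well: the same list can drive automated removal and can be returned to the user for manual cleanup.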

What problems are you trying to solve?

  • When a user no longer needs the resources supporting a workflow, we should free up the storage allocated to them.
  • When a user iterates on a workflow design, we want to selectively replace only the changed portions.
  • When a user experiences an exception executing a workflow, we want them to have the ability to roll back the cluster to its state prior to the failed execution.

What is the developer experience going to be?

  • When executing a workflow, the developer should have options to automatically roll back on failure.
  • When no longer using a workflow, the developer should have an API to "uninstall" it completely (see the sketch after this list).
  • The developer should have access to a complete list of resources created to support a workflow so they can roll back manually.
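
For illustration, a minimal sketch of that "uninstall" sequence from a client, using only the JDK HTTP client. The endpoint paths mirror the deprovision capability referenced in the closing comment below, but the exact paths, host, port, and workflow id are placeholder assumptions, not a spec:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class UninstallWorkflow {
    public static void main(String[] args) throws Exception {
        String base = "http://localhost:9200/_plugins/_flow_framework/workflow/"; // placeholder host/port
        String workflowId = "my-workflow-id";                                      // placeholder id
        HttpClient client = HttpClient.newHttpClient();

        // Step 1: remove the resources the workflow provisioned (indices, models, pipelines, ...).
        HttpRequest deprovision = HttpRequest.newBuilder(URI.create(base + workflowId + "/_deprovision"))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
        System.out.println(client.send(deprovision, HttpResponse.BodyHandlers.ofString()).body());

        // Step 2: delete the stored workflow template itself.
        HttpRequest delete = HttpRequest.newBuilder(URI.create(base + workflowId))
            .DELETE()
            .build();
        System.out.println(client.send(delete, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```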

Are there any security considerations?

  • Workflows will require user identity to access some data associated with their execution. While automated rollback on failure would still have the user identity, a later "uninstall" may not have permission to remove all resources.

What is the user experience going to be?

Seeking community feedback! How do you want this feature to be used?

Why should it be built? Any reason not to?

If this feature is not built, developers will need to expend time and effort identifying resources to remove themselves, and/or pay for more storage to maintain unused data.

What will it take to execute?

In the near term we should keep track of resources created in order to enable manual removal, regardless of any more automatic behavior.

We should research best practices for installation scripts, as this is not a new problem, just a new problem space.

Actual execution will require a detailed design that will be created based on feedback to this proposal.

Any remaining open questions?

  • Could such a feature be useful as an OpenSearch core feature? For example, if we had a global parameter (similar to the "pretty" and "verbose" parameters on URIs) for a "resource group", we could apply that to every request, and that capability would be useful even outside of an automation framework.
  • Any "automated deletion" functionality carries significant risk of inadvertent data loss. Can/should we integrate this functionality with snapshots / other backup, to enable recovery of an accidental deletion?
@austintlee

One of the things that really frustrated me when working with models in ml-commons is partial failure. When you deploy a model (in this case, a pre-trained text embedding model), deployment can "succeed" partially, meaning it succeeded on some of the ML-eligible nodes but not on all of them. I can still go ahead and use the model, but I can see this being problematic if you set up N nodes to handle a certain amount of workload and fewer than N nodes are actually available, even though the workflow reported success on all tasks.

I suspect a lot of users will not want to pay for GPU instances just for neural search and text embeddings, so when data nodes are also assigned to run ML workloads, partial failure not being surfaced appropriately can cause performance issues. At the moment, I think such partial failure is not reported as a failure at the workflow level? I had a quick look at the code and it looks like it only accepts a task status of SUCCESS, so maybe SUCCESS_WITH_ERROR from model deployment does fail the workflow execution. Maybe the default is to fail, but people can override it with an option to allow partial success at the task level.
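
A minimal sketch of that last suggestion: fail the workflow on partial model deployment by default, with a per-task opt-in for partial success. The status strings come from the comment above, and the opt-in flag is hypothetical rather than an actual ml-commons or Flow Framework identifier:

```java
public final class TaskStatusPolicy {

    /** Decide whether a task's reported status counts as success for the workflow. */
    public static boolean isTaskSuccessful(String taskStatus, boolean allowPartialSuccess) {
        return switch (taskStatus) {
            case "SUCCESS" -> true;
            // Model deployed on some, but not all, ML-eligible nodes.
            case "SUCCESS_WITH_ERROR" -> allowPartialSuccess;
            default -> false;
        };
    }

    public static void main(String[] args) {
        // Default: partial deployment fails the workflow step.
        System.out.println(isTaskSuccessful("SUCCESS_WITH_ERROR", false)); // false
        // Opt-in (hypothetical per-task option): partial deployment is accepted.
        System.out.println(isTaskSuccessful("SUCCESS_WITH_ERROR", true));  // true
    }
}
```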

Workflows will require user identity to access some data associated with their execution.

I don't know if you really mean that the AI Workflow plugin will only work with security enabled?

At the moment, the conversational memory feature we put in ml-commons does not support role-based access control. I think for things like workflows, people should really use roles and not have any of the resources created by a workflow tied to a single user. But this would also mean that all resources created by AI Workflows would need to support role-based access control. Does this make sense? To require a minimum level of access control mechanism on all resources? I think this would also simplify rollback.

Ideally, you want all the resources (in scope for AI Workflows) taggable so it is easy for both the plugin or some human operator to easily identify things to clean up. Is materializing the DAG of a workflow idempotent? Tags and idempotency should help to prevent creation of unnecessary resources and clean-up tasks.
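
A minimal sketch of the tag-plus-idempotency idea, using a plain in-memory set as a stand-in for the cluster and hypothetical helper names: every resource name carries a workflow tag, provisioning skips anything that already exists, and cleanup keys off the tag.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

public final class TaggedProvisioner {

    /** Deterministic, tag-bearing name, so re-running the same workflow resolves to the same resource. */
    static String taggedName(String workflowTag, String resourceName) {
        return workflowTag + "-" + resourceName;
    }

    /** Create the resource only if it does not already exist (idempotent provisioning). */
    static boolean provisionIfAbsent(Set<String> existingResources, String workflowTag, String resourceName) {
        return existingResources.add(taggedName(workflowTag, resourceName)); // false if already provisioned
    }

    /** Everything the plugin or a human operator should clean up for one workflow. */
    static Set<String> resourcesToCleanUp(Set<String> existingResources, String workflowTag) {
        return existingResources.stream()
                .filter(name -> name.startsWith(workflowTag + "-"))
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Set<String> cluster = new HashSet<>();
        provisionIfAbsent(cluster, "wf-demo", "knn-index");       // created
        provisionIfAbsent(cluster, "wf-demo", "knn-index");       // no-op on re-run
        provisionIfAbsent(cluster, "wf-demo", "ingest-pipeline"); // created
        System.out.println(resourcesToCleanUp(cluster, "wf-demo")); // both wf-demo resources
    }
}
```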

@dbwiddis (Member, Author)

I had a quick look at the code and it looks like it only accepts task status of SUCCESS so maybe SUCCESS_WITH_ERROR from model deployment does fail the workflow execution.

Great suggestion.

I don't know if you really mean that the AI Workflow plugin will only work with security enabled?

More that if security is enabled and there are access controls on the data, then the user can't get around those controls by using a workflow; the workflow would run with the user's permissions. The idea here is that we know up front which index they'd be trying to access, so we should fast-fail if we know in advance that the user won't have access to one of the intermediate steps.
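
A minimal sketch of that fast-fail check, with a hypothetical AccessChecker interface standing in for whatever the security layer actually exposes (hasIndexAccess is not a real API):

```java
import java.util.List;

public final class PreflightAccessCheck {

    interface AccessChecker {
        boolean hasIndexAccess(String user, String index); // hypothetical stand-in
    }

    /** Throws before any resource is created if the user lacks access to a required index. */
    static void validate(AccessChecker checker, String user, List<String> requiredIndices) {
        for (String index : requiredIndices) {
            if (!checker.hasIndexAccess(user, index)) {
                throw new IllegalStateException(
                    "Fast-fail: user '" + user + "' cannot access required index '" + index + "'");
            }
        }
    }
}
```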

I think for things like workflows, people should really use roles and not have any of the resources created by a workflow tied to a single user. But this would also mean that all resources created by AI Workflows would need to support role-based access control. Does this make sense? To require a minimum level of access control mechanism on all resources? I think this would also simplify rollback.

I don't think a majority of users use security, so I don't think requiring this will work.

Ideally, you want all the resources (in scope for AI Workflows) taggable so it is easy for both the plugin or some human operator to easily identify things to clean up.

+1!

@dbwiddis (Member, Author)

Closing this proposal, as most of the features proposed have been implemented in support of the 2.12.0 release: users know what was provisioned and have an easy way to deprovision. I'll be creating a new proposal for more fine-grained deprovisioning/updating, which depends on more update APIs becoming available, targeting 2.13.0.
