[Proposal] Publish an event when long running operation complete #5479

Closed
xluo-aws opened this issue Dec 7, 2022 · 9 comments
Labels
discuss Issues intended to help drive brainstorming and decision making enhancement Enhancement or improvement to existing feature or request extensions

Comments

@xluo-aws
Member

xluo-aws commented Dec 7, 2022

Is your feature request related to a problem? Please describe.
Some index operations, such as reindex, split, and shrink, can take tens of minutes or even hours. We want to notify users (through notifications configured in the ISM dashboard plugin) when these operations complete, regardless of whether the operation was submitted from the ISM dashboard plugin or from the command line.

Describe the solution you'd like
We brainstormed a few options. The preferred one is to enhance the OpenSearch core logic to publish an event when an operation completes; a listener in the ISM plugin can then listen for the event and send the notification. We checked the existing events that plugins can listen to: ClusterChangedEvent is one (we are not sure whether there are others) that is published when an index is split or shrunk. However, this event lacks information required for the notification, for example who submitted the operation request. Another drawback is that reindex does not trigger a ClusterChangedEvent, so this is not a general solution.
Another possible solution is to publish a new event when a long-running operation is triggered. The listener in the plugin would create a scheduled job that checks the operation status every x minutes and sends the notification once the operation completes. The extension point could be to extend RestToXContentListener for RestResizeHandler and RestReindexAction so that it publishes an event, or to extend TransportResizeAction/TransportReindexAction to publish the event. This is similar to the second alternative below but has less impact, because it affects only a few long-running operations.
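
The "publish at submit time, then poll" flow can be sketched as follows. This is a minimal Python illustration, not OpenSearch code: `OperationStarted`, `poll_until_complete`, `check_status`, and `notify` are hypothetical names, and a real ISM listener would presumably schedule the checks through its job scheduler instead of sleeping in a loop.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class OperationStarted:
    """Hypothetical event published when a long-running operation is submitted."""
    action: str        # e.g. "shrink", "split", "reindex"
    target_index: str
    submitted_by: str  # the submitter, which the notification needs


def poll_until_complete(
    event: OperationStarted,
    check_status: Callable[[OperationStarted], bool],
    notify: Callable[[str], None],
    interval_seconds: float = 60.0,
    max_checks: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Check the operation at a fixed interval and notify once when it is done.

    Returns True if a notification was sent, False if max_checks ran out first.
    """
    for _ in range(max_checks):
        if check_status(event):
            notify(
                f"{event.action} on {event.target_index} completed "
                f"(submitted by {event.submitted_by})"
            )
            return True
        sleep(interval_seconds)
    return False
```

Passing `sleep` and `check_status` as parameters keeps the loop testable without a cluster; both would be replaced by real scheduling and a real status lookup.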

Describe alternatives you've considered
1 Create a wrapper API in the ISM plugin. It calls the existing index operation API first, then creates a scheduled job that checks the operation status every x minutes and sends the notification once the operation completes. This requires users to switch to the new wrapper API.
2 Create an ActionFilter in the ISM plugin that inspects all requests and creates a scheduled job when a request is a long-running operation. The major concern is the performance impact. However, ISM already has an ActionFilter that intercepts all requests; we assume it has already passed performance review, so this is not an entirely new performance risk. We can run performance tests if this becomes a candidate solution.
3 For reindex, use IndexingOperationListener to monitor the .tasks index: reindex writes to this index upon completion, and we can then send the notification. For shrink and split, use cluster state change events (ClusterChangedEvent) to find out which index was created and whether it was created by a resize. If so, we compare its shard count with the source index's shard count to determine whether it is a split or a shrink, then wait for the active shards to be ready (the same logic used to tell that a create-index operation is done) and send the notification. All code changes are in the ISM plugin.
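
The split-versus-shrink check in alternative 3 reduces to comparing primary shard counts. A minimal Python sketch for illustration only: `classify_resize` is a hypothetical helper, and treating an equal shard count as a clone is an assumption that the proposal does not spell out.

```python
def classify_resize(source_shards: int, target_shards: int) -> str:
    """Classify a resize by comparing the primary shard counts of source and target."""
    if source_shards < 1 or target_shards < 1:
        raise ValueError("shard counts must be positive")
    if target_shards > source_shards:
        return "split"
    if target_shards < source_shards:
        return "shrink"
    # Assumption: a resize that keeps the shard count is a clone.
    return "clone"
```

For example, `classify_resize(2, 8)` yields `"split"` and `classify_resize(8, 2)` yields `"shrink"`. The wait for active shards would follow this step and is not shown.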

Additional context
Add any other context or screenshots about the feature request here.

@xluo-aws xluo-aws added enhancement Enhancement or improvement to existing feature or request untriaged labels Dec 7, 2022
@xluo-aws
Member Author

xluo-aws commented Dec 8, 2022

This ticket is related to opensearch-project/index-management-dashboards-plugin#284.

@dblock
Member

dblock commented Dec 9, 2022

I like a generic solution in which any extension can subscribe to events, and any action can publish an event which would propagate across the cluster when there's someone listening. Events should be durable/come with certain delivery guarantees as well.
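
One way to picture this generic model is a small publish/subscribe registry. The Python sketch below is illustrative only: `EventBus`, `subscribe`, `has_listeners`, and `publish` are hypothetical names, everything runs in a single process, and it does not attempt the cross-cluster propagation or the durable delivery guarantees mentioned above.

```python
from typing import Any, Callable, Dict, List

Handler = Callable[[Any], None]


class EventBus:
    """Hypothetical in-process registry mapping topics to subscriber callbacks."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = {}

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler for a topic."""
        self._subscribers.setdefault(topic, []).append(handler)

    def has_listeners(self, topic: str) -> bool:
        """Let a publisher skip building an event nobody is listening for."""
        return bool(self._subscribers.get(topic))

    def publish(self, topic: str, payload: Any) -> int:
        """Deliver the payload to every subscriber; return how many were called."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(payload)
        return len(handlers)
```

An action would call `has_listeners` before publishing, which matches the idea that events propagate only when someone is listening.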

@xuezhou25 xuezhou25 added the discuss Issues intended to help drive brainstorming and decision making label Dec 9, 2022
@Hailong-am

> For reindex, we can create scheduled job in ISM plugin to monitor .task index, if it's a reindex task, we'll create another scheduled job to check task status every x minutes and send out notification once it's completed.

We don't need to create a monitor job: the task result is written to the .tasks index at the moment the task completes. In that case, all we need to do is parse the task result to see whether it contains any errors or failures, and then send the notification accordingly.
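
Parsing the stored result might look like the Python sketch below. It assumes the document has the shape the tasks API returns for a finished task (a `completed` flag, a `task` object with an `action`, and either an `error` object or a `response` that may list `failures`). `summarize_task_result` is a hypothetical helper, so the field names should be verified against the OpenSearch version in use.

```python
from typing import Any, Dict


def summarize_task_result(doc: Dict[str, Any]) -> Dict[str, str]:
    """Reduce a stored task document to an action, a status, and a short detail."""
    action = (doc.get("task") or {}).get("action", "unknown")
    if not doc.get("completed", False):
        return {"action": action, "status": "running", "detail": ""}

    # Assumed shape: a task that threw carries a top-level "error" object.
    error = doc.get("error")
    if error:
        reason = error.get("reason", "") if isinstance(error, dict) else str(error)
        return {"action": action, "status": "failed", "detail": reason}

    # Assumed shape: a finished reindex lists per-document problems in "failures".
    failures = (doc.get("response") or {}).get("failures") or []
    if failures:
        return {"action": action, "status": "failed", "detail": f"{len(failures)} failure(s)"}
    return {"action": action, "status": "succeeded", "detail": ""}
```

The returned status can then drive which notification, success or failure, is sent.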

@xluo-aws
Member Author

> > For reindex, we can create scheduled job in ISM plugin to monitor .task index, if it's a reindex task, we'll create another scheduled job to check task status every x minutes and send out notification once it's completed.
>
> We don't need to create a monitor job: the task result is written to the .tasks index at the moment the task completes. In that case, all we need to do is parse the task result to see whether it contains any errors or failures, and then send the notification accordingly.

Thanks for pointing it out, I just updated the description.

@nknize
Collaborator

nknize commented Jan 18, 2023

@xluo-aws have you looked into ResourceWatcher and what might be missing to achieve the objective? A ResourceWatcher can be registered through ResourceWatcherService#add which will notify the registered Watcher instance through AbstractResourceWatcher#checkAndNotify at a given Frequency interval (which can be user defined).

@xluo-aws
Member Author

Nick, thanks for the suggestion. We were not aware of ResourceWatcher until now, but after a quick look at the code it seems we can leverage it to keep checking the operation status until the operation completes. This makes publishing an event at index operation submit time more convenient. We will do more research and provide an update.

@gaobinlong
Collaborator

We have a new idea for this issue. Similar to reindex, we can first make the other long-running operations, such as shrink/split/clone, trackable through the _tasks API. Then we can monitor the .tasks index and, when a long-running operation completes or fails, send a notification to the user. We think this may be a generic solution, and it has other benefits as well. I've created another issue about making some long-running operations trackable through the _tasks API; @dblock @nknize, could you please take a look at it: #6228?

@Hailong-am

> We have a new idea for this issue. Similar to reindex, we can first make the other long-running operations, such as shrink/split/clone, trackable through the _tasks API. Then we can monitor the .tasks index and, when a long-running operation completes or fails, send a notification to the user. We think this may be a generic solution, and it has other benefits as well. I've created another issue about making some long-running operations trackable through the _tasks API; @dblock @nknize, could you please take a look at it: #6228?

Based on this assumption, we could have an IndexingOperationListener watch the .tasks index. When a new document is written to this index, which means a task has completed, we can parse the document, extract the action, and decide whether a notification is needed for that action.

An IndexingOperationListener is a lightweight and clean solution compared with using the JobScheduler plugin or ResourceWatcher to keep monitoring the status of a long-running operation. There are also some limitations: the task result is persisted to the .tasks index only when the task completes, and until then the task information is held in memory. If a node restarts, that information is lost and there is no way to track the task's execution any further.
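
The listener idea can be sketched as a callback that reacts only to writes to `.tasks` and only for actions worth a notification. This is a Python illustration, not the OpenSearch `IndexingOperationListener` interface: `TaskIndexListener`, `post_index`, and the action names in `NOTIFY_ACTIONS` are assumptions made for the example.

```python
from typing import Any, Callable, Dict

TASKS_INDEX = ".tasks"
# Assumed action names for the operations this thread cares about.
NOTIFY_ACTIONS = {"indices:data/write/reindex", "indices:admin/resize"}


class TaskIndexListener:
    """Hypothetical listener invoked after a document is indexed."""

    def __init__(self, notify: Callable[[str], None]) -> None:
        self._notify = notify

    def post_index(self, index_name: str, doc: Dict[str, Any]) -> bool:
        """Notify if the write is a completed task of interest; return whether it did."""
        if index_name != TASKS_INDEX:
            return False  # ignore writes to every other index
        action = (doc.get("task") or {}).get("action")
        if action not in NOTIFY_ACTIONS:
            return False  # a task we do not notify about
        self._notify(f"Task {action} has completed")
        return True
```

Because the callback runs on every index write, the early return on the index name keeps the cost for unrelated writes to a single string comparison.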

@xluo-aws
Member Author

Closing this one. Our final solution is to convert long-running operations into tasks, so we can check the task status to find out whether a long-running operation has completed. The change will be made in release 2.7. The tickets are:
opensearch-project/index-management-dashboards-plugin#615 and opensearch-project/index-management-dashboards-plugin#624


7 participants