Reindex resiliency #42612
Labels
:Distributed Indexing/Reindex
Issues relating to reindex that are not caused by issues further down
Meta
Team:Distributed (Obsolete)
Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
Comments
henningandersen
added
WIP
Meta
:Distributed Indexing/Reindex
Issues relating to reindex that are not caused by issues further down
labels
May 28, 2019
Pinging @elastic/es-distributed
henningandersen
added a commit
to henningandersen/elasticsearch
that referenced
this issue
Jul 2, 2019
Refactor ScrollableHitSource to pump data out and have a simplified interface (callers should no longer call startNextScroll, instead they simply mark that they are done with the previous result, triggering a new batch of data). This eases making reindex resilient, since we will sometimes need to rerun search during retries. Relates elastic#43187 and elastic#42612
I made some updates to the meta issues under coordinator node.
henningandersen
added a commit
that referenced
this issue
Jul 9, 2019
Refactor ScrollableHitSource to pump data out and have a simplified interface (callers should no longer call startNextScroll, instead they simply mark that they are done with the previous result, triggering a new batch of data). This eases making reindex resilient, since we will sometimes need to rerun search during retries. Relates #43187 and #42612
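The "pump" interface described in this commit message can be sketched roughly as follows. This is a minimal illustration and not the actual `ScrollableHitSource` code: the class `PumpingHitSource`, the `done` callback, and the helper `collect_all` are hypothetical names chosen for the example.

```python
from typing import Callable, Iterable, Iterator, List

# Consumer signature: receives a batch plus a "done" callback.
BatchConsumer = Callable[[List[str], Callable[[], None]], None]


class PumpingHitSource:
    """Hypothetical hit source that pushes batches to a consumer."""

    def __init__(self, batches: Iterator[List[str]], on_batch: BatchConsumer) -> None:
        self._batches = batches
        self._on_batch = on_batch

    def start(self) -> None:
        self._pump()

    def _pump(self) -> None:
        batch = next(self._batches, None)
        if batch is None:
            return  # source exhausted, nothing more to deliver

        called = False

        def done() -> None:
            # Marking the previous result as done is what triggers the next batch.
            nonlocal called
            if called:
                raise RuntimeError("done() called twice for the same batch")
            called = True
            self._pump()

        self._on_batch(batch, done)


def collect_all(batches: Iterable[List[str]]) -> List[str]:
    """Drive the source to completion and return every hit it delivered."""
    seen: List[str] = []

    def consumer(batch: List[str], done: Callable[[], None]) -> None:
        seen.extend(batch)
        done()  # the caller never asks for the next scroll explicitly

    PumpingHitSource(iter(batches), consumer).start()
    return seen
```

Because the source owns the decision to fetch the next batch, a retry that has to rerun the search can happen inside `_pump` without the consumer noticing. In this synchronous sketch each `done()` call nests one level deeper, so it is only suitable for a small number of batches.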
henningandersen
added a commit
to henningandersen/elasticsearch
that referenced
this issue
Jul 11, 2019
The client and remote hit sources each had their own retry mechanism, and the two did the same thing. Supporting resiliency requires expanding the retry mechanisms, so in preparation the retry mechanism is now shared: each subclass is only responsible for sending requests and converting responses/failures to a common format. Part of elastic#42612
Tim-Brooks
added a commit
that referenced
this issue
Jul 18, 2019
This is related to #42612. Currently the reindexing transport action creates a task on the local coordinator node. Unfortunately this is not resilient to coordinator node failures. This commit adds a new action that creates a reindexing job as a persistent task.
Tim-Brooks
added a commit
to Tim-Brooks/elasticsearch
that referenced
this issue
Aug 5, 2019
TransportStartReindexJobAction is currently a master action. Reindexing only needs access to the cluster state to perform some validations. Prior to persistent reindexing we used a normal data node to perform these validations. There is no reason that these validations need a perfectly up-to-date view of the cluster state. This commit changes the action to be a normal transport action. Relates to elastic#42612.
henningandersen
added a commit
that referenced
this issue
Aug 6, 2019
The client and remote hit sources each had their own retry mechanism, and the two did the same thing. Supporting resiliency requires expanding the retry mechanisms, so in preparation the retry mechanism is now shared: each subclass is only responsible for sending requests and converting responses/failures to a common format. Part of #42612
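A rough sketch of that split, with the retry loop in a shared base class and the subclass limited to sending and converting. All names here (`HitSource`, `RetryableFailure`, `FlakyLocalSource`, `max_retries`) are hypothetical, and backoff between attempts is left out to keep the example short.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class RetryableFailure(Exception):
    """Common failure format that subclasses convert their errors into."""


class HitSource(ABC):
    """Shared retry logic; subclasses only send requests and convert results."""

    def __init__(self, max_retries: int = 3) -> None:
        self.max_retries = max_retries

    @abstractmethod
    def send(self, request: Dict[str, Any]) -> Any:
        """Send one request; raise RetryableFailure on a retryable error."""

    @abstractmethod
    def convert(self, raw: Any) -> List[str]:
        """Convert a transport-specific response into the common format."""

    def search(self, request: Dict[str, Any]) -> List[str]:
        attempt = 0
        while True:
            try:
                return self.convert(self.send(request))
            except RetryableFailure:
                attempt += 1
                if attempt > self.max_retries:
                    raise  # retries exhausted, surface the failure


class FlakyLocalSource(HitSource):
    """Test double that fails a fixed number of times before succeeding."""

    def __init__(self, failures: int, max_retries: int = 3) -> None:
        super().__init__(max_retries)
        self.failures = failures
        self.calls = 0

    def send(self, request: Dict[str, Any]) -> Any:
        self.calls += 1
        if self.calls <= self.failures:
            raise RetryableFailure("simulated node failure")
        return {"hits": ["doc-1"]}

    def convert(self, raw: Any) -> List[str]:
        return list(raw["hits"])
```

With this shape, a remote source would differ from the local one only in how `send` and `convert` are implemented.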
Tim-Brooks
added a commit
that referenced
this issue
Aug 6, 2019
TransportStartReindexJobAction is currently a master action. Reindexing only needs access to the cluster state to perform some validations. Prior to persistent reindexing we used a normal data node to perform these validations. There is no reason that these validations need a perfectly up-to-date view of the cluster state. This commit changes the action to be a normal transport action. Relates to #42612.
henningandersen
added a commit
to henningandersen/elasticsearch
that referenced
this issue
Aug 8, 2019
The client and remote hit sources each had their own retry mechanism, and the two did the same thing. Supporting resiliency requires expanding the retry mechanisms, so in preparation the retry mechanism is now shared: each subclass is only responsible for sending requests and converting responses/failures to a common format. Part of elastic#42612
henningandersen
added a commit
that referenced
this issue
Aug 8, 2019
The client and remote hit sources each had their own retry mechanism, and the two did the same thing. Supporting resiliency requires expanding the retry mechanisms, so in preparation the retry mechanism is now shared: each subclass is only responsible for sending requests and converting responses/failures to a common format. Part of #42612
henningandersen
added a commit
to henningandersen/elasticsearch
that referenced
this issue
Aug 13, 2019
Local reindex can now survive losing data nodes that contain source data. The original query will be restarted with a filter for `_seq_no >= last_seq_no` when a failure is detected. Part of elastic#42612 and split out from elastic#43187
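As an illustration of the restart filter, the search body for the resumed query might be built along these lines. This is a sketch under assumptions: the helper name `restart_query` is hypothetical, and sorting ascending by `_seq_no` is assumed here so that the last sequence number seen is a usable resume point.

```python
from typing import Any, Dict


def restart_query(original_query: Dict[str, Any], last_seq_no: int) -> Dict[str, Any]:
    """Wrap the original query so the search resumes at last_seq_no."""
    return {
        "query": {
            "bool": {
                # Keep the user's query semantics unchanged.
                "must": [original_query],
                # Skip everything already processed before the failure.
                "filter": [{"range": {"_seq_no": {"gte": last_seq_no}}}],
            }
        },
        # Assumption: results are ordered by sequence number.
        "sort": [{"_seq_no": "asc"}],
    }
```

Because the bound is inclusive, the document at `last_seq_no` can be returned again after a restart, so the write side has to tolerate seeing it twice.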
Tim-Brooks
added a commit
that referenced
this issue
Aug 14, 2019
Currently the result of a reindex persistent task is propagated and stored in the cluster state. This commit changes this so that only the ephemeral task-id, headers, and reindex state are stored in the cluster state. Any result (exception or response) is stored in the reindex index. Relates to #42612.
Tim-Brooks
added a commit
that referenced
this issue
Nov 20, 2019
This is related to #42612. This commit deletes TODOs related to test assertions changing due to exception serialization changes. We have determined that the exception x-content serialization was never stable in the first place (reading from the task index), so we are okay with these changes.
henningandersen
added a commit
to henningandersen/elasticsearch
that referenced
this issue
Nov 22, 2019
Renamed types and action names to fit that we now call it a reindex task and not a job. Removed action and named writeable todos. Relates elastic#42612
henningandersen
added a commit
to henningandersen/rally-tracks
that referenced
this issue
Nov 25, 2019
Add reindex operation to http_logs track in order to verify reindex performance before and after resilient reindex implementation as well as keep an eye on it for the future. Relates elastic/elasticsearch#42612
henningandersen
added a commit
that referenced
this issue
Nov 26, 2019
Ensure that reindex works in a mixed cluster state during rolling upgrade by not doing resilient reindex until all nodes are on the new version. Relates #42612
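The rolling-upgrade gate described here amounts to a minimum-version check over the nodes in the cluster. A minimal sketch, assuming versions are represented as `(major, minor, patch)` tuples; the function name and the version numbers used in examples are hypothetical, not taken from the commit.

```python
from typing import Iterable, List, Tuple

Version = Tuple[int, int, int]


def use_resilient_reindex(node_versions: Iterable[Version], resilient_since: Version) -> bool:
    """Return True only when every node is on a version that supports resilient reindex."""
    versions: List[Version] = list(node_versions)
    if not versions:
        return False  # no known nodes, stay on the old code path
    # Tuples compare element by element, so min() yields the oldest node version.
    return min(versions) >= resilient_since
```

During a rolling upgrade the oldest node keeps the check false, so reindex requests continue to use the non-resilient path until the last node has been upgraded.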
henningandersen
added a commit
that referenced
this issue
Nov 27, 2019
Renamed types and action names to fit that we now call it a reindex task and not a job. Removed action and named writeable todos. Relates #42612
Tim-Brooks
added a commit
that referenced
this issue
Nov 27, 2019
This is related to #42612. It adds a setting to configure what headers are stored by the persistent reindexing task for further requests. Additionally, it has the x-pack security module automatically configure this setting to ensure security works with reindexing.
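Conceptually, the setting acts as an allow-list applied to the request headers before they are stored with the task. A minimal sketch of that filtering step; the function name `headers_to_store` is hypothetical, and the case-insensitive matching is a design choice made for this example, not something the commit message states.

```python
from typing import Dict, Iterable, Mapping


def headers_to_store(request_headers: Mapping[str, str], allowed: Iterable[str]) -> Dict[str, str]:
    """Keep only the headers named in the allow-list (compared case-insensitively)."""
    allowed_lower = {name.lower() for name in allowed}
    return {
        name: value
        for name, value in request_headers.items()
        if name.lower() in allowed_lower
    }
```

A security module would then contribute the header names it needs to the allow-list, so that follow-up requests made by the task carry the same identity as the original request.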
henningandersen
added a commit
to henningandersen/elasticsearch
that referenced
this issue
Dec 5, 2019
Resolve indices before starting to reindex. This ensures that the list of indices does not change when failing over (TBD). The one exception to this is aliases, which we still need to access through the alias. In addition, resolved index patterns are sorted by create-date and otherwise the listed order is preserved. This ensures that once we reindex one index at a time, we will get reasonable time locality for time based indices. The resolved list of indices will also be used to do searching one index (or index group) at a time, improving search performance (since we use sort) and allowing us to do more fine-grained checkpointing and progress tracking (TBD). Relates elastic#42612
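The ordering rule (patterns expand in creation-date order, everything else keeps its listed position) can be illustrated with a small sketch. The function `resolve_indices`, the `creation_dates` mapping, and the use of `fnmatch` for wildcard matching are assumptions made for this example; real index resolution handles more cases than this.

```python
import fnmatch
from typing import Dict, List


def resolve_indices(expressions: List[str], creation_dates: Dict[str, int]) -> List[str]:
    """Expand wildcard expressions by creation date; keep other names in listed order."""
    resolved: List[str] = []
    for expression in expressions:
        if "*" in expression:
            matches = [name for name in creation_dates if fnmatch.fnmatchcase(name, expression)]
            # Oldest index first; the name breaks ties so the result is deterministic.
            matches.sort(key=lambda name: (creation_dates[name], name))
        else:
            # Concrete index name or alias: kept exactly as written.
            matches = [expression]
        for name in matches:
            if name not in resolved:
                resolved.append(name)
    return resolved
```

Resolving once up front means a failover reuses the same list instead of re-expanding patterns against a cluster state that may have changed in the meantime.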
henningandersen
added a commit
to elastic/rally-tracks
that referenced
this issue
Jan 9, 2020
Add new challenge with reindex operation to http_logs track in order to verify reindex performance before and after resilient reindex implementation. Relates elastic/elasticsearch#42612
henningandersen
added a commit
to henningandersen/elasticsearch
that referenced
this issue
Jan 10, 2020
This adds support for rethrottling resilient reindex through updating the persistent task, ensuring that rethrottle sticks on failovers. Related to elastic#42612
henningandersen
added a commit
to henningandersen/elasticsearch
that referenced
this issue
Jan 29, 2020
This adds support for rethrottling resilient/persistent reindex through updating the .reindex index and notifying the task. This ensures that the new throttle value sticks on failovers while also ensuring that the task wakes up immediately if it had a very low throttle value. Related to elastic#42612
This was referenced Jan 29, 2020
This was referenced Feb 3, 2020
henningandersen
added a commit
that referenced
this issue
Feb 13, 2020
Reindex rethrottle persistent task. This adds support for rethrottling resilient/persistent reindex through updating the .reindex index and notifying the task. This ensures that the new throttle value sticks on failovers while also ensuring that the task wakes up immediately if it had a very low throttle value. Related to #42612
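The "wakes up immediately" part can be sketched with a condition variable: the task sleeps for a delay derived from the current throttle, and a rethrottle notification makes it recompute that delay. This is an illustration only. The class `WakeableThrottle` and the function `demo` are hypothetical, and persisting the new value to the `.reindex` index is not shown.

```python
import threading
import time


class WakeableThrottle:
    """Hypothetical throttle whose waiters re-evaluate their delay on rethrottle."""

    def __init__(self, requests_per_second: float) -> None:
        if requests_per_second <= 0:
            raise ValueError("requests_per_second must be positive")
        self._cond = threading.Condition()
        self._rps = requests_per_second

    def rethrottle(self, requests_per_second: float) -> None:
        if requests_per_second <= 0:
            raise ValueError("requests_per_second must be positive")
        with self._cond:
            self._rps = requests_per_second
            self._cond.notify_all()  # wake any task sleeping on the old value

    def wait_for_batch(self, batch_size: int) -> float:
        """Block for batch_size / rps seconds and return the time actually waited."""
        start = time.monotonic()
        with self._cond:
            while True:
                remaining = start + batch_size / self._rps - time.monotonic()
                if remaining <= 0:
                    return time.monotonic() - start
                # Returns early when rethrottle() notifies; the loop then recomputes.
                self._cond.wait(timeout=remaining)


def demo() -> float:
    """Start with a 100 s delay for one document, then raise the throttle after 50 ms."""
    throttle = WakeableThrottle(requests_per_second=0.01)
    timer = threading.Timer(0.05, throttle.rethrottle, args=(1000.0,))
    timer.start()
    elapsed = throttle.wait_for_batch(batch_size=1)
    timer.join()
    return elapsed
```

Without the notification, a task that had been given a very low throttle would keep sleeping on the old delay even though the stored value had already changed.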
henningandersen
added a commit
to henningandersen/elasticsearch
that referenced
this issue
Feb 25, 2020
Added xcontent serialization tests for ReindexTaskStateDoc. Related to elastic#42612. Depends on elastic#49278 (todo).
henningandersen
added a commit
that referenced
this issue
Mar 3, 2020
Reindex now uses last known good status on failure to ensure that counts continue from where they left off when a node fails. Also added `persistentTaskId` to `StartReindexResponse` and removed a couple of obsolete todos. Relates #42612
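One way to picture "continue from where they left off" is to add the counts produced after a restart onto the last checkpointed status. The sketch below assumes a simplified status with three counters; the `Status` class and its field names are hypothetical and do not mirror the real reindex status object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Status:
    """Simplified, hypothetical reindex status with a few counters."""

    created: int = 0
    updated: int = 0
    batches: int = 0

    def plus(self, other: "Status") -> "Status":
        # Field-wise sum: the checkpoint carries history, `other` carries new work.
        return Status(
            created=self.created + other.created,
            updated=self.updated + other.updated,
            batches=self.batches + other.batches,
        )


def resume_status(last_known_good: Status, since_restart: Status) -> Status:
    """Report totals that include work done before the node failure."""
    return last_known_good.plus(since_restart)
```

Starting the restarted task from a zeroed status would instead make the reported totals drop after a failover, which is what using the last known good status avoids.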
rjernst
added
the
Team:Distributed (Obsolete)
Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
label
May 4, 2020
henningandersen
added a commit
that referenced
this issue
Sep 1, 2022
We want to make reindex resilient to node restarts and failures, such that reindex can continue to run across such events.
There are two primary problems to solve:
Search resiliency
Coordinator node resiliency:
indices:data/write/start_reindex
indices:admin/reindex/start_reindex
cluster:admin/reindex/start_reindex
indices:data/reindex/start_reindex
Slicing:
Benchmarking:
Misc:
Docs