Reindex resiliency #42612

henningandersen · 2019-05-28T08:15:30Z

elasticmachine · 2019-05-28T08:15:31Z

Pinging @elastic/es-distributed

Refactor ScrollableHitSource to pump data out and have a simplified interface: callers should no longer call startNextScroll; instead, they simply mark that they are done with the previous result, which triggers a new batch of data. This makes it easier to make reindex resilient, since we will sometimes need to rerun the search during retries. Relates elastic#43187 and elastic#42612
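
A minimal Java sketch of the pump-style contract described above. The names `PumpingHitSource`, `Batch`, and `done()` are hypothetical and are not the actual Elasticsearch classes; the in-memory `pages` list stands in for scroll responses. The only point is that the consumer never asks for the next scroll, it just signals that it is finished with the current batch.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch; not the real ScrollableHitSource API. */
class PumpingHitSource {
    /** A batch of hits plus a callback the consumer invokes when it is finished with it. */
    static final class Batch {
        final List<String> hits;
        private final Runnable onDone;

        Batch(List<String> hits, Runnable onDone) {
            this.hits = hits;
            this.onDone = onDone;
        }

        /** Marks the batch as consumed, which makes the source pump the next one. */
        void done() {
            onDone.run();
        }
    }

    private final Iterator<List<String>> pages; // stand-in for scroll responses
    private final Consumer<Batch> onBatch;
    private final Runnable onFinished;

    PumpingHitSource(List<List<String>> pages, Consumer<Batch> onBatch, Runnable onFinished) {
        this.pages = pages.iterator();
        this.onBatch = onBatch;
        this.onFinished = onFinished;
    }

    void start() {
        pump();
    }

    // Pushes the next batch to the consumer, or signals completion when nothing is left.
    private void pump() {
        if (pages.hasNext()) {
            onBatch.accept(new Batch(pages.next(), this::pump));
        } else {
            onFinished.run();
        }
    }
}
```

Because the source owns the decision of how the next batch is produced, a retry that has to rerun the search can happen behind `done()` without the caller noticing. In this sketch `done()` is synchronous, so each batch adds a stack frame; a real implementation would presumably hand off asynchronously.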

Tim-Brooks · 2019-07-06T00:34:20Z

I made some updates to the meta issues under coordinator node.

The client and remote hit sources each had their own retry mechanism, and both did the same thing. Supporting resiliency requires expanding the retry mechanisms; in preparation for that, the retry mechanism is now shared, so that each subclass is only responsible for sending requests and converting responses/failures to a common format. Part of elastic#42612
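
The split of responsibilities described above might look roughly like the following Java sketch. `RetryingSource`, `send()`, and `isRetryable()` are hypothetical names chosen for illustration, and backoff between attempts is omitted to keep it short; this is not the code from the pull request.

```java
/** Hypothetical sketch of one retry loop shared by all hit-source subclasses. */
abstract class RetryingSource<R> {
    private final int maxRetries;

    RetryingSource(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /** Subclass responsibility: send one request over its own transport. */
    public abstract R send();

    /** Subclass responsibility: translate a transport-specific failure into a common decision. */
    public abstract boolean isRetryable(RuntimeException e);

    /** Shared retry loop: retries retryable failures up to maxRetries times. */
    public final R execute() {
        for (int attempt = 0; ; attempt++) {
            try {
                return send();
            } catch (RuntimeException e) {
                if (attempt >= maxRetries || !isRetryable(e)) {
                    throw e; // out of retries, or a failure that retrying cannot fix
                }
            }
        }
    }
}
```

With this shape, extending the retry policy later only touches `execute()`, while the client and remote variants stay limited to `send()` and failure classification.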

This is related to #42612. Currently the reindexing transport action creates a task on the local coordinator node. Unfortunately this is not resilient to coordinator node failures. This commit adds a new action that creates a reindexing job as a persistent task.

TransportStartReindexJobAction is currently a master action. Reindexing only needs access to the cluster state to perform some validations. Prior to persistent reindexing we used a normal data node to perform these validations. There is no reason that these validations need a perfectly up-to-date view of the cluster state. This commit changes the action to be a normal transport action. Relates to elastic#42612.

Local reindex can now survive losing data nodes that contain source data. The original query will be restarted with a filter for `_seq_no >= last_seq_no` when a failure is detected. Part of elastic#42612 and split out from elastic#43187
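
As an illustration of the restart filter, the sketch below wraps the original query in a `bool` query with a `range` filter on `_seq_no`. The class `ResumeQuery` and the method `resumeFrom` are hypothetical, and the query is built as a plain JSON string so that the example has no dependencies; the actual change presumably uses the query builders inside Elasticsearch.

```java
/** Hypothetical helper; builds the restart query as a JSON string. */
final class ResumeQuery {
    private ResumeQuery() {}

    /**
     * Wraps the original query so that only documents with
     * _seq_no >= lastSeqNo are returned when the search is restarted.
     */
    static String resumeFrom(String originalQueryJson, long lastSeqNo) {
        return "{\"bool\":{\"must\":" + originalQueryJson
            + ",\"filter\":{\"range\":{\"_seq_no\":{\"gte\":" + lastSeqNo + "}}}}}";
    }
}
```

Note that `gte` means the document at `last_seq_no` itself is returned again after a restart, matching the `>=` in the description above.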

Currently the result of a reindex persistent task is propagated and stored in the cluster state. This commit changes this so that only the ephemeral task-id, headers, and reindex state are stored in the cluster state. Any result (exception or response) is stored in the reindex index. Relates to #42612.

This is related to #42612. This commit deletes TODOs related to test assertions changing due to exception serialization changes. We have determined that the exception x-content serialization was never stable in the first place (reading from the task index), so we are okay with these changes.

Renamed types and action names to fit that we now call it a reindex task and not a job. Removed action and named writeable todos. Relates elastic#42612

Add reindex operation to http_logs track in order to verify reindex performance before and after resilient reindex implementation as well as keep an eye on it for the future. Relates elastic/elasticsearch#42612

Ensure that reindex works in a mixed cluster state during rolling upgrade by not doing resilient reindex until all nodes are on the new version. Relates #42612

This is related to #42612. It adds a setting to configure what headers are stored by the persistent reindexing task for further requests. Additionally, it has the x-pack security module automatically configure this setting to ensure security works with reindexing.

Resolve indices before starting to reindex. This ensures that the list of indices does not change when failing over (TBD). The one exception to this is aliases, which we still need to access through the alias. In addition, indices resolved from index patterns are sorted by creation date; otherwise the listed order is preserved. This ensures that once we reindex one index at a time, we will get reasonable time locality for time-based indices. The resolved list of indices will also be used to search one index (or index group) at a time, improving search performance (since we use sort) and allowing more fine-grained checkpointing and progress tracking (TBD). Relates elastic#42612
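
The ordering rule above can be sketched in Java as follows. Everything here is an assumption made for illustration: `IndexOrder.resolve` is a hypothetical name, only trailing-`*` patterns are handled, aliases are ignored, and creation dates are passed in as a plain map rather than read from cluster state.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of resolving index expressions up front, in a stable order. */
final class IndexOrder {
    private IndexOrder() {}

    static List<String> resolve(List<String> expressions, Map<String, Long> creationDates) {
        Set<String> resolved = new LinkedHashSet<>(); // keeps first-seen order, drops repeats
        for (String expr : expressions) {
            if (expr.endsWith("*")) {
                // Pattern: expand to matching indices, oldest first (name breaks ties).
                String prefix = expr.substring(0, expr.length() - 1);
                creationDates.entrySet().stream()
                    .filter(e -> e.getKey().startsWith(prefix))
                    .sorted(Map.Entry.<String, Long>comparingByValue()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                    .forEach(e -> resolved.add(e.getKey()));
            } else {
                // Concrete name: keep it at its listed position.
                resolved.add(expr);
            }
        }
        return new ArrayList<>(resolved);
    }
}
```

For example, with `other` listed before `logs-*`, the result keeps `other` first and then lists the `logs-` indices from oldest to newest, which is what gives time locality for time-based indices.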

Add new challenge with reindex operation to http_logs track in order to verify reindex performance before and after resilient reindex implementation. Relates elastic/elasticsearch#42612

This adds support for rethrottling resilient reindex through updating the persistent task, ensuring that rethrottle sticks on failovers. Related to elastic#42612

This adds support for rethrottling resilient/persistent reindex through updating the .reindex index and notifying the task. This ensures that the new throttle value sticks on failovers while also ensuring that the task wakes up immediately if it had a very low throttle value. Related to elastic#42612

Added xcontent serialization tests for ReindexTaskStateDoc Related to elastic#42612 Depends on elastic#49278 (todo)

Reindex now uses the last known good status on failure to ensure that counts continue from where they left off when a node fails. Also added `persistentTaskId` to `StartReindexResponse` and removed a couple of obsolete todos. Relates #42612

henningandersen added the WIP, Meta, and :Distributed Indexing/Reindex labels May 28, 2019

ywelsch added the 7x label May 28, 2019

ywelsch assigned Tim-Brooks and henningandersen May 29, 2019

ywelsch removed the WIP label May 29, 2019

Tim-Brooks mentioned this issue Jun 19, 2019

Make reindexing managed by a persistent task #43382

Merged

gwbrown mentioned this issue Jun 27, 2019

Add _reindex to ILM #42784

Open

henningandersen mentioned this issue Jul 2, 2019

Reindex ScrollableHitSource pump data out #43864

Merged

henningandersen mentioned this issue Jul 11, 2019

Reindex share retry between hit sources #44203

Merged

Tim-Brooks mentioned this issue Aug 5, 2019

Make StartReindexJobAction a non-master action #45217

Merged

Tim-Brooks mentioned this issue Aug 6, 2019

Store reindexing result in reindex index #45260

Merged

henningandersen mentioned this issue Aug 8, 2019

Reindex share retry between hit sources (#44203) #45348

Merged

henningandersen mentioned this issue Aug 13, 2019

Reindex search resiliency #45497

Merged

henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Nov 22, 2019

Reindex naming changes

281549f

Renamed types and action names to fit that we now call it a reindex task and not a job. Removed action and named writeable todos. Relates elastic#42612

henningandersen mentioned this issue Nov 22, 2019

Reindex naming changes #49482

Merged

henningandersen mentioned this issue Nov 25, 2019

Add reindex operation elastic/rally-tracks#92

Merged

henningandersen added a commit that referenced this issue Nov 26, 2019

Reindex in mixed cluster (#49199)

f578d58

Ensure that reindex works in a mixed cluster state during rolling upgrade by not doing resilient reindex until all nodes are on the new version. Relates #42612

henningandersen added a commit that referenced this issue Nov 27, 2019

Reindex naming changes (#49482)

375f3d8

Renamed types and action names to fit that we now call it a reindex task and not a job. Removed action and named writeable todos. Relates #42612

henningandersen mentioned this issue Dec 5, 2019

Reindex resolve indices early #49850

Closed

polyfractal removed the 7x label Dec 12, 2019

henningandersen mentioned this issue Jan 10, 2020

Reindex rethrottle through persistent task #50854

Closed

This was referenced Jan 29, 2020

Reindex rethrottle persistent task #51599

Merged

Task Management Experimental Status #51628

Open

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Feb 25, 2020

Add ReindexTaskStateDoc tests

a3aa021

Added xcontent serialization tests for ReindexTaskStateDoc Related to elastic#42612 Depends on elastic#49278 (todo)

henningandersen mentioned this issue Feb 25, 2020

Add ReindexTaskStateDoc tests #52767

Merged

rjernst added the Team:Distributed (Obsolete) label May 4, 2020

henningandersen mentioned this issue Aug 5, 2020

Add support for retries in Reindex API #60362

Closed

henningandersen added a commit that referenced this issue Sep 1, 2022

Add ReindexTaskStateDoc tests (#52767)

02313a0

Added xcontent serialization tests for ReindexTaskStateDoc Related to #42612 Depends on #49278 (todo)

maxhniebergall mentioned this issue Oct 2, 2024

[ML] Overview of reindex issues with NLP #113948

Open
