
Reindex resiliency #42612

Open · 25 tasks

henningandersen opened this issue May 28, 2019 · 4 comments
Labels: :Distributed Indexing/Reindex (Issues relating to reindex that are not caused by issues further down), Meta, Team:Distributed (Obsolete) (meta label for the distributed team, obsolete; replaced by Distributed Indexing/Coordination)

Comments


henningandersen commented May 28, 2019

We want to make reindex resilient to node restarts and failures, such that reindex can continue to run across such events.

There are two primary problems to solve:

  • Data node resiliency. Reindex relies on scroll queries which are not resilient.
  • Coordinator node resiliency. Reindex runs on the host receiving the request and cannot survive if that node dies or is restarted.

Search resiliency:

  • Search ordered by seq_no and handle query failures by retrying from the last seq_no (inclusive); see the sketch after this list
  • Support reindex from remote when the source version is 6.6 or later
  • Add support for an alternative numeric ordering attribute, particularly useful for reindex from a remote pre-6.5 source
  • Back-off strategy on repeated failures
  • Verify overhead of seq_no ordering
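
For illustration, a minimal sketch of the resumable seq_no-ordered search from the first item above, assuming the server-side query builders (SearchSourceBuilder, QueryBuilders); the lastSeqNo checkpoint and batch size are illustrative, not the actual implementation:

```java
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.SortOrder;

class SeqNoSearchSketch {
    // Build a search that can be restarted after a failure: sort by _seq_no
    // and, on retry, filter to `_seq_no >= lastSeqNo` (inclusive, so the last
    // processed hit may be seen again; writes must therefore be idempotent).
    static SearchSourceBuilder resumableSearch(QueryBuilder originalQuery, long lastSeqNo) {
        BoolQueryBuilder query = QueryBuilders.boolQuery().filter(originalQuery);
        if (lastSeqNo >= 0) { // -1 means "no progress yet"
            query.filter(QueryBuilders.rangeQuery("_seq_no").gte(lastSeqNo));
        }
        return new SearchSourceBuilder()
            .query(query)
            .sort("_seq_no", SortOrder.ASC) // stable, resumable ordering
            .size(1000);                    // batch size, illustrative only
    }
}
```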

Coordinator node resiliency:

  • POC to clarify this subject more (Make reindexing managed by a persistent task #43382)
  • Decide on start reindex job action name
    • indices:data/write/start_reindex
    • indices:admin/reindex/start_reindex
    • cluster:admin/reindex/start_reindex
    • indices:data/reindex/start_reindex
  • Decide on persistent reindex task name
  • Evaluate how we want to do timeouts for waiting on initial task creation or reindex task completion
  • Refactor common parts from data frames and roll-up
  • Add reindex persistent task and remove it when done (Make reindexing managed by a persistent task #43382)
  • Allocation of reindex persistent task (Make reindexing managed by a persistent task #43382)
  • Store progress information periodically into .tasks index (see the sketch after this list)
  • Resume from existing progress information when allocated to a new node
  • Make updates to persistent tasks resilient against master failovers
  • Support async durability on destination, ensuring data in checkpoint is fsync'ed into destination
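
As a rough illustration of the checkpointing items above, a hypothetical writer that periodically stores progress for the persistent task into the .tasks index so a new allocation can resume; all names (persistentTaskId, the document layout) are assumptions, not the actual implementation:

```java
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.Client;

import java.util.Map;

class CheckpointSketch {
    private final Client client;
    private final String persistentTaskId;

    CheckpointSketch(Client client, String persistentTaskId) {
        this.client = client;
        this.persistentTaskId = persistentTaskId;
    }

    // Called periodically. The checkpoint should only advance once the
    // corresponding writes are durable on the destination (the fsync item
    // above), otherwise a failover could skip documents.
    void storeProgress(long lastSeqNo, long created, long updated) {
        IndexRequest request = new IndexRequest(".tasks")
            .id("reindex-" + persistentTaskId)
            .source(Map.of("last_seq_no", lastSeqNo, "created", created, "updated", updated));
        client.index(request, ActionListener.wrap(
            (IndexResponse r) -> { /* checkpoint persisted */ },
            e -> { /* retry next period; a stale checkpoint only re-processes */ }));
    }
}
```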

Slicing:

  • Investigate having multiple in-flight search and bulk requests as an alternative (see the sketch below)
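
A small sketch of that alternative: instead of sliced workers, cap the number of concurrent bulk requests with a semaphore. The limit of 4 and the synchronous shape are illustrative only; real code would release the permit from the bulk response listener:

```java
import java.util.concurrent.Semaphore;

class InFlightSketch {
    private final Semaphore inFlight = new Semaphore(4); // e.g. 4 concurrent bulks

    void sendBulk(Runnable bulkRequest) throws InterruptedException {
        inFlight.acquire(); // blocks once 4 bulk requests are outstanding
        try {
            bulkRequest.run();
        } finally {
            inFlight.release();
        }
    }
}
```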

Benchmarking:

  • Compare rally original indexing to reindex
  • Overhead of scripting and ingest pipelines

Misc:

  • Handle write failures by retrying when appropriate (see the back-off sketch after this list)
  • Refined error handling, filter out known/retryable errors
  • HLRC support for new persistent task id.
  • Examine if transport client in 7.x can call resilient reindex (workaround).
  • Add serialization tests for get reindex request
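
For the retry items above, a sketch of a bounded exponential back-off for retryable write failures; the constants and the isNotRetryable hook (the "filter out known/retryable errors" task) are assumptions:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

class BackoffSketch {
    // 500ms, 1s, 2s, ... capped at 30s
    static long delayMillis(int attempt) {
        return Math.min(30_000L, 500L << Math.min(attempt, 6));
    }

    static <T> T retry(Callable<T> action, int maxAttempts) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (attempt + 1 >= maxAttempts || isNotRetryable(e)) {
                    throw e;
                }
                TimeUnit.MILLISECONDS.sleep(delayMillis(attempt));
            }
        }
    }

    static boolean isNotRetryable(Exception e) {
        return false; // placeholder: classify known fatal vs. retryable errors here
    }
}
```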

Docs:

  • Clarify how to use resilient reindex in reference docs (conflict handling, parameters)
henningandersen added the WIP, Meta, and :Distributed Indexing/Reindex labels May 28, 2019
elasticmachine commented

Pinging @elastic/es-distributed

ywelsch added the 7x label May 28, 2019
ywelsch removed the WIP label May 29, 2019
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Jul 2, 2019
Refactor ScrollableHitSource to pump data out and have a simplified
interface (callers should no longer call startNextScroll; instead they
simply mark that they are done with the previous result, triggering a
new batch of data). This eases making reindex resilient, since we will
sometimes need to rerun the search during retries.

Relates elastic#43187 and elastic#42612
Tim-Brooks commented

I made some updates to the meta issues under coordinator node.

henningandersen added a commit that referenced this issue Jul 9, 2019
Refactor ScrollableHitSource to pump data out and have a simplified
interface (callers should no longer call startNextScroll; instead they
simply mark that they are done with the previous result, triggering a
new batch of data). This eases making reindex resilient, since we will
sometimes need to rerun the search during retries.

Relates #43187 and #42612
henningandersen added a commit that referenced this issue Jul 9, 2019
Refactor ScrollableHitSource to pump data out and have a simplified
interface (callers should no longer call startNextScroll; instead they
simply mark that they are done with the previous result, triggering a
new batch of data). This eases making reindex resilient, since we will
sometimes need to rerun the search during retries.

Relates #43187 and #42612
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Jul 11, 2019
The client and remote hit sources each had their own retry mechanism
that did the same thing. To support resiliency we will have to expand
the retry mechanisms, and in preparation for that, the retry mechanism
is now shared so that each subclass is only responsible for sending
requests and converting responses/failures to a common format.

Part of elastic#42612
Tim-Brooks added a commit that referenced this issue Jul 18, 2019
This is related to #42612. Currently the reindexing transport action
creates a task on the local coordinator node. Unfortunately this is not
resilient to coordinator node failures. This commit adds a new action
that creates a reindexing job as a persistent task.
Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this issue Aug 5, 2019
TransportStartReindexJobAction is currently a master action. Reindexing
only needs access to the cluster state to perform some validations.
Prior to persistent reindexing we used a normal data node to perform
these validations. There is no reason that these validations need a
perfectly up-to-date view of the cluster state. This commit changes
the action to be a normal transport action.

Relates to elastic#42612.
henningandersen added a commit that referenced this issue Aug 6, 2019
The client and remote hit sources each had their own retry mechanism
that did the same thing. To support resiliency we will have to expand
the retry mechanisms, and in preparation for that, the retry mechanism
is now shared so that each subclass is only responsible for sending
requests and converting responses/failures to a common format.

Part of #42612
Tim-Brooks added a commit that referenced this issue Aug 6, 2019
TransportStartReindexJobAction is currently a master action. Reindexing
only needs access to the cluster state to perform some validations.
Prior to persistent reindexing we used a normal data node to perform
these validations. There is no reason that these validations need a
perfectly up-to-date view of the cluster state. This commit changes
the action to be a normal transport action.

Relates to #42612.
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Aug 8, 2019
The client and remote hit sources each had their own retry mechanism
that did the same thing. To support resiliency we will have to expand
the retry mechanisms, and in preparation for that, the retry mechanism
is now shared so that each subclass is only responsible for sending
requests and converting responses/failures to a common format.

Part of elastic#42612
henningandersen added a commit that referenced this issue Aug 8, 2019
The client and remote hit sources each had their own retry mechanism
that did the same thing. To support resiliency we will have to expand
the retry mechanisms, and in preparation for that, the retry mechanism
is now shared so that each subclass is only responsible for sending
requests and converting responses/failures to a common format.

Part of #42612
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Aug 13, 2019
Local reindex can now survive losing data nodes that contain source
data. The original query will be restarted with a filter for
`_seq_no >= last_seq_no` when a failure is detected.

Part of elastic#42612 and split out from elastic#43187
Tim-Brooks added a commit that referenced this issue Aug 14, 2019
Currently the result of a reindex persistent task is propagated and
stored in the cluster state. This commit changes this so that only the
ephemeral task-id, headers, and reindex state are stored in the cluster
state. Any result (exception or response) is stored in the reindex
index.

Relates to #42612.
Tim-Brooks added a commit that referenced this issue Nov 20, 2019
This is related to #42612. This commit deletes TODOs related to test
assertions changing due to exception serialization changes. We have
determined that the exception x-content serialization was never stable
in the first place (reading from the task index), so we are okay with
these changes.
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Nov 22, 2019
Renamed types and action names to reflect that we now call it a reindex
task and not a job. Removed action and named-writeable TODOs.

Relates elastic#42612
henningandersen added a commit to henningandersen/rally-tracks that referenced this issue Nov 25, 2019
Add a reindex operation to the http_logs track in order to
verify reindex performance before and after the resilient
reindex implementation, as well as to keep an eye on it
in the future.

Relates elastic/elasticsearch#42612
henningandersen added a commit that referenced this issue Nov 26, 2019
Ensure that reindex works in a mixed-version cluster during rolling
upgrade by not doing resilient reindex until all nodes are on the new
version.

Relates #42612
henningandersen added a commit that referenced this issue Nov 27, 2019
Renamed types and action names to reflect that we now call it a reindex
task and not a job. Removed action and named-writeable TODOs.

Relates #42612
Tim-Brooks added a commit that referenced this issue Nov 27, 2019
This is related to #42612. It adds a setting to configure what headers are
stored by the persistent reindexing task for further requests. Additionally,
it has the x-pack security module automatically configure this setting to
ensure security works with reindexing.
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Dec 5, 2019
Resolve indices before starting to reindex. This ensures that the list
of indices does not change when failing over (TBD). The one exception to
this is aliases, which we still need to access through the alias.

In addition, resolved index patterns are sorted by create-date and
otherwise the listed order is preserved. This ensures that once we
reindex one index at a time, we will get reasonable time locality for
time-based indices.

The resolved list of indices will also be used to search one
index (or index group) at a time, improving search performance (since we
use sort) and allowing us to do more fine-grained checkpointing and
progress tracking (TBD).

Relates elastic#42612
polyfractal removed the 7x label Dec 12, 2019
henningandersen added a commit to elastic/rally-tracks that referenced this issue Jan 9, 2020
Add new challenge with reindex operation to http_logs track
in order to verify reindex performance before and after resilient
reindex implementation.

Relates elastic/elasticsearch#42612
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Jan 10, 2020
This adds support for rethrottling resilient reindex through updating
the persistent task, ensuring that rethrottle sticks on failovers.

Related to elastic#42612
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Jan 29, 2020
This adds support for rethrottling resilient/persistent reindex
through updating the .reindex index and notifying the task. This
ensures that the new throttle value sticks on failovers while
also ensuring that the task wakes up immediately if it had a very
low throttle value.

Related to elastic#42612
henningandersen added a commit that referenced this issue Feb 13, 2020
* Reindex rethrottle persistent task

This adds support for rethrottling resilient/persistent reindex
through updating the .reindex index and notifying the task. This
ensures that the new throttle value sticks on failovers while
also ensuring that the task wakes up immediately if it had a very
low throttle value.

Related to #42612
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Feb 25, 2020
Added xcontent serialization tests for ReindexTaskStateDoc

Related to elastic#42612
Depends on elastic#49278 (todo)
henningandersen added a commit that referenced this issue Mar 3, 2020
Reindex now uses the last known good status on failure to ensure that counts
continue from where they left off when a node fails.

Also added `persistentTaskId` to `StartReindexResponse` and removed a
couple of obsolete todos.

Relates #42612
rjernst added the Team:Distributed (Obsolete) label May 4, 2020
henningandersen added a commit that referenced this issue Sep 1, 2022
Added xcontent serialization tests for ReindexTaskStateDoc

Related to #42612
Depends on #49278 (todo)