Per doc replica rollbacks #31637

Closed
11 tasks
dnhatn opened this issue Jun 27, 2018 · 2 comments
Labels
:Distributed Indexing/Engine: Anything around managing Lucene and the Translog in an open shard.
:Distributed Indexing/Recovery: Anything around constructing a new shard, either from a local or a remote source.
Meta
stalled

Comments

@dnhatn
Member

dnhatn commented Jun 27, 2018

When a replica switches to follow/recover from a new primary, it may have indexed some operations that do not exist on the new primary. We need to undo those operations. This meta issue tracks the work to be done on this story.

  • Use an updatable numeric docValues to record the sequence number of an updating operation (Retain soft-deleted documents for rollback #31846)

  • Do we need an optimized version of the soft-deletes retention merge policy? I am not sure if we can avoid the slowness of the range query of a numeric docValues with the builtin soft-deletes retention merge policy.

  • Handle duplicate and nested documents. With the current implementation, if a nested doc arrives twice, we will index the first nested doc as live and the second as deleted. Suppose we delete that doc, then subsequently would like to un-delete it. We have no way to un-delete only the child docs of the first nested doc. We can solve this by not indexing the second doc (Do not add multiple copies of stale docs to Lucene #31806).

  • Provide a capability to roll back a single document with a specific seqno/term in both Lucene and the VersionMap (Add primitive method allows rollback a single operation #31910)

  • Capture the maximum of max_seqno on all active replicas before resync, then issue no-ops on the primary for every seqno from its max_seqno to the max of max_seqno on replicas. Another option is to let replicas locally roll back operations whose seq# >= max_seqno of the primary.

  • Live Lucene per doc rollback on the replica during the primary-replica resync using the single doc rollback method. This is the main task of this story.
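The no-op option in the list above can be sketched as a small range computation. This is only an illustration: `ResyncNoOpPlanner` and `noOpSeqNos` are hypothetical names, not Elasticsearch APIs, and the sketch assumes the no-ops start just above the primary's own max_seqno, since the primary already has an operation at every seqno up to that point.

```java
import java.util.Collection;
import java.util.stream.LongStream;

/** Hypothetical helper for illustration only; not part of Elasticsearch. */
final class ResyncNoOpPlanner {
    private ResyncNoOpPlanner() {}

    /**
     * Returns the sequence numbers the new primary would fill with no-ops:
     * every seqno above its own max_seqno, up to the largest max_seqno
     * captured from the active replicas before the resync.
     */
    static long[] noOpSeqNos(long primaryMaxSeqNo, Collection<Long> replicaMaxSeqNos) {
        long target = primaryMaxSeqNo;
        for (long replicaMax : replicaMaxSeqNos) {
            target = Math.max(target, replicaMax);
        }
        // rangeClosed yields an empty stream when no replica is ahead of the primary.
        return LongStream.rangeClosed(primaryMaxSeqNo + 1, target).toArray();
    }
}
```

With a primary at max_seqno 10 and replicas at 8, 12 and 11, the planner returns 11 and 12; if no replica is ahead of the primary, it returns an empty array.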

Live VersionMap rollback on the replica during the primary-replica resync. If we replace the tombstone map with Lucene soft-deletes, we may not need to roll back the VersionMap because a refresh will flush all entries in the version map.
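To make the single-document rollback primitive from the list above more concrete, here is a toy in-memory model of soft-deleted copies. It does not use Lucene or the real VersionMap: `SoftDeleteStore`, `Copy`, `index`, `rollback` and `liveSource` are hypothetical names, and the sketch assumes that only index operations occur and that they are applied in seqno order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model for illustration; the names are hypothetical, not Lucene or Elasticsearch APIs. */
final class SoftDeleteStore {
    /** One retained copy of a document, either live or soft-deleted. */
    static final class Copy {
        final long seqNo;
        final long term;
        final String source;
        boolean softDeleted;

        Copy(long seqNo, long term, String source) {
            this.seqNo = seqNo;
            this.term = term;
            this.source = source;
        }
    }

    private final Map<String, List<Copy>> copiesById = new HashMap<>();

    /** Indexes a new version: soft-deletes the existing copies and appends the new live one. */
    void index(String id, long seqNo, long term, String source) {
        List<Copy> copies = copiesById.computeIfAbsent(id, k -> new ArrayList<>());
        for (Copy c : copies) {
            c.softDeleted = true;
        }
        copies.add(new Copy(seqNo, term, source));
    }

    /** Removes the copy written by exactly (seqNo, term), then un-deletes the newest remaining copy. */
    boolean rollback(String id, long seqNo, long term) {
        List<Copy> copies = copiesById.get(id);
        if (copies == null || !copies.removeIf(c -> c.seqNo == seqNo && c.term == term)) {
            return false; // nothing matched, so nothing to undo
        }
        Copy newest = null;
        for (Copy c : copies) {
            if (newest == null || c.seqNo > newest.seqNo) {
                newest = c;
            }
        }
        if (newest != null) {
            newest.softDeleted = false;
        }
        return true;
    }

    /** Returns the source of the live copy, or null if the document has no live copy. */
    String liveSource(String id) {
        for (Copy c : copiesById.getOrDefault(id, Collections.<Copy>emptyList())) {
            if (!c.softDeleted) {
                return c.source;
            }
        }
        return null;
    }
}
```

The point of the model is that a rollback is only possible because the previous copy was retained as a soft-deleted document; matching on both seqno and term keeps an operation from a different primary term from being undone by mistake.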

Benchmarking

  • We need to benchmark to make sure this change does not reduce the indexing throughput or slow down a refresh/merge significantly.

Removing the safe commit

Currently, a primary and replica roll back to the safe commit before executing a store or peer recovery respectively. We can achieve the same thing with the last commit and per-doc rollback.

  • A replica can start a peer-recovery with the last commit (instead of the safe commit), then apply per-doc rollback to that commit before phase2 if the recovery is operation-based.

  • With this change, a primary does not have to transfer the safe commit in the peer-recovery, so we can keep only the last commit on the primary

  • Once a replica uses the last commit in a peer-recovery, we can keep only the last commit on the replica
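As a rough sketch of how a replica might pick the operations to undo in its last commit before phase2, the snippet below selects every operation above a given checkpoint and orders them newest first. The issue does not spell out the selection rule, so the use of the global checkpoint as the cut-off is an assumption here, and `LastCommitRollbackPlanner`, `Op` and `opsToRollBack` are hypothetical names rather than Elasticsearch APIs.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical planner for illustration only; not part of Elasticsearch. */
final class LastCommitRollbackPlanner {
    /** Minimal description of an operation contained in the last commit. */
    static final class Op {
        final String id;
        final long seqNo;
        final long term;

        Op(String id, long seqNo, long term) {
            this.id = id;
            this.seqNo = seqNo;
            this.term = term;
        }
    }

    private LastCommitRollbackPlanner() {}

    /**
     * Assumption: operations above the global checkpoint are the ones that may
     * not exist on the primary. They are returned newest first so that each
     * per-doc rollback exposes the previous copy of the same document.
     */
    static List<Op> opsToRollBack(List<Op> opsInLastCommit, long globalCheckpoint) {
        return opsInLastCommit.stream()
            .filter(op -> op.seqNo > globalCheckpoint)
            .sorted(Comparator.comparingLong((Op op) -> op.seqNo).reversed())
            .collect(Collectors.toList());
    }
}
```

Each returned operation would then be handed to the single-document rollback primitive with its seqno and term.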

Misc

  • A formal model to prove the correctness of the Lucene rollback

Relates to #10708

@dnhatn added the :Distributed Indexing/Recovery, Meta, and :Distributed Indexing/Engine labels Jun 27, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn
Member Author

dnhatn commented Mar 7, 2019

I am closing this issue because it is not feasible for now.

@dnhatn closed this as completed Mar 7, 2019
2 participants