Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replay history of operations in remote recovery #39153

Closed
wants to merge 33 commits into from

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Feb 19, 2019

The safe commit invariant does not hold for the following indices because we do not replay the history in remote recovery.

Relates #39000
Relates #35975

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@dnhatn
Copy link
Member Author

dnhatn commented Feb 20, 2019

run elasticsearch-ci/default-distro

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dnhatn. I've had an initial look and this looks already great. I'm not super happy with restoreShard returning a Translog.Snapshot, as it's now up to the caller to complete building the shard but also see that it's tricky to otherwise keep information about session etc. local to the repository implementation.

* to the max_seq_no. Then we won't have a safe commit for the restoring commit is not safe (missing translog).
* To maintain the safe commit assumption, we have to forcefully flush a new commit here.
*/
indexShard.flush(new FlushRequest().force(true).waitIfOngoing(true));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we make sure that the global checkpoint is up-to-date (i.e. >= max-seq-no in the new commit that we create here) before calling this? Otherwise the shard will be marked as in-sync in the cluster state while the commit here will only become safe commit when the shard is locally started (and the gcp advanced). The main property we're after here is that every in-sync shard copy has a safe commit, which (AFAICS) is not guaranteed by the current recovery logic.

*/
public static void assertSafeCommitExists(IndexShard shard) throws IOException {
try {
if (shard.state != IndexShardState.STARTED) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a shard in POST_RECOVERY should also satisfy this property already, see my comment above

"checkpoint=" + newCommitInfo.localCheckpoint + " max_seq_no=" + newCommitInfo.maxSeqNo);
}
}
}
indexShard.finalizeRecovery();
indexShard.postRecovery("restore done");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we assert in post_recovery that we have a safe commit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an assertion here, but this assertion may slow down our tests. I will post the difference.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, we don't have this invariant with closed indices because the global checkpoint is not persisted to the translog checkpoint during recovery.

@dnhatn
Copy link
Member Author

dnhatn commented Mar 11, 2019

Discussed with Yannick on another channel, I marked this as WIP since we need to make broader (maybe unrelated) changes to this PR to achieve the safe commit invariant.

@dnhatn
Copy link
Member Author

dnhatn commented Apr 3, 2019

We now can achieve the safe commit invariant. I will open smaller pull requests for unrelated changes.

@dnhatn dnhatn removed the WIP label Apr 5, 2019
@dnhatn dnhatn requested a review from ywelsch April 5, 2019 03:10
@dnhatn
Copy link
Member Author

dnhatn commented Apr 5, 2019

@ywelsch This is ready for another round. Can you please give it another go. Thank you!

@dnhatn
Copy link
Member Author

dnhatn commented May 16, 2019

I am closing this PR as the safe commit invariant does not hold for follower indices.

@dnhatn dnhatn closed this May 16, 2019
@dnhatn dnhatn deleted the remote-recovery-history branch May 16, 2019 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/CCR Issues around the Cross Cluster State Replication features >enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants