Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sequence number based replica allocation #46959

Merged
merged 14 commits into from
Oct 13, 2019
Merged

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Sep 23, 2019

This change prefers allocating replicas on nodes where it can perform an operation-based recovery or has sync_id match to reduce recovery time. We no longer need to perform a synced_flush in a rolling upgrade or full cluster start with this improvement.

I started with an implementation where I used the persisted global checkpoint from replicas and peer recovery retention leases from primaries to make decisions. However, I was not happy with the extension capturing the persisted global checkpoint (see https://github.com/elastic/elasticsearch/compare/master...dnhatn:replica-allocator-with-gcp?expand=1#diff-275151cc4a5cdf942f310a219e86a403R485).

We don't need the global checkpoint to make decisions for open indices. Having a peer recovery retention lease alone is enough to guarantee to have an operation-based recovery since we share the persisted global checkpoint between copies. I decided to implement this without the global checkpoint.

I prefer to support closed/frozen indices in a follow-up after we agree on the approach (i.e., using the global checkpoint or the last commit).

Closes #46318

@dnhatn dnhatn added >enhancement :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v8.0.0 v7.5.0 labels Sep 23, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this direction looks good. I have left a few initial comments inline, but my main concern is staleness:

I think that the info we have from primary about leases can be stale. As soon as a node with a replica dies, we will reach out to all nodes including primary and read the info. And then cache it until a shard with same shard-id is started. Given the index.recovery.file_based_threshold, the staleness may become important in much less time than the default 12h lease expiration.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff @dnhatn, thanks. I left some points for discussion.

dnhatn added a commit that referenced this pull request Sep 28, 2019
Today, we don't clear the shard info of the primary shard when a new
node joins; then we might risk of making replica allocation decisions
based on the stale information of the primary. The serious problem is
that we can cancel the current recovery which is more advanced than the
copy on the new node due to the old info we have from the primary.

With this change, we ensure the shard info from the primary is not older
than any node when allocating replicas.

Relates #46959

This work was done by Henning in #42518.
Co-authored-by: Henning Andersen <[email protected]>
@dnhatn
Copy link
Member Author

dnhatn commented Oct 1, 2019

@henningandersen @DaveCTurner Thank you for reviewing. I have addressed your comments and suggestions in 17bfb34. Would you please take another look?

dnhatn added a commit that referenced this pull request Oct 2, 2019
Today, we don't clear the shard info of the primary shard when a new
node joins; then we might risk of making replica allocation decisions
based on the stale information of the primary. The serious problem is
that we can cancel the current recovery which is more advanced than the
copy on the new node due to the old info we have from the primary.

With this change, we ensure the shard info from the primary is not older
than any node when allocating replicas.

Relates #46959

This work was done by Henning in #42518.

Co-authored-by: Henning Andersen <[email protected]>
Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've taken a look to see how this works and left mainly smaller comments. Good stuff.

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dnhatn, this is looking good. I left a number of smaller comments to address or comment on.

@dnhatn
Copy link
Member Author

dnhatn commented Oct 9, 2019

@ywelsch @henningandersen Thank you for another helpful review. I have responded/addressed your comments. Would you please take another look?

@dnhatn
Copy link
Member Author

dnhatn commented Oct 10, 2019

Although the failure is from a newly introduced test, I think this PR is still ready for another round. I am investigating the test failure.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Nhat!

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Thanks @dnhatn

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dnhatn and apologies for the delayed review. I left a few questions, but no blockers.

@dnhatn
Copy link
Member Author

dnhatn commented Oct 13, 2019

@DaveCTurner Thanks for looking. I have addressed your comments. Would you please take another look?

@dnhatn dnhatn requested a review from DaveCTurner October 13, 2019 02:26
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks @dnhatn

@dnhatn
Copy link
Member Author

dnhatn commented Oct 13, 2019

@henningandersen @DaveCTurner @ywelsch Thank you very much for your helpful reviews.

@dnhatn dnhatn merged commit e628f35 into elastic:master Oct 13, 2019
@dnhatn dnhatn deleted the replica-allocator branch October 13, 2019 16:58
dnhatn added a commit that referenced this pull request Oct 14, 2019
With this change, shard allocation prefers allocating replicas on a node
that already has a copy of the shard that is as close as possible to the
primary, so that it is as cheap as possible to bring the new replica in
sync with the primary. Furthermore, if we find a copy that is identical
to the primary then we cancel an ongoing recovery because the new copy
which is identical to the primary needs no work to recover as a replica.

We no longer need to perform a synced flush before performing a rolling
upgrade or full cluster start with this improvement.

Closes #46318
dnhatn added a commit that referenced this pull request Oct 14, 2019
howardhuanghua pushed a commit to TencentCloudES/elasticsearch that referenced this pull request Oct 14, 2019
With this change, shard allocation prefers allocating replicas on a node 
that already has a copy of the shard that is as close as possible to the
primary, so that it is as cheap as possible to bring the new replica in
sync with the primary. Furthermore, if we find a copy that is identical
to the primary then we cancel an ongoing recovery because the new copy
which is identical to the primary needs no work to recover as a replica.

We no longer need to perform a synced flush before performing a rolling 
upgrade or full cluster start with this improvement.

Closes elastic#46318
howardhuanghua pushed a commit to TencentCloudES/elasticsearch that referenced this pull request Oct 14, 2019
dnhatn added a commit that referenced this pull request Dec 20, 2019
…50351)

Today, the replica allocator uses peer recovery retention leases to 
select the best-matched copies when allocating replicas of indices with
soft-deletes. We can employ this mechanism for indices without
soft-deletes because the retaining sequence number of a PRRL is the
persisted global checkpoint (plus one) of that copy. If the primary and 
replica have the same retaining sequence number, then we should be able
to perform a noop recovery. The reason is that we must be retaining
translog up to the local checkpoint of the safe commit, which is at most
the global checkpoint of either copy). The only limitation is that we
might not cancel ongoing file-based recoveries with PRRLs for noop
recoveries. We can't make the translog retention policy comply with
PRRLs. We also have this problem with soft-deletes if a PRRL is about to
expire.

Relates #45136
Relates #46959
dnhatn added a commit that referenced this pull request Dec 24, 2019
…50351)

Today, the replica allocator uses peer recovery retention leases to
select the best-matched copies when allocating replicas of indices with
soft-deletes. We can employ this mechanism for indices without
soft-deletes because the retaining sequence number of a PRRL is the
persisted global checkpoint (plus one) of that copy. If the primary and
replica have the same retaining sequence number, then we should be able
to perform a noop recovery. The reason is that we must be retaining
translog up to the local checkpoint of the safe commit, which is at most
the global checkpoint of either copy). The only limitation is that we
might not cancel ongoing file-based recoveries with PRRLs for noop
recoveries. We can't make the translog retention policy comply with
PRRLs. We also have this problem with soft-deletes if a PRRL is about to
expire.

Relates #45136
Relates #46959
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
…lastic#50351)

Today, the replica allocator uses peer recovery retention leases to 
select the best-matched copies when allocating replicas of indices with
soft-deletes. We can employ this mechanism for indices without
soft-deletes because the retaining sequence number of a PRRL is the
persisted global checkpoint (plus one) of that copy. If the primary and 
replica have the same retaining sequence number, then we should be able
to perform a noop recovery. The reason is that we must be retaining
translog up to the local checkpoint of the safe commit, which is at most
the global checkpoint of either copy). The only limitation is that we
might not cancel ongoing file-based recoveries with PRRLs for noop
recoveries. We can't make the translog retention policy comply with
PRRLs. We also have this problem with soft-deletes if a PRRL is about to
expire.

Relates elastic#45136
Relates elastic#46959
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement v7.5.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sequence-number-based replica allocation
6 participants