-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Renew retention lease with the last known synced checkpoint #18
Conversation
@@ -138,7 +140,8 @@ class ShardReplicationTask(id: Long, type: String, action: String, description: | |||
rateLimiter.release() | |||
continue | |||
} | |||
retentionLeaseHelper.renewRetentionLease(remoteShardId, seqNo, followerShardId) | |||
//renew retention lease with global checkpoint so that any shard that picks up shard replication task has data until then. | |||
retentionLeaseHelper.renewRetentionLease(remoteShardId, indexShard.lastSyncedGlobalCheckpoint, followerShardId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we using localCheckpoint at one place and GlobalCheckpoint at other ?
GlobalCheckpoint can lag behind localCheckpoint and give exception RetentionLeaseInvalidRetainingSeqNoException
in that case if we try to renew existing lease with lesser id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack changed this to GCP
var seqNo = indexShard.localCheckpoint + 1 | ||
// Adding retention lease at local checkpoint of a node. This makes sure | ||
// new tasks spawned after node changes/shard movements are handled properly | ||
log.info("Adding retentionlease at follower Sequence number: ${indexShard.localCheckpoint}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: small s in "Sequence" word ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Please add a backlog item to add IT for the scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Renew retention lease with the last known synced checkpoint
Renew retention lease with the last known synced checkpoint
Issues Resolved
when follower node which has primary shard for an index is down, a replica shard picks up the task. In this case previous code always used to add retention lease with -1 sequence number leading to replication failure.
This change makes sure that the last checkpoint on a follower node is used to add retention lease and hence replication will resume gracefully. Following logs when a node is down shows that replication is resumed gracefully
Logs in Follower primary shard's node when node was terminated
Logs in follower node which is selected as new primary where replication is resumed gracefully