Renew retention lease with the last known synced checkpoint #18

Merged: 4 commits merged into main on Jun 23, 2021

tbhanu-amzn
Contributor

Renew retention lease with the last known synced checkpoint

Issues Resolved

When the follower node that holds the primary shard for an index goes down, a replica shard picks up the replication task. In this case, the previous code always added the retention lease with sequence number -1, which caused replication to fail.

This change uses the last checkpoint on the follower node when adding the retention lease, so replication resumes gracefully. The following logs, captured while a node went down, show replication resuming gracefully.

Logs on the node holding the follower primary shard, at the time the node was terminated

[2021-06-18T08:35:48,944][INFO ][c.a.e.r.t.s.ShardReplicationTask] [node2] [test_index][0] Got 396 changes starting from seqNo: 247105
[2021-06-18T08:35:48,944][INFO ][c.a.e.r.t.s.ShardReplicationTask] [node2] [test_index][0] Renewing retentionlease of follower global check point: 244539
Connection to ec2-34-241-98-254.eu-west-1.compute.amazonaws.com closed by remote host.
Connection to ec2-34-241-98-254.eu-west-1.compute.amazonaws.com closed.

Logs on the follower node selected as the new primary, where replication resumes gracefully

[2021-06-18T08:35:49,824][INFO ][c.a.e.r.t.s.ShardReplicationExecutor] [node3] starting persistent replication task: {"remote_cluster":"leader-cluster-1node","remote_shard":"[test_index][0]","remote_index_uuid":"Gn6nr07ZQ3mg_lFvQ4SZ1w","follower_shard":"[test_index][0]","follower_index_uuid":"6TolcBpWSmuWSUNyQFMMMA"}, com.amazon.elasticsearch.replication.task.shard.FollowingState@4dbd6226, 5, {"state":"STARTED"}
[2021-06-18T08:35:49,863][ERROR][c.a.e.r.s.RemoteClusterRetentionLeaseHelper] [node3] retention lease with ID [replication:follower-cluster-node:[test_index][0]] already exists
[2021-06-18T08:35:49,864][INFO ][c.a.e.r.s.RemoteClusterRetentionLeaseHelper] [node3] Renew retention lease as it already exists replication:follower-cluster-node:[test_index][0] with 247500
[2021-06-18T08:35:49,866][INFO ][c.a.e.r.t.s.ShardReplicationTask] [node3] [test_index][0] Adding retentionlease of follower global check point: 247500
[2021-06-18T08:35:49,866][INFO ][c.a.e.r.t.s.ShardReplicationTask] [node3] [test_index][0] Follower Global check point is: 247500
[2021-06-18T08:35:49,866][INFO ][c.a.e.r.t.s.ShardReplicationTask] [node3] [test_index][0] Index local check point is : 247500
[2021-06-18T08:35:50,599][INFO ][c.a.e.r.t.s.ShardReplicationTask] [node3] [test_index][0] Got 513 changes starting from seqNo: 247501
[2021-06-18T08:35:50,600][INFO ][c.a.e.r.t.s.ShardReplicationTask] [node3] [test_index][0] Renewing retentionlease of follower global check point: 247500
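
To make the handover in these logs concrete, here is a minimal Kotlin sketch. It assumes the usual retention lease semantics, in which a lease with retaining sequence number R asks the leader to keep every operation with sequence number R or higher. The helper `leaderStillHas` and the lagging-replica value 246000 are hypothetical illustrations, not part of the plugin; the other numbers are taken from the logs above.

```kotlin
// Toy model, not the plugin's code: `leaderStillHas` is a hypothetical helper.
// Assumption: a lease with retaining sequence number R keeps operations with seqNo >= R.
fun leaderStillHas(leaseRetainingSeqNo: Long, firstNeededSeqNo: Long): Boolean =
    firstNeededSeqNo >= leaseRetainingSeqNo

fun main() {
    // From the node2 log: the lease was last renewed at the follower global checkpoint.
    val leaseAtGlobalCheckpoint = 244_539L
    // From the node3 log: the new primary's local checkpoint is 247500, so it resumes at 247501.
    val resumeFrom = 247_500L + 1
    println(leaderStillHas(leaseAtGlobalCheckpoint, resumeFrom)) // true

    // Hypothetical: had the lease been renewed at the fetch position 247105 instead,
    // a promoted replica whose local checkpoint was only 246000 would need
    // operations that the lease no longer covers.
    println(leaderStillHas(247_105L, 246_000L + 1)) // false
}
```

Under this reading, renewing at the global checkpoint is the conservative choice: every in-sync copy has at least that much data, so whichever copy takes over can resume from a position the lease still covers.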

@@ -138,7 +140,8 @@ class ShardReplicationTask(id: Long, type: String, action: String, description:
 rateLimiter.release()
 continue
 }
-retentionLeaseHelper.renewRetentionLease(remoteShardId, seqNo, followerShardId)
+//renew retention lease with global checkpoint so that any shard that picks up shard replication task has data until then.
+retentionLeaseHelper.renewRetentionLease(remoteShardId, indexShard.lastSyncedGlobalCheckpoint, followerShardId)
Collaborator

Why are we using localCheckpoint in one place and the global checkpoint in another?

The global checkpoint can lag behind the local checkpoint. In that case, renewing the existing lease with a lower sequence number throws RetentionLeaseInvalidRetainingSeqNoException.
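
This concern can be illustrated with a toy model. It is a sketch under the assumption, taken from the comment, that renewing a lease with a lower retaining sequence number is rejected. `ToyLease` is hypothetical and stands in for the real lease, `IllegalArgumentException` stands in for RetentionLeaseInvalidRetainingSeqNoException, and the checkpoint values are made up.

```kotlin
// Hypothetical stand-in for a retention lease; not the Elasticsearch implementation.
class ToyLease(initialRetainingSeqNo: Long) {
    var retainingSeqNo: Long = initialRetainingSeqNo
        private set

    // Renewal may only keep the retaining sequence number or move it forward.
    fun renew(newRetainingSeqNo: Long) {
        require(newRetainingSeqNo >= retainingSeqNo) {
            "cannot renew from $retainingSeqNo to lower value $newRetainingSeqNo"
        }
        retainingSeqNo = newRetainingSeqNo
    }
}

fun main() {
    val localCheckpoint = 1_000L  // made-up value
    val globalCheckpoint = 990L   // made-up value; lags the local checkpoint

    // Mixed checkpoints: add at local, then renew at global. The renewal is rejected.
    val mixed = ToyLease(localCheckpoint)
    println(runCatching { mixed.renew(globalCheckpoint) }.isFailure) // true

    // Same checkpoint kind in both places: renewal succeeds as the checkpoint advances.
    val consistent = ToyLease(globalCheckpoint)
    consistent.renew(globalCheckpoint + 5)
    println(consistent.retainingSeqNo) // 995
}
```

Using one kind of checkpoint for both the add and the renew keeps the renewed value from moving backwards, as long as that checkpoint itself never decreases.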

Contributor Author

Ack, changed this to the global checkpoint (GCP).

var seqNo = indexShard.localCheckpoint + 1
// Adding retention lease at local checkpoint of a node. This makes sure
// new tasks spawned after node changes/shard movements are handled properly
log.info("Adding retentionlease at follower Sequence number: ${indexShard.localCheckpoint}")
Contributor

Minor: lowercase "s" in "Sequence"?

Contributor Author

Done

gbbafna previously approved these changes Jun 23, 2021
Collaborator

@gbbafna gbbafna left a comment

LGTM. Please add a backlog item to add an integration test (IT) for this scenario.

naveenpajjuri previously approved these changes Jun 23, 2021
Contributor

@naveenpajjuri naveenpajjuri left a comment

LGTM

@tbhanu-amzn tbhanu-amzn dismissed stale reviews from naveenpajjuri and gbbafna via a90bd17 June 23, 2021 10:52
Contributor

@naveenpajjuri naveenpajjuri left a comment

LGTM

@tbhanu-amzn tbhanu-amzn merged commit 819107a into main Jun 23, 2021