-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Segment Replication] Fix flaky test testRelocateWhileContinuouslyIndexingAndWaitingForRefresh #6619
[Segment Replication] Fix flaky test testRelocateWhileContinuouslyIndexingAndWaitingForRefresh #6619
Conversation
Signed-off-by: Rishikesh1159 <[email protected]>
final List<ShardRouting> replicationTargets = indexShard.getReplicationGroup().getReplicationTargets(); | ||
final List<ShardRouting> replicationTargets; | ||
try { | ||
replicationTargets = indexShard.getReplicationGroup().getReplicationTargets(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rishikesh1159 : Thanks for fixing this. Can you also please add more details around why this fix, reduces/fixes the flakyness here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. During the process of relocation of primary, primary still publishes checkpoints on refresh even if relocation is in progress. So, during relocation when handoff is completed we switch the primarymode to false. So, this handoff process and publishing checkpoints to replicas happen parallely in our scenario.
Before publishing checkpoints to replicas we do call getReplicationGroup() on IndexShard. And this methods asserts on primarymode. So, very rarely we end up in a situation where primarymode of shard is switched between checkpointRefreshListeners afterRefresh check here and this getReplicationGroup() assert.
By throwing an exception in getReplicationGroup() when shard not in primary mode we are reducing flakiness. But again we immediately have assert after this primarymode check in getReplicationGroup(), If relocation handoff completes between these two lines and primarymode switches, we may again fail assert, so there is flakiness but will not happen often.
We are not concerned about primary sending checkpoints to replica here, we are only trying to reduce flakiness.
Gradle Check (Jenkins) Run Completed with:
|
@Rishikesh1159 With the revert of the wait_until change, and the decision to not support that behavior with segrep, I think we should revert #6366 making this change obsolete. With that revert - #6637 is the only change we will need to release any wait_until reqs from nrt replicas. |
Sure @mch2, yes, if we revert changes made to PublishCheckpointAction class in #6366 then this PR and issue becomes obsolete. I will revert changes made to PublishCheckpointAction class by #6366, in PR : #6637 |
Gradle Check (Jenkins) Run Completed with:
|
Description
This PR fixes the flaky Test failing in
SegmentReplicationRelocationIT.testRelocateWhileContinuouslyIndexingAndWaitingForRefresh
. This PR is the best effort to reduce flakiness.Issues Resolved
#6531
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.