Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Segment Replication] Fix flaky test testRelocateWhileContinuouslyIndexingAndWaitingForRefresh #6619

Closed

Conversation

Rishikesh1159
Copy link
Member

@Rishikesh1159 Rishikesh1159 commented Mar 10, 2023

Description

This PR fixes the flaky Test failing in SegmentReplicationRelocationIT.testRelocateWhileContinuouslyIndexingAndWaitingForRefresh. This PR is the best effort to reduce flakiness.

Issues Resolved

#6531

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

final List<ShardRouting> replicationTargets = indexShard.getReplicationGroup().getReplicationTargets();
final List<ShardRouting> replicationTargets;
try {
replicationTargets = indexShard.getReplicationGroup().getReplicationTargets();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Rishikesh1159 : Thanks for fixing this. Can you also please add more details around why this fix, reduces/fixes the flakyness here.

Copy link
Member Author

@Rishikesh1159 Rishikesh1159 Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. During the process of relocation of primary, primary still publishes checkpoints on refresh even if relocation is in progress. So, during relocation when handoff is completed we switch the primarymode to false. So, this handoff process and publishing checkpoints to replicas happen parallely in our scenario.

Before publishing checkpoints to replicas we do call getReplicationGroup() on IndexShard. And this methods asserts on primarymode. So, very rarely we end up in a situation where primarymode of shard is switched between checkpointRefreshListeners afterRefresh check here and this getReplicationGroup() assert.

By throwing an exception in getReplicationGroup() when shard not in primary mode we are reducing flakiness. But again we immediately have assert after this primarymode check in getReplicationGroup(), If relocation handoff completes between these two lines and primarymode switches, we may again fail assert, so there is flakiness but will not happen often.

We are not concerned about primary sending checkpoints to replica here, we are only trying to reduce flakiness.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@mch2
Copy link
Member

mch2 commented Mar 11, 2023

@Rishikesh1159 With the revert of the wait_until change, and the decision to not support that behavior with segrep, I think we should revert #6366 making this change obsolete. With that revert - #6637 is the only change we will need to release any wait_until reqs from nrt replicas.

@Rishikesh1159
Copy link
Member Author

@Rishikesh1159 With the revert of the wait_until change, and the decision to not support that behavior with segrep, I think we should revert #6366 making this change obsolete. With that revert - #6637 is the only change we will need to release any wait_until reqs from nrt replicas.

Sure @mch2, yes, if we revert changes made to PublishCheckpointAction class in #6366 then this PR and issue becomes obsolete. I will revert changes made to PublishCheckpointAction class by #6366, in PR : #6637

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants