[Segment Replication] Fix flaky test testRelocateWhileContinuouslyIndexingAndWaitingForRefresh #6619

Rishikesh1159 · 2023-03-10T20:54:48Z

Description

This PR addresses the flaky test SegmentReplicationRelocationIT.testRelocateWhileContinuouslyIndexingAndWaitingForRefresh. It is a best-effort change to reduce flakiness.

Issues Resolved

#6531

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed per the DCO using --signoff
Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Rishikesh1159 <[email protected]>

dreamer-89 · 2023-03-10T21:22:14Z

server/src/main/java/org/opensearch/indices/replication/checkpoint/PublishCheckpointAction.java

-            final List<ShardRouting> replicationTargets = indexShard.getReplicationGroup().getReplicationTargets();
+            final List<ShardRouting> replicationTargets;
+            try {
+                replicationTargets = indexShard.getReplicationGroup().getReplicationTargets();


@Rishikesh1159: Thanks for fixing this. Can you also please add more details on why this fix reduces/fixes the flakiness here?

Sure. While a primary is relocating, it still publishes checkpoints on refresh. When the relocation handoff completes, we switch primary mode to false. In our scenario, the handoff and the publishing of checkpoints to replicas happen in parallel.

Before publishing checkpoints to replicas, we call getReplicationGroup() on IndexShard, and this method asserts on primary mode. So, very rarely, the shard's primary mode is switched between the checkpointRefreshListeners afterRefresh check here and this getReplicationGroup() assert.

By throwing an exception in getReplicationGroup() when the shard is not in primary mode, we reduce the flakiness. However, the assert still follows immediately after this primary mode check in getReplicationGroup(). If the relocation handoff completes between these two lines and primary mode switches, the assert can still fail, so some flakiness remains, but it should not happen often.

We are not concerned about the primary sending checkpoints to replicas here; we are only trying to reduce flakiness.
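To make the interleaving concrete, here is a minimal, self-contained sketch of the check-then-act race and the try/catch guard described above. It is a toy model, not the OpenSearch code: `ToyShard`, `publishCheckpoint`, the `betweenCheckAndAct` hook, the replica names, and the choice of `IllegalStateException` are all assumptions made for illustration. The hook stands in for the relocation handoff completing between the refresh-time check and the lookup of replication targets.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model of the race; every name here is hypothetical, not an OpenSearch class. */
public class PrimaryModeRaceSketch {

    /** Stand-in for a shard whose primary mode is switched off by relocation handoff. */
    static final class ToyShard {
        private final AtomicBoolean primaryMode = new AtomicBoolean(true);

        boolean isPrimaryMode() {
            return primaryMode.get();
        }

        /** Models the end of relocation handoff. */
        void completeHandoff() {
            primaryMode.set(false);
        }

        /** Throws (rather than asserting) when the shard has left primary mode. */
        List<String> getReplicationTargets() {
            if (!primaryMode.get()) {
                throw new IllegalStateException("shard is not in primary mode");
            }
            return List.of("replica-0", "replica-1");
        }
    }

    /**
     * Returns how many targets a checkpoint would be published to,
     * or 0 if publishing is skipped.
     */
    static int publishCheckpoint(ToyShard shard, Runnable betweenCheckAndAct) {
        // First check, analogous to the refresh-listener check.
        if (!shard.isPrimaryMode()) {
            return 0;
        }
        // Window in which the handoff may complete on another thread.
        betweenCheckAndAct.run();
        final List<String> targets;
        try {
            targets = shard.getReplicationTargets();
        } catch (IllegalStateException e) {
            // Primary mode flipped after the first check: skip publishing.
            return 0;
        }
        return targets.size();
    }

    public static void main(String[] args) {
        ToyShard stable = new ToyShard();
        System.out.println(publishCheckpoint(stable, () -> {})); // prints 2

        ToyShard relocating = new ToyShard();
        System.out.println(publishCheckpoint(relocating, relocating::completeHandoff)); // prints 0
    }
}
```

The sketch injects the handoff deterministically so the interleaving is reproducible. In a real concurrent run the window is narrow, which is consistent with the failure being rare.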

github-actions · 2023-03-10T22:02:45Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/12249/
CommitID: 926bd84
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

mch2 · 2023-03-11T23:26:33Z

@Rishikesh1159 With the revert of the wait_until change, and the decision to not support that behavior with segrep, I think we should revert #6366 making this change obsolete. With that revert - #6637 is the only change we will need to release any wait_until reqs from nrt replicas.

Rishikesh1159 · 2023-03-11T23:49:37Z

> @Rishikesh1159 With the revert of the wait_until change, and the decision to not support that behavior with segrep, I think we should revert #6366 making this change obsolete. With that revert - #6637 is the only change we will need to release any wait_until reqs from nrt replicas.

Sure @mch2. Yes, if we revert the changes made to the PublishCheckpointAction class in #6366, then this PR and the issue become obsolete. I will revert those changes in PR #6637.

github-actions · 2023-03-12T00:30:46Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/12299/
CommitID: 926bd84
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

Rishikesh1159 and others added 2 commits March 10, 2023 20:49

Fix flaky test

87d6618

Signed-off-by: Rishikesh1159 <[email protected]>

Merge branch 'opensearch-project:main' into bug-flaky-test

926bd84

Rishikesh1159 requested review from reta, anasalkouz, andrross, Bukhtawar, CEHENKLE, dblock, gbbafna, setiah, kartg, kotwanikunal, mch2, nknize, owaiskazi19, adnapibar, ryanbogan, saratvemulapalli, shwetathareja, dreamer-89, tlfeng, VachaShah and xuezhou25 as code owners March 10, 2023 20:54

Rishikesh1159 added the skip-changelog label Mar 10, 2023

dreamer-89 reviewed Mar 10, 2023

Rishikesh1159 closed this Mar 11, 2023
