-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Segment Replication] Fix flaky tests testSegmentReplicationStatsResponse() and testSegmentReplicationStatsWithTimeout() #6268
[Segment Replication] Fix flaky tests testSegmentReplicationStatsResponse() and testSegmentReplicationStatsWithTimeout() #6268
Conversation
Signed-off-by: Rishikesh1159 <[email protected]>
Gradle Check (Jenkins) Run Completed with:
|
server/src/internalClusterTest/java/org/opensearch/action/admin/ClientTimeoutIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/action/admin/ClientTimeoutIT.java
Show resolved
Hide resolved
private final Map<String, Long> timingData; | ||
|
||
// Timing data will have as many entries as stages, plus one | ||
private final Map<String, Long> timingData = new ConcurrentHashMap<>(Stage.values().length + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we still want to init this in constructors only and send this map over the wire, otherwise the map will empty on the other end when querying stats. You can make the map concurrent in the ctor or copy it to write.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going with copying the map and sending copy of map in output stream.
Signed-off-by: Rishikesh1159 <[email protected]>
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Codecov Report
📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more @@ Coverage Diff @@
## main #6268 +/- ##
============================================
- Coverage 70.82% 70.53% -0.29%
+ Complexity 59078 58782 -296
============================================
Files 4799 4799
Lines 282434 282441 +7
Branches 40717 40716 -1
============================================
- Hits 200041 199232 -809
- Misses 65977 66780 +803
- Partials 16416 16429 +13
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
Signed-off-by: Rishikesh1159 <[email protected]>
Gradle Check (Jenkins) Run Completed with:
|
Signed-off-by: Rishikesh1159 <[email protected]>
Signed-off-by: Rishikesh1159 <[email protected]>
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
// index 10 docs | ||
for (int i = 0; i < 10; i++) { | ||
client().prepareIndex(INDEX_NAME).setId(Integer.toString(i)).setSource("field", "value" + i).execute().actionGet(); | ||
public void testSegmentReplicationStatsResponse() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rishikesh1159 : I also fixed this flaky test as part of #6370. Please rebase your changes against main
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure let me rebase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, I see one run where this is still flaky #6366 (comment).
Previously, I root caused the problem happening due to race condition and fixed by asserting on replication state to be done which fixed it from 20% failing -> 0%. The assertBusy block by default waits for 10s
not sure if segment replication needs more time to complete. May be we can increase the assertBusy timeout to 60secs and check if this test is still flaky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes testSegmentReplicationStatsResponse() was flaky previously because of stage not reaching done within certain time. But now I have changed the test completely. As we are only testing API response and we are concerned only about API response but not about segment replication stage. So I changed the test and we no more assert on segrep stage which was causing flakiness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With changes in testSegmentReplicationStatsResponse() we only assert if we get a successful or failure response from API call. We are not concerned about if segment replication event has finished processing or not, as purpose of API is only to give response irrespective of state of segrep event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is still value in asserting the replication state as it can uncover bugs where replication never completes for various reasons and it is also part of API response. I will leave it for you to decide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can add asserts on replication state. It doesn't break anything but as you said there is a chance it can uncover some bugs. I have added that in latest commit. Thanks @dreamer-89
Gradle Check (Jenkins) Run Completed with:
|
Signed-off-by: Rishikesh1159 <[email protected]>
Gradle Check (Jenkins) Run Completed with:
|
Signed-off-by: Rishikesh1159 <[email protected]>
Gradle Check (Jenkins) Run Completed with:
|
@@ -121,6 +131,10 @@ public void testSegmentReplicationStatsResponseForActiveAndCompletedOnly() throw | |||
completedOnlyResponse.shardSegmentReplicationStates().get(INDEX_NAME).get(0).getStage(), | |||
SegmentReplicationState.Stage.DONE | |||
); | |||
assertThat( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use assertEquals
instead of assertThat
here. This is applicable for other usages of assertThat
as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure I will update all usages
// index 10 docs | ||
for (int i = 0; i < 10; i++) { | ||
client().prepareIndex(INDEX_NAME).setId(Integer.toString(i)).setSource("field", "value" + i).execute().actionGet(); | ||
public void testSegmentReplicationStatsResponse() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is still value in asserting the replication state as it can uncover bugs where replication never completes for various reasons and it is also part of API response. I will leave it for you to decide.
Signed-off-by: Rishikesh1159 <[email protected]>
Gradle Check (Jenkins) Run Completed with:
|
@@ -140,7 +140,7 @@ protected SegmentReplicationState shardOperation(SegmentReplicationStatsRequest | |||
singleIndexWithSegmentReplicationDisabled = shardRouting.getIndexName(); | |||
return null; | |||
} | |||
if (indexShard.indexSettings().isSegRepEnabled() == false) { | |||
if (indexShard.indexSettings().isSegRepEnabled() == false || shardRouting.primary()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change is needed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If shard is primary we don't even need to make a call to get segment replication state, as segment replication only happens on replica shards. If we don't have this primary check here, it will make a call to get segment replication state and check some collections and eventually return null on a primary shard. So with this addition of condition check we are just returning null early if it is a primary sahrd
} | ||
refresh(INDEX_NAME); | ||
waitForSearchableDocs(10L, asList(primaryNode, replicaNode)); | ||
ensureSearchable(INDEX_NAME); | ||
|
||
assertBusy(() -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned here, seen one instance where this test failed. May be we can increase the assertBusy timeout and see if it helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah sure. May be I can increase it to 60sec just to be certain that replication has completed as you said before.
Signed-off-by: Rishikesh1159 <[email protected]>
Gradle Check (Jenkins) Run Completed with:
|
Issue already opened for these flaky tests #5958 |
Signed-off-by: Rishikesh1159 [email protected]
Description
This PR :
-> Fixes flaky tests testSegmentReplicationStatsWithTimeout() and testSegmentReplicationStatsResponse().
-> Makes timingdata map to a concurrent map to handle concurrent modifications.
-> Moves timing data under detailed flag
Issues Resolved
Resolves #6261, #6255
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.