[BUG] Fix new flaky test org.opensearch.search.DeletePitMultiNodeTests.testDeletePitWhileNodeDrop
#4089
Comments
@bharath-techie could you please take a look before this becomes a problem? Thank you.
Ran 100 times, reproduced the failure locally once:
REPRODUCE WITH: ./gradlew 'null' --tests "org.opensearch.search.PitMultiNodeTests.testCreatePitWhileNodeDropWithAllowPartialCreationFalse" -Dtests.seed=892E2A0FCCF769CA -Dtests.locale=en-PH -Dtests.timezone=America/Hermosillo -Druntime.java=14
java.lang.AssertionError:
Reproduced again with: for i in {1..100}; do gradle ':server:test' --tests "org.opensearch.search.PitMultiNodeTests.testCreatePitWhileNodeDropWithAllowPartialCreationFalse"; done
Analysis so far: Three-node cluster setup with The test creates a random index which is replicated to 2 shards, apparently on
However, there is no triggering of ClusterState updated on
Then
Node is closed and PIT fails (as expected):
The index is then deleted (08.516) and checked (08.545-08.547), and the test fails before "after test" (08.561). The test fails because the index count is still 1.
I suspect a race condition: the 29 ms between the delete and the check is not long enough for the deletion to be reflected in the result.
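To illustrate the suspected race, here is a minimal, self-contained sketch of polling for a condition instead of checking it once right after the delete. It is not the test's actual code: `EventuallyCheck`, `waitUntil`, and the `indexCount` counter are hypothetical stand-ins, and the 50 ms delay is an arbitrary simulation of the delete being applied asynchronously.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class EventuallyCheck {

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition became true in time, false otherwise
     * (including when the waiting thread is interrupted).
     */
    public static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the cluster's index count; starts at 1 as in the failing run.
        AtomicInteger indexCount = new AtomicInteger(1);

        // Simulate the delete taking effect asynchronously after an arbitrary 50 ms.
        Thread deleter = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            indexCount.set(0);
        });
        deleter.start();

        // A single check right after triggering the delete may still see the old count.
        boolean immediate = indexCount.get() == 0;
        // Polling with a generous timeout tolerates the delay.
        boolean eventual = waitUntil(() -> indexCount.get() == 0, 5_000, 10);
        deleter.join();

        System.out.println("immediate=" + immediate + ", eventual=" + eventual);
    }
}
```

If this is indeed the cause, the framework's own `assertBusy` helper (assuming it is available from this test's base class) would be the more idiomatic way to express the same retry in the real test.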
Reproduced again (run 109 of 500). This time s1 was both the original cluster manager and the rebooted node, and s0 became the new manager. Again there was only a short time between deleting the index and "after test":
NOTE: this comment has been copied here. Continue further discussion there.
I have spent several hours digging into this issue. The initial failure in the subject line (...Delete...) is different from the failure reported in this comment (...Create...). The create failure is obviously a race condition and close to a Heisenbug: it regularly reproduces at least once in a run of 500, but after I added logging the successful runs extended much further, with the first failure on run 2463. Commonalities when it fails:
Tracing through the debug logs I added, I'm at the "how did this ever work" phase of debugging. The boolean OpenSearch/server/src/main/java/org/opensearch/action/search/CreatePitController.java Lines 123 to 153 in ea1cc9d
The getter
Looks like this code was implemented in #3921. @bharath-techie do you have any observations/comments that can further help debug this?
NOTE: The subject line of this Issue indicates it was likely fixed by #4632. However comments starting here are related to a different failure and belong on issue #4259. TLDR:
Describe the bug
New flaky test org.opensearch.search.DeletePitMultiNodeTests.testDeletePitWhileNodeDrop, first spotted in [1], introduced by [2].
[1] https://build.ci.opensearch.org/job/gradle-check/1269/testReport/junit/org.opensearch.search/DeletePitMultiNodeTests/testDeletePitWhileNodeDrop/
[2] #3949
To Reproduce
Expected behavior
Test must pass reliably
Plugins
Standard