[BUG] org.opensearch.upgrades.IndexingIT.testIndexingWithSegRep test failure #7679
Comments
@dreamer-89 can you fix this? |
Thank you @gashutos for creating this issue. I will look into it. |
Not sure if it's the same underlying issue, but this test also fails with
|
Re-opening this issue as I noticed this in #8198 |
Tried to run the test locally with failing seed but tests ran without failure. |
Sadly, the issue is not gone; the test is still failing (as of today), but for a different reason:
@sarthakaggarwal97 could you please take a look? thank you |
This is popping up quite a bit, suggest we mute again until fixed @dreamer-89 |
Upgrade failure: Assertion trips after two nodes are upgraded to 3.0.0
Assertion trip: The replica shard sitting on the only non-upgraded node
The replica shard test-index-segrep[2] on node
|
Some updates on progress here:
|
This is interesting. Thanks for catching this @andrross |
@mch2 has helped me out here, and it seems that in rare cases a replica gets stuck somewhere in the segment replication process. The replica node appears to try to replicate from a node that is no longer the primary, and that attempt apparently does not fail cleanly: replication gets stuck, the replica cannot catch up to the latest segments, and the test fails. I'm going to hand this issue off to @mch2 to continue to investigate/fix. |
@mch2 found that the replica seems to be getting stuck retrying replication from a node that has gone down (presumably for upgrade). A new primary exists, but the replica node is stuck retrying the connection error, with a default retry timeout of 1 minute. I ran an experiment where I hardcoded the retry timeout to 1 second instead of the default 1 minute (on both the 2.x version and the main version), and the test was able to run all weekend (>2400 iterations) with no failures. |
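To make the timing argument above concrete, here is a toy model of the suspected behavior. It is not OpenSearch code: the function names, the 0.5-second retry interval, and the 30-second test wait are all hypothetical, chosen only to show why a replica that keeps retrying a dead node for its full retry budget can miss a shorter deadline in the test.

```python
def seconds_until_recovery(retry_timeout_s, retry_interval_s=0.5):
    """Simulated time the replica spends retrying the node that went down.

    Assumption: the replica does not look up the new primary until its
    retry budget for the old one is exhausted.
    """
    elapsed = 0.0
    while elapsed < retry_timeout_s:
        # Each attempt hits the same connection error and waits again.
        elapsed += retry_interval_s
    # Only at this point would the replica switch to the current primary.
    return elapsed


def replica_catches_up(retry_timeout_s, test_wait_s):
    """True if the replica can switch primaries before the test stops waiting."""
    return seconds_until_recovery(retry_timeout_s) <= test_wait_s


if __name__ == "__main__":
    # Hypothetical 30 s wait in the test; 60 s mirrors the default retry timeout.
    print(replica_catches_up(retry_timeout_s=60, test_wait_s=30))
    print(replica_catches_up(retry_timeout_s=1, test_wait_s=30))
```

Under these assumptions, shortening the retry timeout only narrows the window in which the replica is stuck; it does not make the replica notice the new primary any earlier than the retry budget allows.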
Describe the bug
./gradlew check is failing for #7453