[CI] CCR: testFollowIndexAndCloseNode fails #33337
Pinging @elastic/es-distributed
I've muted this test on master and 6.x.
This test failed because we assigned different terms (1 and 2) to seq=133. This happened in these steps:
One solution I can see is to let ShardFollowNodeTask assign the primary term once before dispatching operations. We can piggyback the current term on the primary in responses (even though we won't have the latest term on ShardFollowNodeTask all the time). @bleskes and @jasontedor WDYT?
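To make the proposal concrete, here is a minimal sketch of "stamp the term once, reuse it on retries". It is a toy model, not Elasticsearch code: `Op`, `FollowerHistory`, `FollowTask`, and `stampOnce` are hypothetical names invented for illustration, and the real ShardFollowNodeTask may be structured quite differently.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only. None of these classes exist in Elasticsearch; all names are hypothetical.
final class TermStampingSketch {

    /** An operation identified by its sequence number and stamped with a primary term. */
    static final class Op {
        final long seqNo;
        final long primaryTerm;

        Op(long seqNo, long primaryTerm) {
            this.seqNo = seqNo;
            this.primaryTerm = primaryTerm;
        }
    }

    /** Follower-side history that rejects a second, different term for the same seqNo. */
    static final class FollowerHistory {
        private final Map<Long, Long> termBySeqNo = new HashMap<>();

        void apply(Op op) {
            Long previous = termBySeqNo.putIfAbsent(op.seqNo, op.primaryTerm);
            if (previous != null && previous != op.primaryTerm) {
                throw new IllegalStateException(
                    "seq=" + op.seqNo + " already has term " + previous + ", got term " + op.primaryTerm);
            }
        }
    }

    /** Hypothetical follow task: stamps each seqNo once and returns the same stamp on retries. */
    static final class FollowTask {
        private final Map<Long, Op> stamped = new HashMap<>();

        Op stampOnce(long seqNo, long termKnownToTask) {
            // The term is only used the first time this seqNo is seen.
            return stamped.computeIfAbsent(seqNo, s -> new Op(s, termKnownToTask));
        }
    }

    public static void main(String[] args) {
        FollowTask task = new FollowTask();
        FollowerHistory history = new FollowerHistory();

        // First dispatch of seq=133 while the task knows term 1.
        history.apply(task.stampOnce(133L, 1L));

        // The task later learns term 2 and retries seq=133; the original stamp is reused.
        Op retry = task.stampOnce(133L, 2L);
        history.apply(retry);
        System.out.println("seq=" + retry.seqNo + " term=" + retry.primaryTerm);
    }
}
```

Note that the stamps in this sketch live only in memory, so they would not survive a restart of the task.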
This is still an issue with the proposal if ShardFollowNodeTask is restarted. Or let's relax the
I think this specific situation will be avoided with rollbacks in place, correct? It did make me think of a deeper problem. I added this as a discussion topic for our next sync.
No, as we are testing with 1 replica on both sides.
I've un-muted this test. It should be okay now as the FollowingEngine will skip an operation which was already processed before (see #34099). I am closing this issue since @jasontedor is working on a broader fix which covers this issue. /cc @martijnvg
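The idea of skipping an operation that was already processed can be sketched as follows. This is only a toy model under stated assumptions: `SkipProcessedSketch` and its methods are hypothetical, and tracking sequence numbers in an in-memory set is a stand-in for whatever mechanism #34099 actually uses in FollowingEngine, which this thread does not show.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only. This is not the FollowingEngine implementation; all names are hypothetical.
final class SkipProcessedSketch {
    private final Set<Long> processedSeqNos = new HashSet<>();
    private int applied = 0;

    /** Applies the operation unless its seqNo was seen before. Returns true if it was applied. */
    boolean index(long seqNo) {
        if (!processedSeqNos.add(seqNo)) {
            return false; // duplicate delivery, for example a retry after a failover
        }
        applied++;
        return true;
    }

    int appliedCount() {
        return applied;
    }

    public static void main(String[] args) {
        SkipProcessedSketch engine = new SkipProcessedSketch();
        System.out.println(engine.index(133L)); // first delivery
        System.out.println(engine.index(133L)); // redelivery of the same seqNo
        System.out.println(engine.appliedCount());
    }
}
```

With this kind of check, a redelivered seq=133 is dropped instead of being applied a second time, possibly under a different term.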
We haven't resolved this issue completely. I will mute this test.
This failed again for me in a PR CI run: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request/1500/console
Oh, I take that back; my branch might be stale. I will merge in the latest master.
Sorry for the noise @talevy. I am re-opening this and will look into it soon.
Thanks @dnhatn. This branch was working off of master as of yesterday, at this point in history, I believe: https://github.com/elastic/elasticsearch/tree/7ef65dedc36735a0e84f482bd9fbc4acab9f7a17
Another failure in this suite, this time on 6.5. The actual errors look different but may be related, so I'm documenting them here rather than opening a new issue. Obviously does not reproduce:
Some of the errors are:
Another failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/135/console Does not reproduce:
The failed assertion is different to the one in the opening comment
The suite FollowerFailOverIT is failing because some documents are not replicated to the follower. Maybe the FollowTask is not working as expected or the background indexers eat all resources while the follower cluster is trying to reform after a failover; then CI is not fast enough to replicate all the indexed docs within 60 seconds (sometimes I see 80k docs on the leader). This commit limits the number of documents to be indexed into the leader index by the background threads so that we can eliminate the latter case. This change also replaces a docCount assertion with a docIds assertion so we can have more information if these tests fail again. Relates #33337
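As a rough illustration of why a docIds assertion gives more information than a docCount assertion, the sketch below compares two sets of document IDs and reports exactly which ones differ. It is not the test code from the commit: `DocIdsAssertionSketch`, `missingOnFollower`, and `assertSameDocIds` are hypothetical helpers, and the IDs are made-up sample values.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative only. These helpers are hypothetical and not taken from FollowerFailOverIT.
final class DocIdsAssertionSketch {

    /** Returns the IDs present in leaderIds but absent from followerIds, in sorted order. */
    static Set<String> missingOnFollower(Set<String> leaderIds, Set<String> followerIds) {
        Set<String> missing = new TreeSet<>(leaderIds);
        missing.removeAll(followerIds);
        return missing;
    }

    /** Fails with the concrete differing IDs rather than just two mismatched counts. */
    static void assertSameDocIds(Set<String> leaderIds, Set<String> followerIds) {
        Set<String> missing = missingOnFollower(leaderIds, followerIds);
        Set<String> unexpected = missingOnFollower(followerIds, leaderIds);
        if (!missing.isEmpty() || !unexpected.isEmpty()) {
            throw new AssertionError(
                "missing on follower: " + missing + ", unexpected on follower: " + unexpected);
        }
    }

    public static void main(String[] args) {
        Set<String> leader = Set.of("1", "2", "3");
        Set<String> follower = Set.of("1", "3");
        try {
            assertSameDocIds(leader, follower);
        } catch (AssertionError e) {
            // A count-based check would only say "expected 3 but was 2".
            System.out.println(e.getMessage());
        }
    }
}
```

Knowing which IDs are missing makes it possible to check whether they cluster on one shard or around the time of the failover.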
This started to fail again. The latest failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+g1gc/113/console The test checks whether all operations have fully replicated to the follower shards, but it turns out that one shard has no ops:
After a random non-master node is stopped in the follower cluster, the following warnings show up in the log:
I'm trying to understand why these follower shards are marked as stale.
Failed on intake too: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1194/console
This failure is different than before. I'm looking into this now.
Turns out this is not because of delayed allocation. No attempt was made to assign a replica shard of the follower index. I'm going to add more information in the logs of this test.
Another one here: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+internalClusterTest/70/console
This looks like another one today: @dnhatn Let me know if I should mute this test again or if you are still looking for more logs.
Another one today: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+internalClusterTest/528/console @dnhatn Can you mute this if you got what you are looking for in the logs?
@cbuescher Thanks for the ping. I'll mute this test now.
Tracked at elastic#33337
I ran into this on one of my PR builds and investigated it a bit. I can reproduce this around 1/50 times. I start the single test method in IntelliJ, with Repeat until failure on.
@dnhatn Can this test be unmuted now?
@martijnvg We need to wait for #39467.
Resolved by #39584
I have unmuted this test.
testFollowIndexAndCloseNode fails on 6.x:
This may be the actual reason.
CI: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/2388/console
Log: testFollowIndexAndCloseNode.txt.zip