rolling upgrade fails test {p0=upgraded_cluster/10_basic/Continue scroll after upgrade} #46529
Comments
Pinging @elastic/es-search
I took a look at this and couldn't get it to reproduce reliably. It may be a timeout issue if the test machine is running very slow.
It looks like @ywelsch recently merged a change to prevent shard relocations from happening during the upgrade (#48525). Could a shard reallocation have caused the 'search context missing' error? This seems plausible to me, but I am not a scroll expert -- perhaps @markharwood or @jimczi would be able to weigh in?
I think this was fixed by #48525. No more failures of this test after that fix was merged. Closing this.
@ywelsch the test was immediately disabled, so unfortunately we don't have good information about the failure rate. I will try reenabling it, and we can reopen this issue if it fails again. As a note, I looked through the full build log more carefully and saw that before the scroll failure, a bunch of tasks have piled up:
@jtibshirani good catch. I had missed that the test was still disabled. Let's reenable this both on master and 7.x and see if it is reoccurring (and reopen this issue then if necessary).
@ywelsch this seems to be failing again, e.g. on master in https://gradle-enterprise.elastic.co/s/xxmcg4qhvabtw
It looks like Mark's theory here is correct. I've looked through the node logs to find the events where the scroll was started and where it was continued. The two are about 5:30 apart, i.e. just above the 5-minute scroll timeout:
@jtibshirani can you adjust the scroll timeout in the test?
Will do, thanks to you both for the debugging help.
In the YAML cluster upgrade tests, we start a scroll in a mixed-version cluster, then attempt to continue the scroll after the upgrade is complete. This test occasionally fails because the scroll can expire before the cluster is done upgrading. The current scroll keep-alive time is 5m. This PR bumps it to 10m, which gives a good buffer, since in failing tests the time was only exceeded by ~30 seconds. Addresses #46529.
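To make the timing argument concrete, here is a minimal sketch in Python of the arithmetic behind the bump. It is not the test code itself: the helper names `parse_keep_alive`, `scroll_expired`, and `headroom` are hypothetical, and the model is simplified (it treats the keep-alive as a hard deadline measured from the first scroll request). Only the 5m and 10m values and the roughly 5:30 gap come from this thread.

```python
import re
from datetime import timedelta

# Subset of time units accepted by this sketch (hypothetical helper, not an
# Elasticsearch API). Maps a unit suffix to a timedelta keyword argument.
_UNITS = {
    "d": "days",
    "h": "hours",
    "m": "minutes",
    "s": "seconds",
    "ms": "milliseconds",
}


def parse_keep_alive(value: str) -> timedelta:
    """Parse a keep-alive string such as '5m' or '10m' into a timedelta."""
    match = re.fullmatch(r"(\d+)(ms|d|h|m|s)", value)
    if match is None:
        raise ValueError(f"unsupported keep-alive value: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def scroll_expired(gap: timedelta, keep_alive: str) -> bool:
    """Simplified model: the scroll is gone once the gap exceeds the keep-alive."""
    return gap > parse_keep_alive(keep_alive)


def headroom(gap: timedelta, keep_alive: str) -> timedelta:
    """Time left before expiry (negative if the scroll has already expired)."""
    return parse_keep_alive(keep_alive) - gap


# Gap between starting and continuing the scroll, as reported in this thread.
observed_gap = timedelta(minutes=5, seconds=30)

for keep_alive in ("5m", "10m"):
    print(
        keep_alive,
        "expired" if scroll_expired(observed_gap, keep_alive) else "alive",
        headroom(observed_gap, keep_alive),
    )
```

Under these assumptions, the observed gap overshoots a 5m keep-alive by 30 seconds and leaves 4 minutes 30 seconds of slack with 10m.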
I've now bumped the keep-alive time from 5 to 10 minutes. I'll leave this open for a couple weeks, then close it out if we don't see more failures pop up.
I haven't seen new failures since we bumped the keep-alive time, so I will close this out.
Example: https://gradle-enterprise.elastic.co/s/25kkewilzdsps/tests/jyp47bhnp6lbq-mbgvjsajqkfsa
Seems to affect both 7.x and master.