
rolling upgrade fails test {p0=upgraded_cluster/10_basic/Continue scroll after upgrade} #46529

Closed
alpar-t opened this issue Sep 10, 2019 · 12 comments
Assignees
Labels
:Search/Search Search-related issues that do not fall into other categories >test-failure Triaged test failures from CI

Comments

@alpar-t
Contributor

alpar-t commented Sep 10, 2019

Example:
https://gradle-enterprise.elastic.co/s/25kkewilzdsps/tests/jyp47bhnp6lbq-mbgvjsajqkfsa

Seems to affect both 7.x and master

@alpar-t added the :Search/Search and >test-failure labels on Sep 10, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

alpar-t added a commit that referenced this issue Sep 10, 2019
alpar-t added a commit that referenced this issue Sep 10, 2019
@markharwood
Contributor

I took a look at this and couldn't get it to reproduce reliably.

It may be a timeout issue if the test machine is running very slowly.
The test logic tries to pick up a scroll id that was left by a previous test. The timeout for the scroll is 5 minutes, so if things are running especially slowly this may be insufficient, and we see errors relating to a lost search context. We could look at increasing the timeout setting for this scroll.
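
For context, here is a rough sketch of the two halves of such a test in the YAML REST test format. The index name, stashed variable, and exact steps are illustrative assumptions; only the 5m keep-alive and the "continue after the upgrade" shape come from the discussion above.

# Mixed/old-cluster side: open a scroll with a 5m keep-alive and remember its id
# (per the comment above, the id is left behind for the upgraded-cluster suite to pick up).
- do:
    search:
      index: upgraded_scroll
      scroll: 5m
      body:
        size: 1
        query:
          match_all: {}
- set: { _scroll_id: scroll_id }

# Upgraded-cluster side: continue the scroll. If more than 5m elapsed while the
# cluster was upgrading, the search context has already been reaped and this
# request fails with SearchContextMissingException.
- do:
    scroll:
      body:
        scroll_id: $scroll_id
        scroll: 5m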

@jtibshirani self-assigned this Sep 30, 2019
@jtibshirani
Contributor

It looks like @ywelsch recently merged a change to prevent shard relocations from happening during the upgrade (#48525). Could a shard reallocation have caused the 'search context missing' error? This seems plausible to me, but I am not a scroll expert -- perhaps @markharwood or @jimczi would be able to weigh in?

@ywelsch
Contributor

ywelsch commented Dec 12, 2019

I think this was fixed by #48525. There have been no more failures of this test since that fix was merged. Closing this.

@ywelsch closed this as completed Dec 12, 2019
@jtibshirani
Contributor

jtibshirani commented Dec 12, 2019

@ywelsch the test was immediately disabled, so unfortunately we don't have good information about the failure rate. I will try reenabling it, and we can reopen this issue if it fails again.

As a note, I looked through the full build log more carefully and saw that before the scroll failure, a bunch of tasks had piled up:

java.lang.AssertionError: there are still running tasks:
    {time_in_queue=2.3s, time_in_queue_millis=2339, source=shard-failed, executing=true, priority=HIGH, insert_order=444}
    {time_in_queue=1m, time_in_queue_millis=62781, source=finish persistent task (success), executing=false, priority=NORMAL, insert_order=357}
    {time_in_queue=1m, time_in_queue_millis=61210, source=update task state, executing=false, priority=NORMAL, insert_order=359}
    {time_in_queue=1m, time_in_queue_millis=62211, source=update task state, executing=false, priority=NORMAL, insert_order=358}
    {time_in_queue=57.8s, time_in_queue_millis=57827, source=cluster_reroute(reroute after starting shards), executing=false, priority=NORMAL, insert_order=366}
    {time_in_queue=1m, time_in_queue_millis=60212, source=update task state, executing=false, priority=NORMAL, insert_order=360}
    {time_in_queue=59.2s, time_in_queue_millis=59212, source=update task state, executing=false, priority=NORMAL, insert_order=361}
    {time_in_queue=58.2s, time_in_queue_millis=58211, source=update task state, executing=false, priority=NORMAL, insert_order=362}
    {time_in_queue=50.2s, time_in_queue_millis=50211, source=update task state, executing=false, priority=NORMAL, insert_order=378}
    {time_in_queue=53.2s, time_in_queue_millis=53210, source=update task state, executing=false, priority=NORMAL, insert_order=372}
    {time_in_queue=57.2s, time_in_queue_millis=57212, source=update task state, executing=false, priority=NORMAL, insert_order=368}
    {time_in_queue=56.2s, time_in_queue_millis=56211, source=update task state, executing=false, priority=NORMAL, insert_order=369}
    {time_in_queue=55.2s, time_in_queue_millis=55210, source=update task state, executing=false, priority=NORMAL, insert_order=370}
...

@jtibshirani
Contributor

jtibshirani commented Dec 12, 2019

The test was reenabled.

@ywelsch
Contributor

ywelsch commented Dec 12, 2019

@jtibshirani good catch. I had missed that the test was still disabled. Let's reenable it on both master and 7.x and see if it recurs (and reopen this issue then if necessary).

@dliappis
Contributor

@ywelsch this seems to be failing again, e.g. on master in https://gradle-enterprise.elastic.co/s/xxmcg4qhvabtw

@ywelsch
Contributor

ywelsch commented Dec 13, 2019

It looks like Mark's theory here is correct. I've looked through the node logs to find the event where the scroll was (approximately) started and the one where it was continued. The two are about 5:30 apart, i.e. just over the 5-minute scroll timeout:

[2019-12-13T09:17:30,289][INFO ][o.e.c.m.MetaDataCreateIndexService] [v8.0.0-2] [upgraded_scroll] creating index, cause [api], templates [], shards [1]/[0], mappings []
...
[2019-12-13T09:23:03,402][DEBUG][o.e.a.s.TransportSearchScrollAction] [v8.0.0-0] [126] Failed to execute query phase
org.elasticsearch.transport.RemoteTransportException: [v8.0.0-0][127.0.0.1:41727][indices:data/read/search[phase/query+fetch/scroll]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [126]

@jtibshirani can you adjust the scroll timeout in the test?
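
To make the arithmetic explicit, a rough timeline from the two log lines above (the expiry estimate assumes the usual scroll semantics, where the context lives for the requested keep-alive and a follow-up scroll call only renews it if it arrives before expiry):

09:17:30  upgraded_scroll index created, scroll opened shortly after with scroll=5m  ->  context expires around 09:22:30
09:23:03  first follow-up scroll request  ->  SearchContextMissingException for context id [126]
          elapsed: roughly 5m30s, i.e. about 30s past the 5m keep-alive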

@jtibshirani
Contributor

Will do; thanks to you both for the debugging help.

jtibshirani added a commit that referenced this issue Dec 16, 2019
In the yaml cluster upgrade tests, we start a scroll in a mixed-version cluster,
then attempt to continue the scroll after the upgrade is complete. This test
occasionally fails because the scroll can expire before the cluster is done
upgrading.

The current scroll keep-alive time is 5m. This PR bumps it to 10m, which gives a
good buffer, since in failing tests the timeout was only exceeded by ~30 seconds.

Addresses #46529.
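
A minimal sketch of what the bump amounts to in the test, reusing the shape from the earlier sketch; the index name and surrounding steps are assumptions, and only the 5m -> 10m change comes from the commit message above.

- do:
    search:
      index: upgraded_scroll
      scroll: 10m            # was 5m; failing runs overshot the old value by only ~30s
      body:
        size: 1
        query:
          match_all: {}
- do:
    scroll:
      body:
        scroll_id: $scroll_id
        scroll: 10m          # keep the follow-up keep-alive in step with the initial request
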
jtibshirani added a commit that referenced this issue Dec 16, 2019
jtibshirani added a commit that referenced this issue Dec 16, 2019
@jtibshirani
Contributor

I've now bumped the keep-alive time from 5 to 10 minutes. I'll leave this open for a couple weeks, then close it out if we don't see more failures pop up.

@jtibshirani
Contributor

I haven't seen new failures since we bumped the keep-alive time, so I will close this out.

SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020