
rolling upgrade fails test {p0=upgraded_cluster/10_basic/Continue scroll after upgrade} #46529

Closed
alpar-t opened this issue Sep 10, 2019 · 12 comments
Assignees
Labels
:Search/Search Search-related issues that do not fall into other categories >test-failure Triaged test failures from CI

Comments

@alpar-t
Contributor

alpar-t commented Sep 10, 2019

Example:
https://gradle-enterprise.elastic.co/s/25kkewilzdsps/tests/jyp47bhnp6lbq-mbgvjsajqkfsa

Seems to affect both 7.x and master

@alpar-t added the :Search/Search and >test-failure labels on Sep 10, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

alpar-t added a commit that referenced this issue Sep 10, 2019
alpar-t added a commit that referenced this issue Sep 10, 2019
@markharwood
Contributor

I took a look at this and couldn't get it to reproduce reliably.

It may be a timeout issue if the test machine is running very slowly.
The test logic tries to pick up a scroll id that was left by a previous test. The timeout for the scroll is 5 minutes, so if things are running especially slowly this may be insufficient, and we see errors relating to a lost search context. We could look at increasing the timeout setting for this scroll.
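
For context, here is a rough sketch of the two halves of such a test in the YAML REST test format. The index name, stashed variable, and exact steps are illustrative assumptions; only the 5m keep-alive and the "continue after the upgrade" shape come from the discussion above.

# Mixed/old-cluster side: open a scroll with a 5m keep-alive and remember its id
# (per the comment above, the id is left behind for the upgraded-cluster suite to pick up).
- do:
    search:
      index: upgraded_scroll
      scroll: 5m
      body:
        size: 1
        query:
          match_all: {}
- set: { _scroll_id: scroll_id }

# Upgraded-cluster side: continue the scroll. If more than 5m elapsed while the
# cluster was upgrading, the search context has already been reaped and this
# request fails with SearchContextMissingException.
- do:
    scroll:
      body:
        scroll_id: $scroll_id
        scroll: 5m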

@jtibshirani self-assigned this Sep 30, 2019
@jtibshirani
Contributor

It looks like @ywelsch recently merged a change to prevent shard relocations from happening during the upgrade (#48525). Could a shard reallocation have caused the 'search context missing' error? This seems plausible to me, but I am not a scroll expert -- perhaps @markharwood or @jimczi would be able to weigh in?

@ywelsch
Contributor

ywelsch commented Dec 12, 2019

I think this was fixed by #48525. There have been no more failures of this test since that fix was merged. Closing this.

@ywelsch closed this as completed Dec 12, 2019
@jtibshirani
Contributor

jtibshirani commented Dec 12, 2019

@ywelsch the test was immediately disabled, so unfortunately we don't have good information about the failure rate. I will try reenabling it, and we can reopen this issue if it fails again.

As a note, I looked through the full build log more carefully and saw that before the scroll failure, a bunch of tasks had piled up:

java.lang.AssertionError: there are still running tasks:
    {time_in_queue=2.3s, time_in_queue_millis=2339, source=shard-failed, executing=true, priority=HIGH, insert_order=444}
    {time_in_queue=1m, time_in_queue_millis=62781, source=finish persistent task (success), executing=false, priority=NORMAL, insert_order=357}
    {time_in_queue=1m, time_in_queue_millis=61210, source=update task state, executing=false, priority=NORMAL, insert_order=359}
    {time_in_queue=1m, time_in_queue_millis=62211, source=update task state, executing=false, priority=NORMAL, insert_order=358}
    {time_in_queue=57.8s, time_in_queue_millis=57827, source=cluster_reroute(reroute after starting shards), executing=false, priority=NORMAL, insert_order=366}
    {time_in_queue=1m, time_in_queue_millis=60212, source=update task state, executing=false, priority=NORMAL, insert_order=360}
    {time_in_queue=59.2s, time_in_queue_millis=59212, source=update task state, executing=false, priority=NORMAL, insert_order=361}
    {time_in_queue=58.2s, time_in_queue_millis=58211, source=update task state, executing=false, priority=NORMAL, insert_order=362}
    {time_in_queue=50.2s, time_in_queue_millis=50211, source=update task state, executing=false, priority=NORMAL, insert_order=378}
    {time_in_queue=53.2s, time_in_queue_millis=53210, source=update task state, executing=false, priority=NORMAL, insert_order=372}
    {time_in_queue=57.2s, time_in_queue_millis=57212, source=update task state, executing=false, priority=NORMAL, insert_order=368}
    {time_in_queue=56.2s, time_in_queue_millis=56211, source=update task state, executing=false, priority=NORMAL, insert_order=369}
    {time_in_queue=55.2s, time_in_queue_millis=55210, source=update task state, executing=false, priority=NORMAL, insert_order=370}
...

@jtibshirani
Contributor

jtibshirani commented Dec 12, 2019

The test was reenabled.

@ywelsch
Contributor

ywelsch commented Dec 12, 2019

@jtibshirani good catch. I had missed that the test was still disabled. Let's reenable it on both master and 7.x and see if it recurs (and reopen this issue then if necessary).

@dliappis
Contributor

@ywelsch this seems to be failing again, e.g. on master in https://gradle-enterprise.elastic.co/s/xxmcg4qhvabtw

@ywelsch
Contributor

ywelsch commented Dec 13, 2019

It looks like Mark's theory here is correct. I've looked through the node logs to find the event where the scroll was (approximately) started and the one where it was continued. The two are about 5:30 apart, i.e. just over the 5-minute scroll timeout:

[2019-12-13T09:17:30,289][INFO ][o.e.c.m.MetaDataCreateIndexService] [v8.0.0-2] [upgraded_scroll] creating index, cause [api], templates [], shards [1]/[0], mappings []
...
[2019-12-13T09:23:03,402][DEBUG][o.e.a.s.TransportSearchScrollAction] [v8.0.0-0] [126] Failed to execute query phase
org.elasticsearch.transport.RemoteTransportException: [v8.0.0-0][127.0.0.1:41727][indices:data/read/search[phase/query+fetch/scroll]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [126]

@jtibshirani can you adjust the scroll timeout in the test?
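
To make the arithmetic explicit, a rough timeline from the two log lines above (the expiry estimate assumes the usual scroll semantics, where the context lives for the requested keep-alive and a follow-up scroll call only renews it if it arrives before expiry):

09:17:30  upgraded_scroll index created, scroll opened shortly after with scroll=5m  ->  context expires around 09:22:30
09:23:03  first follow-up scroll request  ->  SearchContextMissingException for context id [126]
          elapsed: roughly 5m30s, i.e. about 30s past the 5m keep-alive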

@jtibshirani
Contributor

Will do; thanks to you both for the debugging help.

jtibshirani added a commit that referenced this issue Dec 16, 2019
In the yaml cluster upgrade tests, we start a scroll in a mixed-version cluster,
then attempt to continue the scroll after the upgrade is complete. This test
occasionally fails because the scroll can expire before the cluster is done
upgrading.

The current scroll keep-alive time is 5m. This PR bumps it to 10m, which gives a
good buffer, since in failing tests the timeout was only exceeded by ~30 seconds.

Addresses #46529.
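
A minimal sketch of what the bump amounts to in the test, reusing the shape from the earlier sketch; the index name and surrounding steps are assumptions, and only the 5m -> 10m change comes from the commit message above.

- do:
    search:
      index: upgraded_scroll
      scroll: 10m            # was 5m; failing runs overshot the old value by only ~30s
      body:
        size: 1
        query:
          match_all: {}
- do:
    scroll:
      body:
        scroll_id: $scroll_id
        scroll: 10m          # keep the follow-up keep-alive in step with the initial request
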
jtibshirani added a commit that referenced this issue Dec 16, 2019
jtibshirani added a commit that referenced this issue Dec 16, 2019
@jtibshirani
Contributor

I've now bumped the keep-alive time from 5 to 10 minutes. I'll leave this open for a couple weeks, then close it out if we don't see more failures pop up.

@jtibshirani
Contributor

I haven't seen new failures since we bumped the keep-alive time, so I will close this out.

SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020