SharedClusterSnapshotRestoreIT#testGetSnapshotsRequest fails #31054

Closed
danielmitterdorfer opened this issue Jun 4, 2018 · 7 comments
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI v5.6.11

Comments

@danielmitterdorfer
Member

CI link: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+5.6+multijob-unix-compatibility/os=oraclelinux/1127/consoleFull

REPRODUCE WITH: ./gradlew :core:integTest \
  -Dtests.seed=683E87E60D1BDBA \
  -Dtests.class=org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT \
  -Dtests.method="testGetSnapshotsRequest" \
  -Dtests.security.manager=true \
  -Dtests.locale=hr \
  -Dtests.timezone=Atlantic/Bermuda

(does not reproduce locally)

Failure output:

[...]
07:06:40   1> [2018-06-04T04:06:12,400][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] --> make sure duplicates are not returned in the response
07:06:40   1> [2018-06-04T04:06:12,633][INFO ][o.e.s.SnapshotShardsService] [node_s1] snapshot [test-repo:softezhl/dyUGda5gTdyomhUCMkNYWQ] is done
07:06:40   1> [2018-06-04T04:06:12,705][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [SharedClusterSnapshotRestoreIT#testGetSnapshotsRequest]: cleaning up after test
07:06:40   1> [2018-06-04T04:06:12,708][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node_s1] [test-idx/MA3e2loDQ3Cz2tYHeU4aew] deleting index
[...]
07:06:40   1> [2018-06-04T04:06:12,762][INFO ][o.e.r.RepositoriesService] [node_s1] delete repository [test-repo]
07:06:40   1> [2018-06-04T04:06:12,764][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [SharedClusterSnapshotRestoreIT#testGetSnapshotsRequest]: cleaned up after test
07:06:40   1> [2018-06-04T04:06:12,764][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testGetSnapshotsRequest]: after test
07:06:40 ERROR   31.1s J1 | SharedClusterSnapshotRestoreIT.testGetSnapshotsRequest <<< FAILURES!
07:06:40    > Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=10538, name=elasticsearch[node_s0][clusterService#updateTask][T#1], state=RUNNABLE, group=TGRP-SharedClusterSnapshotRestoreIT]
07:06:40    > 	at __randomizedtesting.SeedInfo.seed([683E87E60D1BDBA:8703E16AE0A80C77]:0)
07:06:40    > Caused by: java.lang.AssertionError
07:06:40    > 	at __randomizedtesting.SeedInfo.seed([683E87E60D1BDBA]:0)
07:06:40    > 	at org.elasticsearch.cluster.SnapshotsInProgress$ShardSnapshotStatus.<init>(SnapshotsInProgress.java:245)
07:06:40    > 	at org.elasticsearch.snapshots.SnapshotShardsService.notifyFailedSnapshotShard(SnapshotShardsService.java:546)
07:06:40    > 	at org.elasticsearch.snapshots.SnapshotShardsService.processIndexShardSnapshots(SnapshotShardsService.java:296)
07:06:40    > 	at org.elasticsearch.snapshots.SnapshotShardsService.applyClusterState(SnapshotShardsService.java:165)
07:06:40    > 	at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:814)
07:06:40    > 	at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:768)
07:06:40    > 	at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:587)
07:06:40    > 	at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.run(ClusterService.java:263)
07:06:40    > 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
07:06:40    > 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
07:06:40    > 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575)
07:06:40    > 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:247)
07:06:40    > 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:210)
07:06:40    > 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
07:06:40    > 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
07:06:40    > 	at java.lang.Thread.run(Thread.java:748)

The assertion in ShardSnapshotStatus (SnapshotsInProgress.java:245) indicates that a status was constructed with a failed state but without a reason.

@danielmitterdorfer danielmitterdorfer added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI v5.6.10 labels Jun 4, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@bleskes
Contributor

bleskes commented Jun 4, 2018

@tlrx does this ring a bell to you?

@tlrx
Member

tlrx commented Jun 4, 2018

@bleskes Yes, this is the same issue as #26480, which was fixed in 6.0 and later by #28130. We decided not to backport the fix to 5.x because it was non-trivial, but it's annoying that this failure comes up from time to time on 5.6. So maybe I should try to backport the change to this branch too.

@bleskes
Contributor

bleskes commented Jun 4, 2018

@tlrx thanks. Another potential strategy is to weaken the test a bit in 5.6 so that it doesn't bump into this. I'm not sure if it's relevant, but it might be less hassle.

@ywelsch
Contributor

ywelsch commented Jun 8, 2018

@bleskes I don't think this is limited to the above test.
@tlrx I think we can just patch the place where the assertion trips (ShardSnapshotStatus) from

this.state = state;
this.reason = reason;
// If the state is failed we have to have a reason for this failure
assert state.failed() == false || reason != null;

to

this.state = state;
// If the state is failed we have to have a reason for this failure
if (state.failed() && reason == null) {
    reason = "failed";
}
this.reason = reason;
assert state.failed() == false || reason != null;

The reason is only for reporting purposes, but we need a non-null value for serialization purposes.
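
To make the proposed workaround concrete, here is a minimal, self-contained sketch of a constructor that substitutes a default reason. It is not the actual Elasticsearch class: `ShardStatusSketch`, its reduced `State` enum, and the accessor names are hypothetical stand-ins for `SnapshotsInProgress.ShardSnapshotStatus`, and only the defaulting logic mirrors the snippet above.

```java
import java.util.Objects;

// Hypothetical, simplified stand-in for SnapshotsInProgress.ShardSnapshotStatus.
// Only the reason-defaulting logic follows the patch suggested in this thread.
final class ShardStatusSketch {

    // Reduced, assumed set of states; the real enum has more values.
    enum State {
        INIT(false), SUCCESS(false), FAILED(true);

        private final boolean failed;

        State(boolean failed) {
            this.failed = failed;
        }

        boolean failed() {
            return failed;
        }
    }

    private final State state;
    private final String reason;

    ShardStatusSketch(State state, String reason) {
        this.state = Objects.requireNonNull(state, "state");
        // A failed state must carry a non-null reason, so fall back to a
        // generic one instead of tripping the assertion below.
        if (state.failed() && reason == null) {
            reason = "failed";
        }
        this.reason = reason;
        assert state.failed() == false || reason != null;
    }

    State state() {
        return state;
    }

    String reason() {
        return reason;
    }

    public static void main(String[] args) {
        ShardStatusSketch status = new ShardStatusSketch(State.FAILED, null);
        // prints "FAILED reason=failed"
        System.out.println(status.state() + " reason=" + status.reason());
    }
}
```

With this shape, a caller that passes a failed state and a null reason gets a status with the placeholder reason, while an explicit reason is kept unchanged and non-failed states may still have no reason at all.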

@jpountz jpountz added v5.6.11 and removed v5.6.10 labels Jun 13, 2018
imotov added a commit that referenced this issue Jul 9, 2018
@original-brownbear original-brownbear self-assigned this Jan 28, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 28, 2019
* Fix assertion by workaround for `5.6`
* Reenable test that tripped this assertion
* Closes elastic#31054
original-brownbear added a commit that referenced this issue Jan 30, 2019
* Fix assertion by workaround for `5.6`
* Reenable test that tripped this assertion
* Closes #31054
@original-brownbear
Member

Fixed in d61e45d
