SharedClusterSnapshotRestoreIT#testGetSnapshotsRequest fails #31054

danielmitterdorfer · 2018-06-04T08:12:29Z

CI link: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+5.6+multijob-unix-compatibility/os=oraclelinux/1127/consoleFull

REPRODUCE WITH: ./gradlew :core:integTest \
  -Dtests.seed=683E87E60D1BDBA \
  -Dtests.class=org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT \
  -Dtests.method="testGetSnapshotsRequest" \
  -Dtests.security.manager=true \
  -Dtests.locale=hr \
  -Dtests.timezone=Atlantic/Bermuda

(does not reproduce locally)

Failure output:

[...]
07:06:40   1> [2018-06-04T04:06:12,400][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] --> make sure duplicates are not returned in the response
07:06:40   1> [2018-06-04T04:06:12,633][INFO ][o.e.s.SnapshotShardsService] [node_s1] snapshot [test-repo:softezhl/dyUGda5gTdyomhUCMkNYWQ] is done
07:06:40   1> [2018-06-04T04:06:12,705][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [SharedClusterSnapshotRestoreIT#testGetSnapshotsRequest]: cleaning up after test
07:06:40   1> [2018-06-04T04:06:12,708][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node_s1] [test-idx/MA3e2loDQ3Cz2tYHeU4aew] deleting index
[...]
07:06:40   1> [2018-06-04T04:06:12,762][INFO ][o.e.r.RepositoriesService] [node_s1] delete repository [test-repo]
07:06:40   1> [2018-06-04T04:06:12,764][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [SharedClusterSnapshotRestoreIT#testGetSnapshotsRequest]: cleaned up after test
07:06:40   1> [2018-06-04T04:06:12,764][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testGetSnapshotsRequest]: after test
07:06:40 ERROR   31.1s J1 | SharedClusterSnapshotRestoreIT.testGetSnapshotsRequest <<< FAILURES!
07:06:40    > Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=10538, name=elasticsearch[node_s0][clusterService#updateTask][T#1], state=RUNNABLE, group=TGRP-SharedClusterSnapshotRestoreIT]
07:06:40    > 	at __randomizedtesting.SeedInfo.seed([683E87E60D1BDBA:8703E16AE0A80C77]:0)
07:06:40    > Caused by: java.lang.AssertionError
07:06:40    > 	at __randomizedtesting.SeedInfo.seed([683E87E60D1BDBA]:0)
07:06:40    > 	at org.elasticsearch.cluster.SnapshotsInProgress$ShardSnapshotStatus.<init>(SnapshotsInProgress.java:245)
07:06:40    > 	at org.elasticsearch.snapshots.SnapshotShardsService.notifyFailedSnapshotShard(SnapshotShardsService.java:546)
07:06:40    > 	at org.elasticsearch.snapshots.SnapshotShardsService.processIndexShardSnapshots(SnapshotShardsService.java:296)
07:06:40    > 	at org.elasticsearch.snapshots.SnapshotShardsService.applyClusterState(SnapshotShardsService.java:165)
07:06:40    > 	at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:814)
07:06:40    > 	at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:768)
07:06:40    > 	at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:587)
07:06:40    > 	at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.run(ClusterService.java:263)
07:06:40    > 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
07:06:40    > 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
07:06:40    > 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575)
07:06:40    > 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:247)
07:06:40    > 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:210)
07:06:40    > 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
07:06:40    > 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
07:06:40    > 	at java.lang.Thread.run(Thread.java:748)

The assertion in ShardSnapshotStatus indicates that we tried to initialize it with a failure but did not provide a reason.

elasticmachine · 2018-06-04T08:12:30Z

Pinging @elastic/es-distributed

bleskes · 2018-06-04T10:46:41Z

@tlrx does this ring a bell to you?

tlrx · 2018-06-04T11:12:36Z

@bleskes Yes, this is the same issue as #26480, which was fixed in 6.0 and later by #28130. We decided not to backport the fix to 5.x because it was nontrivial, but it's annoying that this failure comes up from time to time on 5.6. So maybe I should try to backport the change to this branch too.

bleskes · 2018-06-04T12:36:22Z

@tlrx thanks. Another potential strategy is to weaken the test a bit in 5.6 so that it does not bump into this. I'm not sure if it's relevant, but it might be less hassle.

ywelsch · 2018-06-08T09:12:13Z

@bleskes this is not limited to the above test I think
@tlrx I think we can just patch the place where the assertion trips (ShardSnapshotStatus) from

this.state = state;
this.reason = reason;
// If the state is failed we have to have a reason for this failure
assert state.failed() == false || reason != null;

to

this.state = state;
// If the state is failed we have to have a reason for this failure
if (state.failed() && reason == null) {
    reason = "failed";
}
this.reason = reason;
assert state.failed() == false || reason != null;

The reason is only for reporting purposes, but we need a non-null value for serialization purposes.
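
A minimal, self-contained sketch of how the patched constructor would behave, assuming a simplified stand-in for the real class. The name `ShardSnapshotStatusSketch`, the reduced `State` enum, and the `main` method are hypothetical and only illustrate the idea; this is not the actual Elasticsearch source.

```java
// Simplified stand-in for SnapshotsInProgress.ShardSnapshotStatus (hypothetical, not the real class).
public final class ShardSnapshotStatusSketch {

    // Reduced state enum (assumption): it only models whether a state counts as failed.
    public enum State {
        INIT(false), SUCCESS(false), FAILED(true);

        private final boolean failed;

        State(boolean failed) {
            this.failed = failed;
        }

        public boolean failed() {
            return failed;
        }
    }

    private final String nodeId;
    private final State state;
    private final String reason;

    public ShardSnapshotStatusSketch(String nodeId, State state, String reason) {
        this.nodeId = nodeId;
        this.state = state;
        // A failed state must carry a non-null reason, so substitute a placeholder
        // instead of tripping the assertion when the caller passed null.
        if (state.failed() && reason == null) {
            reason = "failed";
        }
        this.reason = reason;
        assert state.failed() == false || reason != null;
    }

    public String nodeId() {
        return nodeId;
    }

    public State state() {
        return state;
    }

    public String reason() {
        return reason;
    }

    public static void main(String[] args) {
        // Failed status created without a reason, as in the stack trace above.
        ShardSnapshotStatusSketch status =
            new ShardSnapshotStatusSketch("node_s0", State.FAILED, null);
        System.out.println(status.state() + " reason=" + status.reason()); // prints "FAILED reason=failed"
    }
}
```

With this shape, a caller that supplies an explicit reason keeps it unchanged, and non-failed states may still carry a null reason.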

imotov · 2018-07-09T15:58:39Z

Failed in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+5.6+matrix-java-periodic/ES_BUILD_JAVA=java8,ES_RUNTIME_JAVA=java8,nodes=virtual&&linux/164/console

original-brownbear · 2019-01-30T12:07:56Z

Fixed in d61e45d

danielmitterdorfer added the :Distributed Coordination/Snapshot/Restore, >test-failure, and v5.6.10 labels Jun 4, 2018

jpountz added v5.6.11 and removed v5.6.10 labels Jun 13, 2018

imotov added a commit that referenced this issue Jul 9, 2018

Mute SharedClusterSnapshotRestoreIT.testSnapshotRequest

c363c74

Tracked by #31054

original-brownbear self-assigned this Jan 28, 2019

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 28, 2019

Fix Failing Assertion in SnapshotsInProgress

7d43d6a

* Fix assertion by workaround for `5.6`
* Reenable test that tripped this assertion
* Closes elastic#31054

original-brownbear mentioned this issue Jan 28, 2019

Fix Failing Assertion in SnapshotsInProgress #37922

Merged

original-brownbear added a commit that referenced this issue Jan 30, 2019

Fix Failing Assertion in SnapshotsInProgress (#37922)

d61e45d

* Fix assertion by workaround for `5.6`
* Reenable test that tripped this assertion
* Closes #31054

original-brownbear closed this as completed Jan 30, 2019
