
[CI] SharedClusterSnapshotRestoreIT#testGetSnapshotsRequest failing #26480

Closed
talevy opened this issue Sep 2, 2017 · 14 comments
Assignees
Labels
>test Issues or PRs that are addressing/adding tests >test-failure Triaged test failures from CI v6.0.3 v6.2.0 v7.0.0-beta1

Comments

@talevy
Contributor

talevy commented Sep 2, 2017

I feel that I am to blame for this because of #26463, but I cannot reproduce it locally.

link:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+oracle-periodic/605/console

trace

20:51:14   1> org.elasticsearch.repositories.RepositoryException: [test-repo] could not read repository data from index blob
20:51:14   1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:648) ~[main/:?]
20:51:14   1> 	at org.elasticsearch.snapshots.SnapshotsService.createSnapshot(SnapshotsService.java:236) ~[main/:?]
20:51:14   1> 	at org.elasticsearch.action.admin.cluster.snapshots.create.TransportCreateSnapshotAction.masterOperation(TransportCreateSnapshotAction.java:82) ~[main/:?]
20:51:14   1> 	at org.elasticsearch.action.admin.cluster.snapshots.create.TransportCreateSnapshotAction.masterOperation(TransportCreateSnapshotAction.java:41) ~[main/:?]
20:51:14   1> 	at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:87) ~[main/:?]
20:51:14   1> 	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:166) ~[main/:?]
20:51:14   1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[main/:?]
20:51:14   1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
20:51:14   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_144]
20:51:14   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_144]
20:51:14   1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
20:51:14   1> Caused by: java.io.IOException: Random IOException
20:51:14   1> 	at org.elasticsearch.snapshots.mockstore.MockRepository$MockBlobStore$MockBlobContainer.maybeIOExceptionOrBlock(MockRepository.java:276) ~[test/:?]
20:51:14   1> 	at org.elasticsearch.snapshots.mockstore.MockRepository$MockBlobStore$MockBlobContainer.listBlobsByPrefix(MockRepository.java:336) ~[test/:?]
20:51:14   1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.listBlobsToGetLatestIndexId(BlobStoreRepository.java:777) ~[main/:?]
20:51:14   1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.latestIndexBlobId(BlobStoreRepository.java:755) ~[main/:?]
20:51:14   1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:607) ~[main/:?]
20:51:14   1> 	... 10 more
20:51:14   1> [2017-09-02T06:51:05,705][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] 

reproduce with:

gradle :core:integTest \
  -Dtests.seed=3BFDC6F19B3167C4 \
  -Dtests.class=org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT \
  -Dtests.method="testGetSnapshotsRequest" \
  -Dtests.security.manager=true \
  -Dtests.locale=mk \
  -Dtests.timezone=Australia/Brisbane
@talevy talevy added >test Issues or PRs that are addressing/adding tests v7.0.0 labels Sep 2, 2017
@talevy talevy self-assigned this Sep 2, 2017
@talevy talevy added the >test-failure Triaged test failures from CI label Sep 2, 2017
@andyb-elastic
Contributor

@droberts195
Contributor

The same test failed today for both 6.0 and 6.x because it stalled.

In https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+periodic/68/consoleFull we have:

14:52:26 HEARTBEAT J0 PID([email protected]): 2017-09-21T14:52:26, stalled for 1171s at: SharedClusterSnapshotRestoreIT.testGetSnapshotsRequest

Then the suite gets killed after it's been running for 20 minutes.

In https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.0+oracle-periodic/668/consoleFull we have:

08:05:48 HEARTBEAT J1 PID(8923@slave-709587): 2017-09-21T10:05:48, stalled for 1171s at: SharedClusterSnapshotRestoreIT.testGetSnapshotsRequest

The REPRO commands are:

gradle :core:integTest \
  -Dtests.seed=C3CEF376594E8DDB \
  -Dtests.class=org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT \
  -Dtests.method="testGetSnapshotsRequest" \
  -Dtests.security.manager=true \
  -Dtests.locale=el \
  -Dtests.timezone=Etc/GMT+2

and:

gradle :core:integTest \
  -Dtests.seed=719ADA6BD509A3A1 \
  -Dtests.class=org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT \
  -Dtests.method="testGetSnapshotsRequest" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-QA \
  -Dtests.timezone=Africa/Sao_Tome

These don't reproduce the problem locally for me.

(It's a different failure to the original issue description, although the same test. If you think it's not related in any way I'm happy to move the stalls into a separate issue.)

@ywelsch
Contributor

ywelsch commented Sep 22, 2017

@droberts195 The root cause for the failures that you have observed is that the following assertion tripped (it's always good to first grep for "AssertionError" in the logs):

java.lang.AssertionError
	at __randomizedtesting.SeedInfo.seed([719ADA6BD509A3A1]:0)
	at org.elasticsearch.cluster.SnapshotsInProgress$ShardSnapshotStatus.<init>(SnapshotsInProgress.java:257)
	at org.elasticsearch.snapshots.SnapshotShardsService.processIndexShardSnapshots(SnapshotShardsService.java:294)
	at org.elasticsearch.snapshots.SnapshotShardsService.applyClusterState(SnapshotShardsService.java:164)
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:495)
	at java.lang.Iterable.forEach(Iterable.java:75)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:492)
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:479)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:429)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:158)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:247)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:210)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

This in turn killed the cluster state applier thread on that node, leading to the suite timeout, as the node could no longer make meaningful progress.

@ywelsch
Contributor

ywelsch commented Sep 22, 2017

The reason for this failure is that IndexShardSnapshotStatus is not consistently updated.

For example, we have code that does

snapshotStatus.updateStage(IndexShardSnapshotStatus.Stage.FAILURE);
snapshotStatus.failure(ExceptionsHelper.detailedMessage(e));

i.e. these two are separate actions, not atomic.
Another thread comes along (the one that trips the assertion above), reads the stage, sees it as failed, and expects there to be a failure message. However, that message might not be set yet. (Note also that the failure message field is not even volatile, which is another issue.)

I think IndexShardSnapshotStatus needs to be reworked to update its state atomically (synchronized methods updating multiple fields at a time).
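The race and the proposed fix can be sketched roughly as follows. This is a simplified illustration, not the actual Elasticsearch code; the class, field, and method names are assumptions loosely modeled on IndexShardSnapshotStatus:

```java
// Sketch of the race: stage and failure are updated in two separate writes,
// so a concurrent reader can observe stage == FAILURE while failure is still
// null. Names are simplified for illustration.
public class ShardSnapshotStatusSketch {

    enum Stage { INIT, STARTED, DONE, FAILURE }

    private Stage stage = Stage.INIT;
    private String failure; // not volatile in the original either

    // Buggy pattern: two separate, non-atomic writes.
    void failNonAtomically(String message) {
        stage = Stage.FAILURE;   // a concurrent reader can see this...
        failure = message;       // ...before this second write happens
    }

    // Proposed fix: update all related fields under a single lock.
    synchronized void moveToFailed(String message) {
        stage = Stage.FAILURE;
        failure = message;
    }

    // Readers take the same lock, so they never see a torn state.
    synchronized boolean isConsistent() {
        return stage != Stage.FAILURE || failure != null;
    }

    public static void main(String[] args) {
        ShardSnapshotStatusSketch status = new ShardSnapshotStatusSketch();
        status.moveToFailed("Random IOException");
        System.out.println(status.isConsistent()); // prints "true"
    }
}
```

With the non-atomic variant, a reader scheduled between the two writes would see a FAILURE stage with a null message, which is exactly the invariant the tripped assertion checks.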

I'm reassigning this to @imotov as I don't think that @talevy identified the root cause (the logs are no longer available). The logs from the failure reported by @andyb-elastic are available, and show the same root cause as what I wrote above.

@cbuescher
Member

This one from today on the 5.6 branch also looks related:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+5.6+multijob-unix-compatibility/os=amazon/305/consoleFull

I see the same AssertionError that @ywelsch mentioned:

13:30:44    > Throwable #1: java.lang.Exception: Suite timeout exceeded (>= 1200000 msec).
13:30:44    > 	at __randomizedtesting.SeedInfo.seed([B9C54E94C747A247]:0)Throwable #2: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=15921, name=elasticsearch[node_s0][clusterService#updateTask][T#1], state=RUNNABLE, group=TGRP-SharedClusterSnapshotRestoreIT]
13:30:44    > Caused by: java.lang.AssertionError
13:30:44    > 	at __randomizedtesting.SeedInfo.seed([B9C54E94C747A247]:0)
13:30:44    > 	at org.elasticsearch.cluster.SnapshotsInProgress$ShardSnapshotStatus.<init>(SnapshotsInProgress.java:245)
13:30:44    > 	at org.elasticsearch.snapshots.SnapshotShardsService.processIndexShardSnapshots(SnapshotShardsService.java:295)
13:30:44    > 	at org.elasticsearch.snapshots.SnapshotShardsService.applyClusterState(SnapshotShardsService.java:165)
13:30:44    > 	at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:814)
13:30:44    > 	at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:768)
13:30:44    > 	at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:587)
13:30:44    > 	at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.run(ClusterService.java:2

@imotov imotov assigned tlrx and unassigned imotov Nov 3, 2017
@markharwood
Contributor

Same assertion tripped again:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+5.6+periodic/649/console

name=elasticsearch[node_sd1][clusterService#updateTask][T#1], state=RUNNABLE, group=TGRP-SharedClusterSnapshotRestoreIT]
11:47:43    > 	at __randomizedtesting.SeedInfo.seed([86E7DEFF5CB28794:767D7EBDCCB3659]:0)
11:47:43    > Caused by: java.lang.AssertionError
11:47:43    > 	at __randomizedtesting.SeedInfo.seed([86E7DEFF5CB28794]:0)
11:47:43    > 	at org.elasticsearch.cluster.SnapshotsInProgress$ShardSnapshotStatus.<init>(SnapshotsInProgress.java:245)
11:47:43    > 	at org.elasticsearch.snapshots.SnapshotShardsService.processIndexShardSnapshots(SnapshotShardsService.java:295)
11:47:43    > 	at org.elasticsearch.snapshots.SnapshotShardsService.applyClusterState(SnapshotShardsService.java:165)
11:47:43    > 	at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:814)
11:47:43    > 	at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:768)
11:47:43    > 	at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:587)
11:47:43    > 	at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.run(ClusterService.java:263)
11:47:43    > 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)

@tlrx
Member

tlrx commented Nov 13, 2017

@lcawl lcawl added v6.0.1 and removed v6.0.0 labels Nov 13, 2017
@jkakavas
Member

@cbuescher
Member

tlrx added a commit that referenced this issue Jan 9, 2018
This commit changes IndexShardSnapshotStatus so that the Stage is updated
coherently with any required information. It also provides an asCopy()
method that returns the status of an IndexShardSnapshotStatus at a given
point in time, ensuring that all information is coherent.

Closes #26480
tlrx added a commit that referenced this issue Jan 15, 2018
This commit changes IndexShardSnapshotStatus so that the Stage is updated
coherently with any required information. It also provides an asCopy()
method that returns the status of an IndexShardSnapshotStatus at a given
point in time, ensuring that all information is coherent.

Closes #26480
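The asCopy() idea from the commit message can be sketched like this. This is a simplified illustration under assumptions, not the actual Elasticsearch implementation; the Copy class and its field set are hypothetical:

```java
// Sketch of the asCopy() pattern: all mutators and the copy method are
// synchronized, so readers always get a coherent point-in-time view
// instead of reading mutable fields directly.
public class SnapshotStatusWithCopy {

    enum Stage { INIT, STARTED, DONE, FAILURE }

    // Immutable point-in-time snapshot handed out to readers.
    public static final class Copy {
        final Stage stage;
        final String failure;
        Copy(Stage stage, String failure) {
            this.stage = stage;
            this.failure = failure;
        }
    }

    private Stage stage = Stage.INIT;
    private String failure;

    // All related fields are updated under one lock.
    synchronized void moveToFailed(String message) {
        stage = Stage.FAILURE;
        failure = message;
    }

    // Taken under the same lock, so the copy can never mix a FAILURE
    // stage with a missing failure message.
    synchronized Copy asCopy() {
        return new Copy(stage, failure);
    }

    public static void main(String[] args) {
        SnapshotStatusWithCopy status = new SnapshotStatusWithCopy();
        status.moveToFailed("Random IOException");
        Copy copy = status.asCopy();
        System.out.println(copy.stage + " " + copy.failure);
    }
}
```

Handing out an immutable copy means callers like the snapshot shards service can inspect stage and failure together without holding the lock while they work.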
@cbuescher
Member

@tlrx I just saw a similar-looking failure on 6.1 today. Can you take a look to see whether it's related, and either reopen this issue or create a new one if you think it's something else?
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.1+periodic/550/console

@ywelsch
Contributor

ywelsch commented Jan 26, 2018

The PR (#28130) was not backported to 6.1 (that was a conscious decision), so we can ignore that test failure.
