
[CI] SharedClusterSnapshotRestoreIT#testGetSnapshotsRequest failing #26480

Closed
talevy opened this issue Sep 2, 2017 · 14 comments
Assignees
Labels
>test Issues or PRs that are addressing/adding tests >test-failure Triaged test failures from CI v6.0.3 v6.2.0 v7.0.0-beta1

Comments

@talevy
Contributor

talevy commented Sep 2, 2017

I feel that I am to blame for this because of #26463, but I cannot reproduce it locally.

link:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+oracle-periodic/605/console

trace

20:51:14   1> org.elasticsearch.repositories.RepositoryException: [test-repo] could not read repository data from index blob
20:51:14   1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:648) ~[main/:?]
20:51:14   1> 	at org.elasticsearch.snapshots.SnapshotsService.createSnapshot(SnapshotsService.java:236) ~[main/:?]
20:51:14   1> 	at org.elasticsearch.action.admin.cluster.snapshots.create.TransportCreateSnapshotAction.masterOperation(TransportCreateSnapshotAction.java:82) ~[main/:?]
20:51:14   1> 	at org.elasticsearch.action.admin.cluster.snapshots.create.TransportCreateSnapshotAction.masterOperation(TransportCreateSnapshotAction.java:41) ~[main/:?]
20:51:14   1> 	at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:87) ~[main/:?]
20:51:14   1> 	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:166) ~[main/:?]
20:51:14   1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[main/:?]
20:51:14   1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
20:51:14   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_144]
20:51:14   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_144]
20:51:14   1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
20:51:14   1> Caused by: java.io.IOException: Random IOException
20:51:14   1> 	at org.elasticsearch.snapshots.mockstore.MockRepository$MockBlobStore$MockBlobContainer.maybeIOExceptionOrBlock(MockRepository.java:276) ~[test/:?]
20:51:14   1> 	at org.elasticsearch.snapshots.mockstore.MockRepository$MockBlobStore$MockBlobContainer.listBlobsByPrefix(MockRepository.java:336) ~[test/:?]
20:51:14   1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.listBlobsToGetLatestIndexId(BlobStoreRepository.java:777) ~[main/:?]
20:51:14   1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.latestIndexBlobId(BlobStoreRepository.java:755) ~[main/:?]
20:51:14   1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:607) ~[main/:?]
20:51:14   1> 	... 10 more
20:51:14   1> [2017-09-02T06:51:05,705][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] 

reproduce with:

gradle :core:integTest \
  -Dtests.seed=3BFDC6F19B3167C4 \
  -Dtests.class=org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT \
  -Dtests.method="testGetSnapshotsRequest" \
  -Dtests.security.manager=true \
  -Dtests.locale=mk \
  -Dtests.timezone=Australia/Brisbane
@talevy talevy added >test Issues or PRs that are addressing/adding tests v7.0.0 labels Sep 2, 2017
@talevy talevy self-assigned this Sep 2, 2017
@talevy talevy added the >test-failure Triaged test failures from CI label Sep 2, 2017
@andyb-elastic
Contributor

@droberts195
Contributor

The same test failed today for both 6.0 and 6.x because it stalled.

In https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+periodic/68/consoleFull we have:

14:52:26 HEARTBEAT J0 PID([email protected]): 2017-09-21T14:52:26, stalled for 1171s at: SharedClusterSnapshotRestoreIT.testGetSnapshotsRequest

Then the suite gets killed after it's been running for 20 minutes.

In https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.0+oracle-periodic/668/consoleFull we have:

08:05:48 HEARTBEAT J1 PID(8923@slave-709587): 2017-09-21T10:05:48, stalled for 1171s at: SharedClusterSnapshotRestoreIT.testGetSnapshotsRequest

The REPRO commands are:

gradle :core:integTest \
  -Dtests.seed=C3CEF376594E8DDB \
  -Dtests.class=org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT \
  -Dtests.method="testGetSnapshotsRequest" \
  -Dtests.security.manager=true \
  -Dtests.locale=el \
  -Dtests.timezone=Etc/GMT+2

and:

gradle :core:integTest \
  -Dtests.seed=719ADA6BD509A3A1 \
  -Dtests.class=org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT \
  -Dtests.method="testGetSnapshotsRequest" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-QA \
  -Dtests.timezone=Africa/Sao_Tome

These don't reproduce the problem locally for me.

(It's a different failure to the original issue description, although the same test. If you think it's not related in any way I'm happy to move the stalls into a separate issue.)

@ywelsch
Contributor

ywelsch commented Sep 22, 2017

@droberts195 The root cause for the failures that you have observed is that the following assertion tripped (it's always good to first grep for "AssertionError" in the logs):

java.lang.AssertionError
	at __randomizedtesting.SeedInfo.seed([719ADA6BD509A3A1]:0)
	at org.elasticsearch.cluster.SnapshotsInProgress$ShardSnapshotStatus.<init>(SnapshotsInProgress.java:257)
	at org.elasticsearch.snapshots.SnapshotShardsService.processIndexShardSnapshots(SnapshotShardsService.java:294)
	at org.elasticsearch.snapshots.SnapshotShardsService.applyClusterState(SnapshotShardsService.java:164)
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:495)
	at java.lang.Iterable.forEach(Iterable.java:75)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:492)
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:479)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:429)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:158)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:247)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:210)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

This in turn killed the cluster state applier thread on that node, leading to the suite timeout, as the node could no longer make meaningful progress.

@ywelsch
Contributor

ywelsch commented Sep 22, 2017

The reason for this failure is that IndexShardSnapshotStatus is not consistently updated.

For example, we have code that does

snapshotStatus.updateStage(IndexShardSnapshotStatus.Stage.FAILURE);
snapshotStatus.failure(ExceptionsHelper.detailedMessage(e));

i.e. these two are separate actions, not atomic.
Another thread comes along (the one that trips the assertion above), reads the stage, sees it as failed, and expects there to be a failure message. However, that message might not be set yet. (Note also that the failure message field is not even volatile, which is another issue.)

I think IndexShardSnapshotStatus needs to be reworked to update its state atomically (synchronized methods updating multiple fields at a time).
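The race and the proposed fix can be sketched roughly as follows. This is a simplified illustration, not the actual Elasticsearch code; the class, field, and method names are assumptions loosely modeled on IndexShardSnapshotStatus:

```java
// Sketch of the race: stage and failure are updated in two separate writes,
// so a concurrent reader can observe stage == FAILURE while failure is still
// null. Names are simplified for illustration.
public class ShardSnapshotStatusSketch {

    enum Stage { INIT, STARTED, DONE, FAILURE }

    private Stage stage = Stage.INIT;
    private String failure; // not volatile in the original either

    // Buggy pattern: two separate, non-atomic writes.
    void failNonAtomically(String message) {
        stage = Stage.FAILURE;   // a concurrent reader can see this...
        failure = message;       // ...before this second write happens
    }

    // Proposed fix: update all related fields under a single lock.
    synchronized void moveToFailed(String message) {
        stage = Stage.FAILURE;
        failure = message;
    }

    // Readers take the same lock, so they never see a torn state.
    synchronized boolean isConsistent() {
        return stage != Stage.FAILURE || failure != null;
    }

    public static void main(String[] args) {
        ShardSnapshotStatusSketch status = new ShardSnapshotStatusSketch();
        status.moveToFailed("Random IOException");
        System.out.println(status.isConsistent()); // prints "true"
    }
}
```

With the non-atomic variant, a reader scheduled between the two writes would see a FAILURE stage with a null message, which is exactly the invariant the tripped assertion checks.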

I'm reassigning this to @imotov as I don't think that @talevy identified the root cause (the logs are no longer available). The logs from the failure reported by @andyb-elastic are available, and show the same root cause as what I wrote above.

@cbuescher
Member

This one from today on the 5.6 branch also looks related:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+5.6+multijob-unix-compatibility/os=amazon/305/consoleFull

I see the same AssertionError that @ywelsch mentioned:

13:30:44    > Throwable #1: java.lang.Exception: Suite timeout exceeded (>= 1200000 msec).
13:30:44    > 	at __randomizedtesting.SeedInfo.seed([B9C54E94C747A247]:0)Throwable #2: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=15921, name=elasticsearch[node_s0][clusterService#updateTask][T#1], state=RUNNABLE, group=TGRP-SharedClusterSnapshotRestoreIT]
13:30:44    > Caused by: java.lang.AssertionError
13:30:44    > 	at __randomizedtesting.SeedInfo.seed([B9C54E94C747A247]:0)
13:30:44    > 	at org.elasticsearch.cluster.SnapshotsInProgress$ShardSnapshotStatus.<init>(SnapshotsInProgress.java:245)
13:30:44    > 	at org.elasticsearch.snapshots.SnapshotShardsService.processIndexShardSnapshots(SnapshotShardsService.java:295)
13:30:44    > 	at org.elasticsearch.snapshots.SnapshotShardsService.applyClusterState(SnapshotShardsService.java:165)
13:30:44    > 	at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:814)
13:30:44    > 	at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:768)
13:30:44    > 	at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:587)
13:30:44    > 	at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.run(ClusterService.java:2

@imotov imotov assigned tlrx and unassigned imotov Nov 3, 2017
@markharwood
Contributor

Same assertion tripped again:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+5.6+periodic/649/console

name=elasticsearch[node_sd1][clusterService#updateTask][T#1], state=RUNNABLE, group=TGRP-SharedClusterSnapshotRestoreIT]
11:47:43    > 	at __randomizedtesting.SeedInfo.seed([86E7DEFF5CB28794:767D7EBDCCB3659]:0)
11:47:43    > Caused by: java.lang.AssertionError
11:47:43    > 	at __randomizedtesting.SeedInfo.seed([86E7DEFF5CB28794]:0)
11:47:43    > 	at org.elasticsearch.cluster.SnapshotsInProgress$ShardSnapshotStatus.<init>(SnapshotsInProgress.java:245)
11:47:43    > 	at org.elasticsearch.snapshots.SnapshotShardsService.processIndexShardSnapshots(SnapshotShardsService.java:295)
11:47:43    > 	at org.elasticsearch.snapshots.SnapshotShardsService.applyClusterState(SnapshotShardsService.java:165)
11:47:43    > 	at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:814)
11:47:43    > 	at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:768)
11:47:43    > 	at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:587)
11:47:43    > 	at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.run(ClusterService.java:263)
11:47:43    > 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)

@tlrx
Member

tlrx commented Nov 13, 2017

@lcawl lcawl added v6.0.1 and removed v6.0.0 labels Nov 13, 2017
@jkakavas
Member

@cbuescher
Member

tlrx added a commit that referenced this issue Jan 9, 2018
This commit changes IndexShardSnapshotStatus so that the Stage is updated
coherently with any required information. It also provides an asCopy()
method that returns the status of an IndexShardSnapshotStatus at a given
point in time, ensuring that all information is coherent.

Closes #26480
tlrx added a commit that referenced this issue Jan 15, 2018
This commit changes IndexShardSnapshotStatus so that the Stage is updated
coherently with any required information. It also provides an asCopy()
method that returns the status of an IndexShardSnapshotStatus at a given
point in time, ensuring that all information is coherent.

Closes #26480
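The asCopy() idea from the commit message can be sketched like this. This is a simplified illustration under assumptions, not the actual Elasticsearch implementation; the Copy class and its field set are hypothetical:

```java
// Sketch of the asCopy() pattern: all mutators and the copy method are
// synchronized, so readers always get a coherent point-in-time view
// instead of reading mutable fields directly.
public class SnapshotStatusWithCopy {

    enum Stage { INIT, STARTED, DONE, FAILURE }

    // Immutable point-in-time snapshot handed out to readers.
    public static final class Copy {
        final Stage stage;
        final String failure;
        Copy(Stage stage, String failure) {
            this.stage = stage;
            this.failure = failure;
        }
    }

    private Stage stage = Stage.INIT;
    private String failure;

    // All related fields are updated under one lock.
    synchronized void moveToFailed(String message) {
        stage = Stage.FAILURE;
        failure = message;
    }

    // Taken under the same lock, so the copy can never mix a FAILURE
    // stage with a missing failure message.
    synchronized Copy asCopy() {
        return new Copy(stage, failure);
    }

    public static void main(String[] args) {
        SnapshotStatusWithCopy status = new SnapshotStatusWithCopy();
        status.moveToFailed("Random IOException");
        Copy copy = status.asCopy();
        System.out.println(copy.stage + " " + copy.failure);
    }
}
```

Handing out an immutable copy means callers like the snapshot shards service can inspect stage and failure together without holding the lock while they work.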
@cbuescher
Member

@tlrx I just saw a similar-looking failure on 6.1 today. Can you take a look to see whether it's related, and either reopen this issue or create a new one if you think it's something else?
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.1+periodic/550/console

@ywelsch
Contributor

ywelsch commented Jan 26, 2018

The PR (#28130) was not backported to 6.1 (that was a conscious decision), so we can ignore that test failure.
