DedicatedClusterSnapshotRestoreIT#testRestoreShrinkIndex fails #38845

Closed
DaveCTurner opened this issue Feb 13, 2019 · 5 comments · Fixed by #38891
Labels: :Distributed Coordination/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), >test-failure (triaged test failures from CI)

@DaveCTurner (Contributor)

Possibly relates to #38256; at least it's the same test, which still seems to be failing on latest master (cacf81a).

The actual issue seems to be here:

  1> java.lang.NullPointerException: null
  1>    at org.elasticsearch.common.settings.IndexScopedSettings.<init>(IndexScopedSettings.java:187) ~[main/:?]
  1>    at org.elasticsearch.common.settings.IndexScopedSettings.copy(IndexScopedSettings.java:191) ~[main/:?]
  1>    at org.elasticsearch.index.IndexSettings.<init>(IndexSettings.java:434) ~[main/:?]
  1>    at org.elasticsearch.index.IndexSettings.<init>(IndexSettings.java:423) ~[main/:?]
  1>    at org.elasticsearch.indices.IndicesService.buildIndexSettings(IndicesService.java:940) ~[main/:?]
  1>    at org.elasticsearch.indices.IndicesService.verifyIndexIsDeleted(IndicesService.java:882) ~[main/:?]
  1>    at org.elasticsearch.indices.cluster.IndicesClusterStateService.deleteIndices(IndicesClusterStateService.java:338) ~[main/:?]
  1>    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:258) ~[main/:?]
  1>    at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:472) ~[main/:?]
  1>    at java.lang.Iterable.forEach(Iterable.java:75) ~[?:?]
  1>    at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:470) ~[main/:?]
  1>    at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:459) ~[main/:?]
  1>    at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:413) [main/:?]
  1>    at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:164) [main/:?]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [main/:?]
  1>    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [main/:?]
  1>    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [main/:?]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
  1>    at java.lang.Thread.run(Thread.java:834) [?:?]

The trouble, I think, is that there's a race in which the index directory exists without any metadata, so we get null here rather than an exception:

metaData = metaStateService.loadIndexState(index);

I think we should throw something there rather than passing null to `buildIndexSettings(metaData)` a few lines further down.
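
For illustration, a minimal sketch of the kind of guard I have in mind (`metaStateService.loadIndexState` and `buildIndexSettings` are the names from the stack trace above; the `IllegalStateException` is just a placeholder, the actual fix may well throw something else):

// Hypothetical sketch, not the actual fix: fail fast when the index directory
// exists but its metadata cannot be loaded, instead of letting
// buildIndexSettings(null) blow up with the NPE above.
final IndexMetaData metaData = metaStateService.loadIndexState(index);
if (metaData == null) {
    // The race described above: the directory exists, but no metadata has
    // been written for it yet (or it has already been removed).
    throw new IllegalStateException(
        "no index metadata found for " + index + " although the index directory exists");
}
return buildIndexSettings(metaData);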

@DaveCTurner added the :Distributed Coordination/Snapshot/Restore and >test-failure labels on Feb 13, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

@DaveCTurner (Contributor, Author)

Logs from a recent failure:

testoutput-stderr.log.gz
testoutput-stdout.log.gz

@original-brownbear self-assigned this on Feb 13, 2019
@original-brownbear (Member) commented Feb 14, 2019

The trouble, I think, is that there's a race in which the index directory exists without any metadata, so we get null here rather than an exception:

Yeah, this seems to be part of it. But I'm also trying to understand why we fail to delete the index in the first place, with a bunch of errors like:

access denied: /tmp/org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT_8F6CAD63F58B768B-009/tempDir-002/d0/nodes/1/indices/8FF6SE6JTC2tluMQ5aiTTw/4/index/_1.cfs

The error we get on delete is an access-denied failure that results from the Lucene mock directory still (wrongly?) thinking that some files due for deletion are open. The issue looks somewhat similar to this one. I can delete the files the directory wrapper refuses to remove straight from the file system, so there doesn't seem to be any actual FD open for them. Looking into that now :)

Update: it seems the race is expected during deletes and we should just handle it better; on that now.
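
To illustrate the semantics involved: the mock layer tracks open handles and, like Windows, refuses to delete a file while a handle is still registered. A toy sketch of that behaviour (this is not the Lucene test-framework API, just an illustration of why a close/delete race shows up as "access denied"):

import java.nio.file.AccessDeniedException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of open-handle tracking; not Lucene's actual implementation.
class HandleTrackingStore {
    private final Set<String> openFiles = ConcurrentHashMap.newKeySet();

    void open(String name)  { openFiles.add(name); }
    void close(String name) { openFiles.remove(name); }

    void delete(String name) throws AccessDeniedException {
        // Windows-like semantics: a file with a registered open handle cannot
        // be deleted. If a close() races with a delete(), the delete still
        // sees the handle and fails with the "access denied" seen above.
        if (openFiles.contains(name)) {
            throw new AccessDeniedException(name);
        }
        // ... actually remove the file from disk here ...
    }
}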

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Feb 14, 2019
* We should treat a `null` return for the metadata as equal to an error and break out
  * Added the check at this level even though it required nested `throw`, because adding it further downstream would impact other functionality
* Closes elastic#38845
@droberts195 (Contributor) commented Feb 14, 2019

The same thing happened in 7.x in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+intake/100/console

  1> [2019-02-14T03:41:40,299][WARN ][o.e.c.s.ClusterApplierService] [node_td2] failed to apply updated cluster state in [0s]:
  1> version [22], uuid [70l8KasyQGyyxCgAvQnikQ], source [ApplyCommitRequest{term=1, version=22, sourceNode={node_tm0}{RwJasFHGSY-MWz-3pE8GoA}{aerWgkWqQTilEHyp07ULcA}{127.0.0.1}{127.0.0.1:37020}}]
  1> java.lang.NullPointerException: null
  1> 	at org.elasticsearch.common.settings.IndexScopedSettings.<init>(IndexScopedSettings.java:187) ~[main/:?]
  1> 	at org.elasticsearch.common.settings.IndexScopedSettings.copy(IndexScopedSettings.java:191) ~[main/:?]
  1> 	at org.elasticsearch.index.IndexSettings.<init>(IndexSettings.java:434) ~[main/:?]
  1> 	at org.elasticsearch.index.IndexSettings.<init>(IndexSettings.java:423) ~[main/:?]
  1> 	at org.elasticsearch.indices.IndicesService.buildIndexSettings(IndicesService.java:940) ~[main/:?]
  1> 	at org.elasticsearch.indices.IndicesService.verifyIndexIsDeleted(IndicesService.java:882) ~[main/:?]
  1> 	at org.elasticsearch.indices.cluster.IndicesClusterStateService.deleteIndices(IndicesClusterStateService.java:338) ~[main/:?]
  1> 	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:258) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:472) ~[main/:?]
  1> 	at java.lang.Iterable.forEach(Iterable.java:75) ~[?:1.8.0_202]
  1> 	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:470) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:459) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:413) [main/:?]
  1> 	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:164) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]

The repro command is:

./gradlew :server:integTest \
  -Dtests.seed=B07B819FA25ED3BA \
  -Dtests.class=org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT \
  -Dtests.method="testRestoreShrinkIndex" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-AE \
  -Dtests.timezone=MST7MDT \
  -Dcompiler.java=11 \
  -Druntime.java=8

This reproduced locally for me on a CentOS 7 server:

ERROR   99.0s | DedicatedClusterSnapshotRestoreIT.testRestoreShrinkIndex <<< FAILURES!
   > Throwable #1: MasterNotDiscoveredException[null]
   >    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:259)
   >    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322)
   >    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249)
   >    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:549)
   >    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681)
   >    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   >    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   >    at java.lang.Thread.run(Thread.java:745)
   > Throwable #2: MasterNotDiscoveredException[null]
   >    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:259)
   >    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322)
   >    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249)
   >    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:549)
   >    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681)
   >    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   >    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   >    at java.lang.Thread.run(Thread.java:745)

I muted the test in 7.x in 6ea483a.
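
(For reference, muting in the Elasticsearch test suite is usually done with the test framework's `@AwaitsFix` annotation; presumably 6ea483a looks roughly like this, though the actual commit may differ:)

// Presumed shape of the mute, based on the usual convention.
@AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/38845")
public void testRestoreShrinkIndex() throws Exception {
    // ...
}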

@original-brownbear (Member)

Fix incoming in #38891 :)

jkakavas pushed a commit to jkakavas/elasticsearch that referenced this issue Feb 20, 2019
jkakavas added a commit that referenced this issue Feb 20, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Apr 9, 2019
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Jul 31, 2019
The issue mentioned (elastic#38845) seems to have been closed with elastic#38891, so the test can be re-activated.
cbuescher pushed a commit that referenced this issue Jul 31, 2019
The issue mentioned (#38845) seems to have been closed with #38891, so the test can be re-activated.