DedicatedClusterSnapshotRestoreIT#testRestoreShrinkIndex fails #38845

Closed
DaveCTurner opened this issue Feb 13, 2019 · 5 comments · Fixed by #38891
Labels: :Distributed Coordination/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), >test-failure (triaged test failures from CI)

@DaveCTurner (Contributor)

Possibly relates to #38256; at least it's the same test, which still seems to be failing on latest master (cacf81a).

The actual issue seems to be here:

  1> java.lang.NullPointerException: null
  1>    at org.elasticsearch.common.settings.IndexScopedSettings.<init>(IndexScopedSettings.java:187) ~[main/:?]
  1>    at org.elasticsearch.common.settings.IndexScopedSettings.copy(IndexScopedSettings.java:191) ~[main/:?]
  1>    at org.elasticsearch.index.IndexSettings.<init>(IndexSettings.java:434) ~[main/:?]
  1>    at org.elasticsearch.index.IndexSettings.<init>(IndexSettings.java:423) ~[main/:?]
  1>    at org.elasticsearch.indices.IndicesService.buildIndexSettings(IndicesService.java:940) ~[main/:?]
  1>    at org.elasticsearch.indices.IndicesService.verifyIndexIsDeleted(IndicesService.java:882) ~[main/:?]
  1>    at org.elasticsearch.indices.cluster.IndicesClusterStateService.deleteIndices(IndicesClusterStateService.java:338) ~[main/:?]
  1>    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:258) ~[main/:?]
  1>    at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:472) ~[main/:?]
  1>    at java.lang.Iterable.forEach(Iterable.java:75) ~[?:?]
  1>    at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:470) ~[main/:?]
  1>    at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:459) ~[main/:?]
  1>    at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:413) [main/:?]
  1>    at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:164) [main/:?]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [main/:?]
  1>    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [main/:?]
  1>    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [main/:?]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
  1>    at java.lang.Thread.run(Thread.java:834) [?:?]

The trouble, I think, is that there's a race in which the index directory exists without any metadata, so we get null here rather than an exception:

metaData = metaStateService.loadIndexState(index);

I think we should throw something there rather than passing null to `buildIndexSettings(metaData)` a few lines further down.
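
For illustration, a minimal sketch of the kind of guard I have in mind (`metaStateService.loadIndexState` and `buildIndexSettings` are the names from the stack trace above; the `IllegalStateException` is just a placeholder, the actual fix may well throw something else):

// Hypothetical sketch, not the actual fix: fail fast when the index directory
// exists but its metadata cannot be loaded, instead of letting
// buildIndexSettings(null) blow up with the NPE above.
final IndexMetaData metaData = metaStateService.loadIndexState(index);
if (metaData == null) {
    // The race described above: the directory exists, but no metadata has
    // been written for it yet (or it has already been removed).
    throw new IllegalStateException(
        "no index metadata found for " + index + " although the index directory exists");
}
return buildIndexSettings(metaData);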

@DaveCTurner added the :Distributed Coordination/Snapshot/Restore and >test-failure labels on Feb 13, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

@DaveCTurner (Contributor, Author)

Logs from a recent failure:

testoutput-stderr.log.gz
testoutput-stdout.log.gz

@original-brownbear self-assigned this on Feb 13, 2019
@original-brownbear (Member) commented Feb 14, 2019

The trouble, I think, is that there's a race in which the index directory exists without any metadata, so we get null here rather than an exception:

Yeah, this seems to be part of it. But I'm also trying to understand why we fail to delete the index in the first place, with a bunch of errors like:

access denied: /tmp/org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT_8F6CAD63F58B768B-009/tempDir-002/d0/nodes/1/indices/8FF6SE6JTC2tluMQ5aiTTw/4/index/_1.cfs

The error we get on delete is an access-denied failure that results from the Lucene mock directory still (wrongly?) thinking that some files due for deletion are open. The issue looks somewhat similar to this one. I can delete the files the directory wrapper refuses to remove straight from the file system, so there doesn't seem to be any actual FD open for them. Looking into that now :)

Update: it seems the race is expected during deletes and we should just handle it better; on that now.
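
To illustrate the semantics involved: the mock layer tracks open handles and, like Windows, refuses to delete a file while a handle is still registered. A toy sketch of that behaviour (this is not the Lucene test-framework API, just an illustration of why a close/delete race shows up as "access denied"):

import java.nio.file.AccessDeniedException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of open-handle tracking; not Lucene's actual implementation.
class HandleTrackingStore {
    private final Set<String> openFiles = ConcurrentHashMap.newKeySet();

    void open(String name)  { openFiles.add(name); }
    void close(String name) { openFiles.remove(name); }

    void delete(String name) throws AccessDeniedException {
        // Windows-like semantics: a file with a registered open handle cannot
        // be deleted. If a close() races with a delete(), the delete still
        // sees the handle and fails with the "access denied" seen above.
        if (openFiles.contains(name)) {
            throw new AccessDeniedException(name);
        }
        // ... actually remove the file from disk here ...
    }
}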

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Feb 14, 2019
* We should treat a `null` return for the metadata as equal to an error and break out
  * Added the check at this level even though it required nested `throw`, because adding it further downstream would impact other functionality
* Closes elastic#38845
@droberts195 (Contributor) commented Feb 14, 2019

The same thing happened in 7.x in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+intake/100/console

  1> [2019-02-14T03:41:40,299][WARN ][o.e.c.s.ClusterApplierService] [node_td2] failed to apply updated cluster state in [0s]:
  1> version [22], uuid [70l8KasyQGyyxCgAvQnikQ], source [ApplyCommitRequest{term=1, version=22, sourceNode={node_tm0}{RwJasFHGSY-MWz-3pE8GoA}{aerWgkWqQTilEHyp07ULcA}{127.0.0.1}{127.0.0.1:37020}}]
  1> java.lang.NullPointerException: null
  1> 	at org.elasticsearch.common.settings.IndexScopedSettings.<init>(IndexScopedSettings.java:187) ~[main/:?]
  1> 	at org.elasticsearch.common.settings.IndexScopedSettings.copy(IndexScopedSettings.java:191) ~[main/:?]
  1> 	at org.elasticsearch.index.IndexSettings.<init>(IndexSettings.java:434) ~[main/:?]
  1> 	at org.elasticsearch.index.IndexSettings.<init>(IndexSettings.java:423) ~[main/:?]
  1> 	at org.elasticsearch.indices.IndicesService.buildIndexSettings(IndicesService.java:940) ~[main/:?]
  1> 	at org.elasticsearch.indices.IndicesService.verifyIndexIsDeleted(IndicesService.java:882) ~[main/:?]
  1> 	at org.elasticsearch.indices.cluster.IndicesClusterStateService.deleteIndices(IndicesClusterStateService.java:338) ~[main/:?]
  1> 	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:258) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:472) ~[main/:?]
  1> 	at java.lang.Iterable.forEach(Iterable.java:75) ~[?:1.8.0_202]
  1> 	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:470) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:459) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:413) [main/:?]
  1> 	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:164) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]

The repro command is:

./gradlew :server:integTest \
  -Dtests.seed=B07B819FA25ED3BA \
  -Dtests.class=org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT \
  -Dtests.method="testRestoreShrinkIndex" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-AE \
  -Dtests.timezone=MST7MDT \
  -Dcompiler.java=11 \
  -Druntime.java=8

This reproduced locally for me on a CentOS 7 server:

ERROR   99.0s | DedicatedClusterSnapshotRestoreIT.testRestoreShrinkIndex <<< FAILURES!
   > Throwable #1: MasterNotDiscoveredException[null]
   >    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:259)
   >    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322)
   >    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249)
   >    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:549)
   >    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681)
   >    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   >    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   >    at java.lang.Thread.run(Thread.java:745)
   > Throwable #2: MasterNotDiscoveredException[null]
   >    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:259)
   >    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322)
   >    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249)
   >    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:549)
   >    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681)
   >    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   >    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   >    at java.lang.Thread.run(Thread.java:745)

I muted the test in 7.x in 6ea483a.
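
(For reference, muting in the Elasticsearch test suite is usually done with the test framework's `@AwaitsFix` annotation; presumably 6ea483a looks roughly like this, though the actual commit may differ:)

// Presumed shape of the mute, based on the usual convention.
@AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/38845")
public void testRestoreShrinkIndex() throws Exception {
    // ...
}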

@original-brownbear (Member)

Fix incoming in #38891 :)

jkakavas pushed a commit to jkakavas/elasticsearch that referenced this issue Feb 20, 2019
jkakavas added a commit that referenced this issue Feb 20, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Apr 9, 2019
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Jul 31, 2019
The issue mentioned (elastic#38845) seems to have been closed with elastic#38891, so the test can be re-activated.
cbuescher pushed a commit that referenced this issue Jul 31, 2019
The issue mentioned (#38845) seems to have been closed with #38891, so the test can be re-activated.