Better handling of ancient indices #44230

Closed
DaveCTurner opened this issue Jul 11, 2019 · 5 comments
Labels
>bug, :Distributed Coordination/Cluster Coordination, team-discuss

Comments

DaveCTurner (Contributor) commented on Jul 11, 2019

If you create an index in a one-node 5.6 cluster, close it, then upgrade this node to 6.8 and again to 7.2 without removing the 5.6 index, the node fails to fail properly. Instead, it goes into a loop of repeatedly winning the election, failing the first publication, and trying again:

[2019-07-11T15:47:19,716][INFO ][o.e.c.s.MasterService    ] [node-0] elected-as-master ([1] nodes joined)[{node-0}{5B4rSbAnRTG5lhS9xWl8pw}{rBQ-DzKKRfCO77PmO_v2TQ}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 1, version: 1, reason: master node changed {previous [], current [{node-0}{5B4rSbAnRTG5lhS9xWl8pw}{rBQ-DzKKRfCO77PmO_v2TQ}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20}]}
[2019-07-11T15:47:19,728][WARN ][o.e.c.s.MasterService    ] [node-0] failing [elected-as-master ([1] nodes joined)[{node-0}{5B4rSbAnRTG5lhS9xWl8pw}{rBQ-DzKKRfCO77PmO_v2TQ}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_]]: failed to commit cluster state version [1]
org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
        at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$3.onFailure(Coordinator.java:1353) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:101) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:54) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1293) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:124) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:172) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.Publication.access$600(Publication.java:41) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.Publication$PublicationTarget$PublishResponseHandler.onFailure(Publication.java:348) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.Coordinator$6.onFailure(Coordinator.java:1080) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.PublicationTransportHandler$2$1.onFailure(PublicationTransportHandler.java:194) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:743) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39) ~[elasticsearch-7.2.0.jar:7.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: java.lang.IllegalStateException: index [i/EW17gwGGT5KefG_28xQrcQ] version not supported: 5.6.16 minimum compatible index version is: 6.0.0-beta1
        at org.elasticsearch.cluster.coordination.JoinTaskExecutor.ensureIndexCompatibility(JoinTaskExecutor.java:238) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.JoinTaskExecutor.lambda$addBuiltInJoinValidators$0(JoinTaskExecutor.java:281) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.Coordinator.lambda$handlePublishRequest$2(Coordinator.java:313) ~[elasticsearch-7.2.0.jar:7.2.0]
        at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
        at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1083) ~[?:?]
        at org.elasticsearch.cluster.coordination.Coordinator.handlePublishRequest(Coordinator.java:313) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.PublicationTransportHandler$2$1.doRun(PublicationTransportHandler.java:199) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:758) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.2.0.jar:7.2.0]
        ... 3 more

The node should fail earlier, and harder, than it does today. But it's a bit trickier than that: what does the user do once their upgraded node refuses to start? By the time we notice the bad index version we will already have constructed the NodeEnvironment, and therefore written things to disk, which means a downgrade is now unsafe. (#41731 would positively block a subsequent downgrade in a similar situation with a 6.x -> 7.x -> 8.x double-upgrade.)
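For reference, the join/publish validation that trips in the stack trace above (JoinTaskExecutor.ensureIndexCompatibility) amounts to comparing each index's creation version against the node's minimum compatible index version. The following is only a rough sketch of that check with approximate names, not the exact 7.2 sources:

    // Rough sketch of the validation behind the IllegalStateException above; names are approximate.
    static void ensureIndexCompatibility(Version nodeVersion, MetaData metaData) {
        Version minimumIndexCompatibilityVersion = nodeVersion.minimumIndexCompatibilityVersion();
        for (IndexMetaData indexMetaData : metaData) {
            // e.g. a 5.6.16 index fails this check on a 7.2 node, whose minimum is 6.0.0-beta1
            if (indexMetaData.getCreationVersion().before(minimumIndexCompatibilityVersion)) {
                throw new IllegalStateException("index " + indexMetaData.getIndex()
                    + " version not supported: " + indexMetaData.getCreationVersion()
                    + " minimum compatible index version is: " + minimumIndexCompatibilityVersion);
            }
        }
    }

Because this validation runs as part of accepting a publication, the single elected master rejects its own first cluster state, and the election loop shown above repeats.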

DaveCTurner added the >bug and :Distributed Coordination/Cluster Coordination labels on Jul 11, 2019
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jul 12, 2019
Today we fail the node at startup if it contains an index that is too old to be
compatible with the current version, unless that index is closed. If the index
is closed then the node will start up and this puts us into a bad state: the
index cannot be opened and must be reindexed using an earlier version, but we
offer no way to get that index into a node running an earlier version so that
it can be reindexed. Downgrading the node in-place is decidedly unsupported and
cannot be expected to work since the node already started up and upgraded the
rest of its metadata. Since elastic#41731 we actively reject downgrades to versions ≥
v7.2.0 too.

This commit prevents the node from starting in the presence of any too-old
indices (closed or not). In particular, it does not write any upgraded metadata
in this situation, increasing the chances an in-place downgrade might be
successful. We still actively reject the downgrade using elastic#41731, because we
wrote the node metadata file before checking the index metadata, but at least
there is a way to override this check.

Relates elastic#21830, elastic#44230
DaveCTurner (Contributor, Author) commented

On closer inspection the node only starts up at all if the ancient index is closed, and I think that's an oversight: the restriction to check only open indices was added even though the conversation on the PR in question concluded that the check should apply to all indices (another only-open-indices check in that PR was removed in a later commit). I opened #44264 to address this.
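
To illustrate the oversight: the startup-time check was, roughly speaking, gated on the index being open, so an ancient closed index did not trip it. A simplified sketch of the shape of the check before the fix (names and message are approximate, not the exact sources):

    // Approximate shape of the pre-#44264 startup check; the State.OPEN gate is what let
    // closed ancient indices through. #44264 drops that gate so the check covers all indices.
    private static void checkSupportedVersion(IndexMetaData indexMetaData, Version minimumIndexCompatibilityVersion) {
        if (indexMetaData.getState() == IndexMetaData.State.OPEN
                && indexMetaData.getCreationVersion().before(minimumIndexCompatibilityVersion)) {
            throw new IllegalStateException("index " + indexMetaData.getIndex()
                + " was created with version [" + indexMetaData.getCreationVersion()
                + "] but the minimum compatible version is [" + minimumIndexCompatibilityVersion + "]");
        }
    }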

My comment above about it being tricky still stands: we've written the node metadata file to disk before failing, so there isn't a truly supported (i.e. covered-by-tests) way forwards once you've hit this.

DaveCTurner added a commit that referenced this issue Jul 15, 2019
DaveCTurner (Contributor, Author) commented

We discussed this as a team today and are broadly OK with the (untested) expectation that each version will be able to read the node metadata file written by future versions. Thus failing a node after writing the node metadata file (but before writing anything else) is still not too late to downgrade back to a known-good version, delete the broken index, and then retry the upgrade.

This means that we think that #44264 should be backported to 7.x: it is only a breaking change to single-node clusters, and it is recoverable given the reasoning above.

DaveCTurner (Contributor, Author) commented

the (untested) expectation that each version will be able to read the node metadata file from future versions

OK, I tested this expectation and it's wrong. If you need to downgrade from ≥7.2 to <7.2 then you will need to delete the node metadata file, which will generate a new node ID, and therefore you might also need to run elasticsearch-node unsafe-bootstrap.

DaveCTurner (Contributor, Author) commented

Closing this in favour of #44624, since the fundamental problem is more general than the one described here.
