Better handling of ancient indices #44230

Closed
DaveCTurner opened this issue Jul 11, 2019 · 5 comments
Labels
>bug, :Distributed Coordination/Cluster Coordination, team-discuss

Comments

DaveCTurner (Contributor) commented on Jul 11, 2019

If you create an index in a one-node 5.6 cluster, close it, then upgrade this node to 6.8 and again to 7.2 without removing the 5.6 index, the node fails to fail properly. Instead, it goes into a loop of repeatedly winning the election, failing the first publication, and trying again:

[2019-07-11T15:47:19,716][INFO ][o.e.c.s.MasterService    ] [node-0] elected-as-master ([1] nodes joined)[{node-0}{5B4rSbAnRTG5lhS9xWl8pw}{rBQ-DzKKRfCO77PmO_v2TQ}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 1, version: 1, reason: master node changed {previous [], current [{node-0}{5B4rSbAnRTG5lhS9xWl8pw}{rBQ-DzKKRfCO77PmO_v2TQ}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20}]}
[2019-07-11T15:47:19,728][WARN ][o.e.c.s.MasterService    ] [node-0] failing [elected-as-master ([1] nodes joined)[{node-0}{5B4rSbAnRTG5lhS9xWl8pw}{rBQ-DzKKRfCO77PmO_v2TQ}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_]]: failed to commit cluster state version [1]
org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
        at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$3.onFailure(Coordinator.java:1353) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:101) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:54) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1293) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:124) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:172) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.Publication.access$600(Publication.java:41) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.Publication$PublicationTarget$PublishResponseHandler.onFailure(Publication.java:348) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.Coordinator$6.onFailure(Coordinator.java:1080) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.PublicationTransportHandler$2$1.onFailure(PublicationTransportHandler.java:194) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:743) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39) ~[elasticsearch-7.2.0.jar:7.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: java.lang.IllegalStateException: index [i/EW17gwGGT5KefG_28xQrcQ] version not supported: 5.6.16 minimum compatible index version is: 6.0.0-beta1
        at org.elasticsearch.cluster.coordination.JoinTaskExecutor.ensureIndexCompatibility(JoinTaskExecutor.java:238) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.JoinTaskExecutor.lambda$addBuiltInJoinValidators$0(JoinTaskExecutor.java:281) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.Coordinator.lambda$handlePublishRequest$2(Coordinator.java:313) ~[elasticsearch-7.2.0.jar:7.2.0]
        at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
        at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1083) ~[?:?]
        at org.elasticsearch.cluster.coordination.Coordinator.handlePublishRequest(Coordinator.java:313) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.cluster.coordination.PublicationTransportHandler$2$1.doRun(PublicationTransportHandler.java:199) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:758) ~[elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.2.0.jar:7.2.0]
        ... 3 more

The node should fail earlier, and harder, than it does today. But it's a bit trickier than that: what does the user do once their upgraded node refuses to start? By the time we notice the bad index version we will already have constructed the NodeEnvironment, and therefore written things to disk, which means a downgrade is now unsafe. (#41731 would positively block a subsequent downgrade in a similar situation with a 6.x -> 7.x -> 8.x double-upgrade.)
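For reference, the join/publish validation that trips in the stack trace above (JoinTaskExecutor.ensureIndexCompatibility) amounts to comparing each index's creation version against the node's minimum compatible index version. The following is only a rough sketch of that check with approximate names, not the exact 7.2 sources:

    // Rough sketch of the validation behind the IllegalStateException above; names are approximate.
    static void ensureIndexCompatibility(Version nodeVersion, MetaData metaData) {
        Version minimumIndexCompatibilityVersion = nodeVersion.minimumIndexCompatibilityVersion();
        for (IndexMetaData indexMetaData : metaData) {
            // e.g. a 5.6.16 index fails this check on a 7.2 node, whose minimum is 6.0.0-beta1
            if (indexMetaData.getCreationVersion().before(minimumIndexCompatibilityVersion)) {
                throw new IllegalStateException("index " + indexMetaData.getIndex()
                    + " version not supported: " + indexMetaData.getCreationVersion()
                    + " minimum compatible index version is: " + minimumIndexCompatibilityVersion);
            }
        }
    }

Because this validation runs as part of accepting a publication, the single elected master rejects its own first cluster state, and the election loop shown above repeats.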

DaveCTurner added the >bug and :Distributed Coordination/Cluster Coordination labels on Jul 11, 2019
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jul 12, 2019
Today we fail the node at startup if it contains an index that is too old to be
compatible with the current version, unless that index is closed. If the index
is closed then the node will start up and this puts us into a bad state: the
index cannot be opened and must be reindexed using an earlier version, but we
offer no way to get that index into a node running an earlier version so that
it can be reindexed. Downgrading the node in-place is decidedly unsupported and
cannot be expected to work since the node already started up and upgraded the
rest of its metadata. Since elastic#41731 we actively reject downgrades to versions ≥
v7.2.0 too.

This commit prevents the node from starting in the presence of any too-old
indices (closed or not). In particular, it does not write any upgraded metadata
in this situation, increasing the chances an in-place downgrade might be
successful. We still actively reject the downgrade using elastic#41731, because we
wrote the node metadata file before checking the index metadata, but at least
there is a way to override this check.

Relates elastic#21830, elastic#44230
DaveCTurner (Contributor, Author) commented

On closer inspection the node only starts up at all if the ancient index is closed, and I think that's an oversight: the restriction to check only open indices was added even though the conversation on the PR in question concluded that the check should apply to all indices (another only-open-indices check in that PR was removed in a later commit). I opened #44264 to address this.
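
To illustrate the oversight: the startup-time check was, roughly speaking, gated on the index being open, so an ancient closed index did not trip it. A simplified sketch of the shape of the check before the fix (names and message are approximate, not the exact sources):

    // Approximate shape of the pre-#44264 startup check; the State.OPEN gate is what let
    // closed ancient indices through. #44264 drops that gate so the check covers all indices.
    private static void checkSupportedVersion(IndexMetaData indexMetaData, Version minimumIndexCompatibilityVersion) {
        if (indexMetaData.getState() == IndexMetaData.State.OPEN
                && indexMetaData.getCreationVersion().before(minimumIndexCompatibilityVersion)) {
            throw new IllegalStateException("index " + indexMetaData.getIndex()
                + " was created with version [" + indexMetaData.getCreationVersion()
                + "] but the minimum compatible version is [" + minimumIndexCompatibilityVersion + "]");
        }
    }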

My comment above about it being tricky still stands: we've written the node metadata file to disk before failing, so there isn't a truly supported (i.e. covered-by-tests) way forwards once you've hit this.

DaveCTurner added a commit that referenced this issue Jul 15, 2019
DaveCTurner (Contributor, Author) commented

We discussed this as a team today and are broadly OK with the (untested) expectation that each version will be able to read the node metadata file written by future versions. Thus failing a node after writing the node metadata file (but before writing anything else) is still not too late to downgrade back to a known-good version, delete the broken index, and then retry the upgrade.

This means that we think that #44264 should be backported to 7.x: it is only a breaking change to single-node clusters, and it is recoverable given the reasoning above.

DaveCTurner (Contributor, Author) commented

the (untested) expectation that each version will be able to read the node metadata file from future versions

OK, I tested this expectation and it's wrong. If you need to downgrade from ≥7.2 to <7.2 then you will need to delete the node metadata file, which will generate a new node ID, and therefore you might also need to run elasticsearch-node unsafe-bootstrap.

DaveCTurner (Contributor, Author) commented

Closing this in favour of #44624, since the fundamental problem is more general than the one described here.
