Better handling of ancient indices #44230
Comments
Pinging @elastic/es-distributed
Today we fail the node at startup if it contains an index that is too old to be compatible with the current version, unless that index is closed. If the index is closed then the node will start up and this puts us into a bad state: the index cannot be opened and must be reindexed using an earlier version, but we offer no way to get that index into a node running an earlier version so that it can be reindexed. Downgrading the node in-place is decidedly unsupported and cannot be expected to work since the node already started up and upgraded the rest of its metadata. Since elastic#41731 we actively reject downgrades to versions ≥ v7.2.0 too. This commit prevents the node from starting in the presence of any too-old indices (closed or not). In particular, it does not write any upgraded metadata in this situation, increasing the chances an in-place downgrade might be successful. We still actively reject the downgrade using elastic#41731, because we wrote the node metadata file before checking the index metadata, but at least there is a way to override this check. Relates elastic#21830, elastic#44230
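The check described in that commit can be sketched roughly as follows. This is not the actual Elasticsearch code; the class, record, and method names are invented for illustration, and versions are simplified to integer major versions rather than real `Version` objects. The key point it models is that the closed/open state of an index is deliberately ignored, so the node refuses to start before writing any upgraded metadata:

```java
import java.util.List;

// Illustrative sketch only -- names and the integer-version simplification
// are made up; real Elasticsearch uses Version and IndexMetaData.
class AncientIndexCheck {
    record IndexMeta(String name, int createdMajorVersion, boolean closed) {}

    // Minimum index-compatibility rule: a node can read indices created in
    // the previous major version or later.
    static void ensureNoAncientIndices(List<IndexMeta> indices, int currentMajorVersion) {
        int minimumCompatible = currentMajorVersion - 1;
        for (IndexMeta index : indices) {
            // The closed flag is deliberately not consulted: the fix applies
            // this check to closed indices as well as open ones.
            if (index.createdMajorVersion() < minimumCompatible) {
                throw new IllegalStateException(
                    "index [" + index.name() + "] was created in major version "
                    + index.createdMajorVersion() + " but the minimum compatible major version is "
                    + minimumCompatible + "; reindex it with an older version before upgrading");
            }
        }
    }
}
```

Because the exception is thrown before any metadata is written, a refused startup leaves the data path untouched, which is what keeps an in-place downgrade plausible.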
On closer inspection the node only starts up at all if the ancient index is closed, and I think that's an oversight, since the restriction to check only open indices was added even though the conversation on the PR in question concluded that the check should apply to all indices (there was another only-open-indices check in that PR which was removed in a later commit). I opened #44264 to address this. My comment above about it being tricky still stands: we've written the node metadata file to disk before failing, so there isn't a truly supported (i.e. covered-by-tests) way forwards once you've hit this.
We discussed this as a team today and are broadly ok with the (untested) expectation that each version will be able to read the node metadata file from future versions. Thus failing a node after writing the node metadata file (but before writing anything else) is still not too late to downgrade back to a known-good version to delete the broken index before trying the upgrade again. This means that we think that #44264 should be backported.
Ok I tested this expectation and it's wrong. If you need to downgrade from ≥7.2 to <7.2 then you will need to delete the node metadata file, which will generate a new node ID, and therefore you might also need further manual steps to recover.
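Why deleting the node metadata file regenerates the node ID can be modelled with a small load-or-create sketch. This is illustrative only: the real node metadata lives in the data path's `_state` directory in a binary format, and the file name used here is made up:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Illustrative sketch: real Elasticsearch persists the node ID inside the
// node metadata in the data path's _state directory; "node-id.txt" is invented.
class NodeIdStore {
    static String loadOrCreateNodeId(Path dataDir) throws IOException {
        Path idFile = dataDir.resolve("node-id.txt");
        if (Files.exists(idFile)) {
            // Reuse the persisted identity across restarts.
            return Files.readString(idFile).trim();
        }
        // No file on disk: the old identity is gone for good, so mint a new one.
        String freshId = UUID.randomUUID().toString();
        Files.writeString(idFile, freshId);
        return freshId;
    }
}
```

The same load-or-create shape explains the comment above: once the file is deleted as part of a downgrade, the node comes back with a fresh ID, and anything keyed on the old ID no longer matches.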
Closing this in favour of #44624, since the fundamental problem is more general than the one described here.
If you create an index in a one-node 5.6 cluster, close it, then upgrade this node to 6.8 and then again to 7.2 without removing the 5.6 index, then the node fails to fail cleanly. Instead, it goes into a loop of repeatedly winning the election, failing the first publication, and trying again.
The node should fail earlier, and harder, than it does today. But it's a bit trickier than that: what does the user do once their upgraded node is refusing to start? By the time we could have noticed we've got a bad index version we will already have constructed the `NodeEnvironment`, and therefore written stuff to disk, and that means a downgrade is now unsafe. (#41731 would positively block a subsequent downgrade in a similar situation with a 6.x -> 7.x -> 8.x double-upgrade.)
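The ordering problem here can be reduced to a toy model: if validation runs before anything touches disk, a refused startup leaves the data path unchanged; if it runs after, the damage is already done. The names below are invented for the sketch and do not correspond to real Elasticsearch classes:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the startup-ordering issue: validation must precede any disk
// write, or a failed check leaves the data path already modified.
class StartupOrder {
    final List<String> diskWrites = new ArrayList<>();

    void startNode(boolean hasAncientIndex) {
        // Validate first, so a refused startup leaves an in-place downgrade
        // exactly as safe as it was before the attempt.
        if (hasAncientIndex) {
            throw new IllegalStateException("cannot start: data path contains an ancient index");
        }
        diskWrites.add("node metadata");
        diskWrites.add("upgraded cluster metadata");
    }
}
```

In the bug as reported the order is effectively reversed: the `NodeEnvironment` (and hence the node metadata) exists before the bad index version can be noticed, which is why the downgrade becomes unsafe.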