Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent in-place downgrades and invalid upgrades #41731

Merged
merged 14 commits into from
May 21, 2019

Conversation

DaveCTurner
Copy link
Contributor

Downgrading an Elasticsearch node to an earlier version is unsupported, because
we do not make any attempt to guarantee that a node can read any of the on-disk
data written by a future version. Yet today we do not actively prevent
downgrades, and sometimes users will attempt to roll back a failed upgrade with
an in-place downgrade and get into an unrecoverable state.

This change adds the current version of the node to the node metadata file, and
checks the version found in this file against the current version at startup.
If the node cannot be sure of its ability to read the on-disk data then it
refuses to start, preserving any on-disk data in its upgraded state.

This change also adds a command-line tool to overwrite the node metadata file
without performing any version checks, to unsafely bypass these checks and
recover the historical and lenient behaviour.

Downgrading an Elasticsearch node to an earlier version is unsupported, because
we do not make any attempt to guarantee that a node can read any of the on-disk
data written by a future version. Yet today we do not actively prevent
downgrades, and sometimes users will attempt to roll back a failed upgrade with
an in-place downgrade and get into an unrecoverable state.

This change adds the current version of the node to the node metadata file, and
checks the version found in this file against the current version at startup.
If the node cannot be sure of its ability to read the on-disk data then it
refuses to start, preserving any on-disk data in its upgraded state.

This change also adds a command-line tool to overwrite the node metadata file
without performing any version checks, to unsafely bypass these checks and
recover the historical and lenient behaviour.
@DaveCTurner DaveCTurner added >enhancement resiliency :Core/Infra/Core Core issues without another label :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. v8.0.0 v7.2.0 labels May 2, 2019
@DaveCTurner DaveCTurner requested review from andrershov and ywelsch May 2, 2019 05:52
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@DaveCTurner
Copy link
Contributor Author

ci/1 failed for mysterious reasons; I opened #41785 to help in future.

@elasticmachine please run elasticsearch-ci/1.

Copy link
Contributor

@andrershov andrershov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @DaveCTurner!
In general looks good, I've left some comments.

return new NodeMetaData(nodeId);
final Version nodeVersion;
if (this.nodeVersion == null) {
assert Version.CURRENT.major <= Version.V_7_0_0.major + 1 : "version is required in the node metadata from v9 onwards";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not to use V_8_0_0 instead of V_7_0_0+1 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we upgrade master to version 9 the constant Version.V_8_0_0 will remain in existence, but Version.V_7_0_0 should be removed, giving a compile-time failure of this assertion.


public NodeMetaData upgradeToCurrentVersion() {
if (nodeVersion.equals(Version.V_EMPTY)) {
assert Version.CURRENT.major <= Version.V_7_0_0.major + 1 : "version is required in the node metadata from v9 onwards";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not to use V_8_0_0 instead of V_7_0_0+1 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we upgrade master to version 9 the constant Version.V_8_0_0 will remain in existence, but Version.V_7_0_0 should be removed, giving a compile-time failure of this assertion.

@Override
public Settings onNodeStopped(String nodeName) {
try {
for (Path dataPath : dataPaths) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop seems to be removing node metadata files. Why is it needed? You anyway re-write the files in the tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is a throwback to a version of this PR that failed the node if the node metadata file was removed but the data directory was nonempty. I think we should be stricter about starting a node on such a data directory, but that change has wider impact on the tests and is worthy of a separate discussion. Removed in 3320d5a.


public class NodeMetaDataTests extends ESTestCase {
private Version randomVersion() {
return rarely() ? Version.fromId(randomInt()) : VersionUtils.randomVersion(random());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we really want to use rarely part here, can we add a comment why Version from any int exists?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, comment added in d4888db.


final Path tempDir = createTempDir();
final Path stateDir = Files.createDirectory(tempDir.resolve(MetaDataStateFormat.STATE_DIR_NAME));
final InputStream resource = this.getClass().getResourceAsStream("testReadsFormatWithoutVersion.binary");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather avoid adding a binary file a resource. Can we have instead a method for creating NodeMetadData without version?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather do it this way, because the important behaviour is that we can correctly read versions of this file written by arbitrarily ancient versions of Elasticsearch of which the given resource is an example. If we create the metadata file in this test then we might change the format in a way that round-trips correctly but which doesn't reflect the behaviour of ancient versions of Elasticsearch any more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this tested by our full / rolling cluster upgrade tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today it is, but in future we won't be able to produce version-free metadata files any more and yet we'll want to have a test saying that we refuse to load them in place of the one here. However the comment I left is misleading and I adjusted it in 9cb083f.

final Version nodeVersion = NodeMetaDataTests.tooNewVersion();
NodeMetaData.FORMAT.writeAndCleanup(new NodeMetaData(nodeId, nodeVersion), nodePaths);
final MockTerminal mockTerminal = new MockTerminal();
mockTerminal.addTextInput("n\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can do some randomization here as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done in e805ec3.

}

//package-private for testing
void testExecute(Terminal terminal, OptionSet options, Environment env) throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My IDE claims this method is not used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So does mine. Removed in 4961fd1.

Copy link
Contributor Author

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @andrershov, I've addressed or answered your comments.


public NodeMetaData upgradeToCurrentVersion() {
if (nodeVersion.equals(Version.V_EMPTY)) {
assert Version.CURRENT.major <= Version.V_7_0_0.major + 1 : "version is required in the node metadata from v9 onwards";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we upgrade master to version 9 the constant Version.V_8_0_0 will remain in existence, but Version.V_7_0_0 should be removed, giving a compile-time failure of this assertion.

return new NodeMetaData(nodeId);
final Version nodeVersion;
if (this.nodeVersion == null) {
assert Version.CURRENT.major <= Version.V_7_0_0.major + 1 : "version is required in the node metadata from v9 onwards";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we upgrade master to version 9 the constant Version.V_8_0_0 will remain in existence, but Version.V_7_0_0 should be removed, giving a compile-time failure of this assertion.


public class NodeMetaDataTests extends ESTestCase {
private Version randomVersion() {
return rarely() ? Version.fromId(randomInt()) : VersionUtils.randomVersion(random());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, comment added in d4888db.


final Path tempDir = createTempDir();
final Path stateDir = Files.createDirectory(tempDir.resolve(MetaDataStateFormat.STATE_DIR_NAME));
final InputStream resource = this.getClass().getResourceAsStream("testReadsFormatWithoutVersion.binary");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather do it this way, because the important behaviour is that we can correctly read versions of this file written by arbitrarily ancient versions of Elasticsearch of which the given resource is an example. If we create the metadata file in this test then we might change the format in a way that round-trips correctly but which doesn't reflect the behaviour of ancient versions of Elasticsearch any more.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few smaller comments, looking good o.w.


[float]
=== Synopsis

[source,shell]
--------------------------------------------------
bin/elasticsearch-node repurpose|unsafe-bootstrap|detach-cluster
bin/elasticsearch-node repurpose|unsafe-bootstrap|detach-cluster|overwrite-version
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about override-minimum-version instead of overwrite-version?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version checks are not simply a minimum. Today they also include a maximum, and in future we may add more complex constraints. But I do prefer override, thanks, I adjusted things in ba6ae23.

try {
return new NodeEnvironment.NodePath(path);
} catch (IOException e) {
throw new ElasticsearchException("Unable to investigate path: " + path + ": " + e.getMessage());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not pass e as cause to the ElasticsearchException?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I blindly moved this method here from NodeRepurposeCommand, but yes an inner exception is preferable, see e12ca62.


final Path tempDir = createTempDir();
final Path stateDir = Files.createDirectory(tempDir.resolve(MetaDataStateFormat.STATE_DIR_NAME));
final InputStream resource = this.getClass().getResourceAsStream("testReadsFormatWithoutVersion.binary");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this tested by our full / rolling cluster upgrade tests?

@DaveCTurner DaveCTurner merged commit 749135b into elastic:master May 21, 2019
@DaveCTurner DaveCTurner deleted the 2019-05-01-prevent-downgrades branch May 21, 2019 06:52
DaveCTurner added a commit that referenced this pull request May 21, 2019
Downgrading an Elasticsearch node to an earlier version is unsupported, because
we do not make any attempt to guarantee that a node can read any of the on-disk
data written by a future version. Yet today we do not actively prevent
downgrades, and sometimes users will attempt to roll back a failed upgrade with
an in-place downgrade and get into an unrecoverable state.

This change adds the current version of the node to the node metadata file, and
checks the version found in this file against the current version at startup.
If the node cannot be sure of its ability to read the on-disk data then it
refuses to start, preserving any on-disk data in its upgraded state.

This change also adds a command-line tool to overwrite the node metadata file
without performing any version checks, to unsafely bypass these checks and
recover the historical and lenient behaviour.
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request May 21, 2019
…der-permit-primary-mode-only

* elastic/master:
  Move the FIPS configuration back to the build plugin (elastic#41989)
  Remove stray back tick that's messing up table format (elastic#41705)
  Add missing comma in code section (elastic#41678)
  add 7.1.1 and 6.8.1 versions (elastic#42253)
  Use spearate testkit dir for each run (elastic#42013)
  Add experimental and warnings to vector functions (elastic#42205)
  Fix version in tests since elastic#41906 was merged
  Bump version in BWC check after backport
  Prevent in-place downgrades and invalid upgrades (elastic#41731)
  Mute date_histo interval bwc test
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request May 21, 2019
* master: (176 commits)
  Avoid unnecessary persistence of retention leases (elastic#42299)
  [ML][TEST] Fix limits in AutodetectMemoryLimitIT (elastic#42279)
  [ML Data Frame] Persist and restore checkpoint and position (elastic#41942)
  mute failing filerealm hash caching tests (elastic#42304)
  Safer Wait for Snapshot Success in ClusterPrivilegeTests (elastic#40943)
  Remove 7.0.2 (elastic#42282)
  Revert "Remove 7.0.2 (elastic#42282)"
  [DOCS] Copied note on slicing support to Slicing section. Closes 26114 (elastic#40426)
  Remove 7.0.2 (elastic#42282)
  Mute all ml_datafeed_crud rolling upgrade tests
  Move the FIPS configuration back to the build plugin (elastic#41989)
  Remove stray back tick that's messing up table format (elastic#41705)
  Add missing comma in code section (elastic#41678)
  add 7.1.1 and 6.8.1 versions (elastic#42253)
  Use spearate testkit dir for each run (elastic#42013)
  Add experimental and warnings to vector functions (elastic#42205)
  Fix version in tests since elastic#41906 was merged
  Bump version in BWC check after backport
  Prevent in-place downgrades and invalid upgrades (elastic#41731)
  Mute date_histo interval bwc test
  ...
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
Downgrading an Elasticsearch node to an earlier version is unsupported, because
we do not make any attempt to guarantee that a node can read any of the on-disk
data written by a future version. Yet today we do not actively prevent
downgrades, and sometimes users will attempt to roll back a failed upgrade with
an in-place downgrade and get into an unrecoverable state.

This change adds the current version of the node to the node metadata file, and
checks the version found in this file against the current version at startup.
If the node cannot be sure of its ability to read the on-disk data then it
refuses to start, preserving any on-disk data in its upgraded state.

This change also adds a command-line tool to overwrite the node metadata file
without performing any version checks, to unsafely bypass these checks and
recover the historical and lenient behaviour.
@colings86 colings86 removed the :Core/Infra/Core Core issues without another label label Jun 18, 2019
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 12, 2019
Today we fail the node at startup if it contains an index that is too old to be
compatible with the current version, unless that index is closed. If the index
is closed then the node will start up and this puts us into a bad state: the
index cannot be opened and must be reindexed using an earlier version, but we
offer no way to get that index into a node running an earlier version so that
it can be reindexed. Downgrading the node in-place is decidedly unsupported and
cannot be expected to work since the node already started up and upgraded the
rest of its metadata. Since elastic#41731 we actively reject downgrades to versions ≥
v7.2.0 too.

This commit prevents the node from starting in the presence of any too-old
indices (closed or not). In particular, it does not write any upgraded metadata
in this situation, increasing the chances an in-place downgrade might be
successful. We still actively reject the downgrade using elastic#41731, because we
wrote the node metadata file before checking the index metadata, but at least
there is a way to override this check.

Relates elastic#21830, elastic#44230
DaveCTurner added a commit that referenced this pull request Jul 15, 2019
Today we fail the node at startup if it contains an index that is too old to be
compatible with the current version, unless that index is closed. If the index
is closed then the node will start up and this puts us into a bad state: the
index cannot be opened and must be reindexed using an earlier version, but we
offer no way to get that index into a node running an earlier version so that
it can be reindexed. Downgrading the node in-place is decidedly unsupported and
cannot be expected to work since the node already started up and upgraded the
rest of its metadata. Since #41731 we actively reject downgrades to versions ≥
v7.2.0 too.

This commit prevents the node from starting in the presence of any too-old
indices (closed or not). In particular, it does not write any upgraded metadata
in this situation, increasing the chances an in-place downgrade might be
successful. We still actively reject the downgrade using #41731, because we
wrote the node metadata file before checking the index metadata, but at least
there is a way to override this check.

Relates #21830, #44230
mkleen added a commit to crate/crate that referenced this pull request May 10, 2021
mkleen added a commit to crate/crate that referenced this pull request May 10, 2021
mkleen added a commit to crate/crate that referenced this pull request May 10, 2021
mkleen added a commit to crate/crate that referenced this pull request May 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >enhancement resiliency v7.2.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants