-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add cluster setting to enable failure store #118662
Add cluster setting to enable failure store #118662
Conversation
15a8e99
to
c932a09
Compare
This setting enables or disables the failure store for data streams based on matching the data stream name against a list of patterns. It acts as a default, and is overridden if the failure store is explicitly enabled or disabled either in a component template or using the data stream options API. (See the PR for explanations of some of the changes here.)
c932a09
to
b06c9e8
Compare
@@ -17,7 +17,6 @@ | |||
import org.elasticsearch.action.ActionRunnable; | |||
import org.elasticsearch.action.ActionType; | |||
import org.elasticsearch.action.DocWriteRequest; | |||
import org.elasticsearch.action.DocWriteRequest.OpType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another drive-by: We were being inconsistent before about whether we were writing DocWriteRequest.OpType
or OpType
.
if (options.ignoreUnavailable()) { | ||
return false; | ||
} else { | ||
throw new FailureIndexNotSupportedException(index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parentDataStream.isFailureStoreEnabled()
check is removed here because we should treat failure store indexes as failure store indexes whether or not the failure store is enabled for the data stream at this moment in time.
The code looks a little different because I've also combined parentDataStream.isFailureStoreIndex(index.getName())
into the previous if
condition, rather than having needlessly nested if
blocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice fix!
@@ -2486,7 +2486,6 @@ private static boolean assertContainsIndexIfDataStream(DataStream parent, IndexM | |||
assert parent == null | |||
|| parent.getIndices().stream().anyMatch(index -> indexMetadata.getIndex().getName().equals(index.getName())) | |||
|| (DataStream.isFailureStoreFeatureFlagEnabled() | |||
&& parent.isFailureStoreEnabled() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parent.isFailureStoreEnabled()
check is removed here because we should treat failure store indexes as failure store indexes whether or not the failure store is enabled for the data stream at this moment in time.
@@ -2512,7 +2511,7 @@ private static void collectDataStreams( | |||
for (Index i : dataStream.getIndices()) { | |||
indexToDataStreamLookup.put(i.getName(), dataStream); | |||
} | |||
if (DataStream.isFailureStoreFeatureFlagEnabled() && dataStream.isFailureStoreEnabled()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above.
@@ -265,12 +266,13 @@ static ClusterState createDataStream( | |||
? MetadataIndexTemplateService.resolveDataStreamOptions(template, systemDataStreamDescriptor.getComponentTemplates()) | |||
: MetadataIndexTemplateService.resolveDataStreamOptions(template, metadata.componentTemplates()); | |||
final DataStreamOptions dataStreamOptions = dataStreamOptionsTemplate.mapAndGet(DataStreamOptions.Template::toDataStreamOptions); | |||
var isFailureStoreEnabled = dataStreamOptions != null && dataStreamOptions.isFailureStoreEnabled(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check on dataStreamOptions != null && dataStreamOptions.isFailureStoreEnabled()
is removed in the two changes below here. Instead, we rely on the initializeFailureStore
parameter only being set if the failure store is enabled. This avoids checking whether the failure store is enabled twice during the same operation, which would (a) be inefficient, and (b) lead to a small chance of bad behaviour if the setting changes between the two checks. This change adds comments to everywhere this boolean value is tracked to make its significance clear and to help demonstrate that this is safe.
failureStoreExplicitlyEnabledCounter++; | ||
} | ||
if (ds.isFailureStoreEffectivelyEnabled(dataStreamFailureStoreSettings)) { | ||
failureStoreEffectivelyEnabledCounter++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the x-pack usage API, we return counts for both explicitly and effectively enabled.
@@ -118,6 +119,7 @@ public static boolean isFailureStoreFeatureFlagEnabled() { | |||
private final DataStreamIndices backingIndices; | |||
private final DataStreamIndices failureIndices; | |||
|
|||
// visible for testing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment was in the wrong place before. This constructor is only used in tests...
@@ -151,7 +153,6 @@ public DataStream( | |||
); | |||
} | |||
|
|||
// visible for testing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...and this one is used in production code.
...treams/src/yamlRestTest/resources/rest-api-spec/test/data_stream/200_require_data_stream.yml
Outdated
Show resolved
Hide resolved
...ams/src/yamlRestTest/resources/rest-api-spec/test/data_stream/210_rollover_failure_store.yml
Outdated
Show resolved
Hide resolved
Pinging @elastic/es-data-management (Team:Data Management) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @PeteGillinElastic ! It really good, I added some minor comments and opened one discussion about when to check isInternal
. Curious to hear your thoughts.
...es/data-streams/src/javaRestTest/java/org/elasticsearch/datastreams/DataStreamOptionsIT.java
Outdated
Show resolved
Hide resolved
if (options.ignoreUnavailable()) { | ||
return false; | ||
} else { | ||
throw new FailureIndexNotSupportedException(index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice fix!
...n/core/src/main/java/org/elasticsearch/xpack/core/datastreams/DataStreamFeatureSetUsage.java
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/cluster/metadata/DataStreamTests.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/cluster/metadata/DataStreamTests.java
Outdated
Show resolved
Hide resolved
assertThat(dataStreamFailureStoreEnabled.isFailureStoreExplicitlyEnabled(), is(true)); | ||
} | ||
|
||
public void testIsFailureStoreEffectivelyEnabled_instanceMethod() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we test here the internal data stream case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, go on, why not!
/** | ||
* Returns whether the settings indicate that the failure store should be enabled by the cluster settings for the given name. | ||
* | ||
* @param name The data stream name | ||
* @param isInternal Whether this is an internal (system or dot-prefix) data stream: if true, this method will always return false | ||
*/ | ||
public boolean failureStoreEnabledForDataStreamName(String name, boolean isInternal) { | ||
assert DataStream.isFailureStoreFeatureFlagEnabled() : "Testing whether failure store is enabled should be behind by feature flag"; | ||
return isInternal == false && failureStoreEnabledByName.test(name); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was a bit surprised to see the isInternal
flag here. It feels to me like we are leaking information. I can walk you through how I was separating the concerns in my head and you can tell me if you agree or not.
The DataStreamFailureStoreSettings
is only holding the failure store as default cluster setting, and since that's a list of regexes I think the predicate very well captures this. I think here, we shouldn't bother with the internal data streams.
The logic that this setting doesn't apply to internal data streams, I think, belongs with the resolution of the effective failure store enabled flag. There we pose the questions:
- "is the flag explicitly set?", if not,
- "is the data stream internal?", if not,
then we check with the settings. Does this make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I totally see what you mean.
The reason I did it this way is that it would be a mistake to ever call settings. failureStoreEnabledForDataStreamName(name)
without also doing the isInternal
check. By putting the check in this method, we force the caller to think about it. Without, it would be possible for the caller to forget about that part of the requirements.
However, when I originally went through that thought process, I thought that we were going to be calling this method in multiple places. I've actually managed to common up the methods in DataStream
so that it's only called from one place. So now my argument doesn't really hold water, as we have exactly one place we have to remember it either way 🙂.
So I think I'll take your advice and change it to the more intuitive thing.
Thanks for the comments @gmarouli . I think I've addressed everything. |
…order more likely to be optimal" This reverts commit e63698c. My thinking here made no sense, sorry for the noise!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work! For me it's almost ready, I left some nits and opened a short, I hope, discussion about the yaml test node feature.
...es/data-streams/src/javaRestTest/java/org/elasticsearch/datastreams/DataStreamOptionsIT.java
Show resolved
Hide resolved
...es/data-streams/src/javaRestTest/java/org/elasticsearch/datastreams/DataStreamOptionsIT.java
Show resolved
Hide resolved
...treams/src/main/java/org/elasticsearch/datastreams/action/TransportGetDataStreamsAction.java
Show resolved
Hide resolved
assertThat( | ||
response.getDataStreams().stream().map(GetDataStreamAction.Response.DataStreamInfo::isFailureStoreEffectivelyEnabled).toList(), | ||
contains(false) | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: If I understand correctly, this returns a single data stream right? I find it easier when I read it to just check that one data stream instead of checking a list that contains false?
Personally, I like more explicit checks such as:
GetDataStreamAction.Response.DataStreamInfo dataStream = response.getDataStreams().getFirst();
assertThat(dataStream.getDataStream().getName(), equalTo("data-stream-1"));
assertThat(dataStream. isFailureStoreEffectivelyEnabled(), equalTo(false));
Otherwise, if you prefer the list contains matching, the other tests are also using this helper method:
assertThat(response.getDataStreams(), transformedItemsMatch(GetDataStreamAction.Response.DataStreamInfo::isFailureStoreEffectivelyEnabled, contains(false)));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the Hamcrest contains
matcher is actually really confusing: This looks like it's checking that at least one element in the list is false
, but it's actually checking that the list contains exactly one element which is false
. (I kind of hate Hamcrest, by the way. More carefully designed libraries like AssertJ and Truth use containsExactly
for this assertion.)
So, while I think what I've written is doing logically the right thing, I agree that it's not readable code. My excuse is that I was following the pattern from elsewhere in this class (see e.g. the assertion in testGetTimeSeriesDataStream
). But there's no reason not to improve things! I'll push something more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, I think there is something similar to containsExactly
. I can look it up if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point was that Hamcrest's contains
actually does do 'contains exactly'. Here's the javadoc:
Creates a matcher for
Iterable
s that matches when a single pass over the examinedIterable
yields a series of items, each logically equal to the corresponding item in the specified items. For a positive match, the examined iterable must be of the same length as the number of specified items.
It's just misleadingly named. Almost everyone would assume that assertThat(someCollection, contains(someValue))
would be equivalent to asserting that someCollection.contains(someValue)
is true, when actually it's a much stronger assertion.
- requires: | ||
reason: "Data stream options was added in 8.18+" | ||
test_runner_features: [ capabilities, allowed_warnings, contains ] | ||
capabilities: | ||
- method: POST | ||
path: /{index}/_doc | ||
capabilities: [ 'failure_store_status' ] | ||
- method: POST | ||
path: /_index_template/{template} | ||
capabilities: [ 'failure_store_in_template' ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking out loud. Theoretically, we need something here to signal that this test can only work with versions that the setting is present (I guess this should be a node feature?).
I think this works now because it is piggy backing on the API capability failure_store_in_template
since they have a chance of being in the same version. However, it is not a given :). So I would like to start the discussion if we want to introduce a different flag or we want to rely on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, yeah, you're right, I was being a bit lazy and piggy-backing on the failure_store_in_template
capability. You only added that last week, so I do think they have a good chance of appearing in the same version... But it's not certain. I've pushed a commit which adds a node feature for this... Let me know what you think.
I do have a couple of follow-up questions:
- Should do
assumeTrue(clusterHasFeature(...))
in the Java REST testDataStreamOptionsIT
? It would be easy enough, but we don't seem to have an equivalent check for the API capability right now, so I'm not sure. - I'm not quite sure why we have this
requires
block in the YAML test. Do we ever run tests fromHEAD
against an older server or a server without all feature flags enabled? (From my limited experimentation, it seems like the effect of therequires
block would be to make the test fail-fast if we did, anyway.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the differences between YAML tests and Java REST tests is that YAML tests are also run in bwc, meaning they run against older versions of elasticsearch. This is why we use these "requirements" to ensure that the tests are going to be run against versions that support the feature.
Also this is why, you do not need this in Java REST test.
Now going back to the node feature. I think we need to think as well, do we need this node feature in the production code? If not, we can rethink if it's useful to add it or not.
Do we ever run tests from HEAD against an older server or a server without all feature flags enabled?
In the tests we try to enable the feature flags, probably we can find the code that enabled failure store in these tests as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But BWC is about running an old test suite against a new cluster, right?
Feature flags are enabled (and can't be disabled) in all snapshot builds:
elasticsearch/server/src/main/java/org/elasticsearch/common/util/FeatureFlag.java
Line 72 in 8134c79
if (build.isSnapshot()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now going back to the node feature. I think we need to think as well, do we need this node feature in the production code? If not, we can rethink if it's useful to add it or not.
I don't really understand the node features framework well enough to answer this question. I'll defer to you, or we can go ask someone who knows this stuff better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I can easily revert that commit if we don't need it.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But BWC is about running an old test suite against a new cluster, right?
No, if I understand it correctly, we run the new test suite to previous versions, including mixed clusters.
Feature flags are enabled (and can't be disabled) in all snapshot builds:
That is true, but we also run the test suites as release, so we explicitly enable the feature flags because we might end up with mixed versions.
I don't really understand the node features framework well enough to answer this question. I'll defer to you, or we can go ask someone who knows this stuff better.
There are certain features that need to be supported by all nodes before they can take effect. For example, hypothetically, we might want to allow users to use the failure store, only when all nodes support the failure store. Otherwise, you might have a node adding failure store but the bulk request goes to a different node that doesn't know about the failure store. This can be ensure via a node feature.
A node feature can be used also in yaml tests to see if the feature is supported.
So, I was wondering here, considering that:
- We are still behind a feature flag.
- I think, without being 100% sure, that the capabilities of these APIs will remain there and will be in the same version as this.
Maybe it depends if we are planning to introduce a general node feature for the failure store. @jbaiera & @dakrone . Any thoughts?
Could we bundle this under a failure store node feature when we remove the feature flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, having slept on it, I think this:
Could we bundle this under a failure store node feature when we remove the feature flag?
is the conclusion I'm leaning towards as well.
Reasoning:
- We will not remove the feature flag until the implementation is complete.
- Features are not, as I understand it, considered supported while they're behind a feature flag (the user would have to set the property or use a snapshot build, and figure out how to use it without documentation).
- Therefore we don't need to support the case of a mixed cluster in which the feature is enabled on some/all nodes and some/all nodes only have a partial implementation.
- Therefore there is no need for a node feature for the cluster setting bit itself, and I should revert the commit which added it.
- However we might want to support the case of a mixed cluster in which the feature is fully-implemented and enabled by default on some nodes and disabled on others (e.g. if we finish the implementation and remove the feature flag in 8.18, and the cluster has a mix of 8.17 and 8.18).
- Therefore we might need to use the node feature mechanism for the whole failure store feature.
- But that shouldn't be part of this PR — perhaps it should be part of the PR which removes the feature flag, just replacing checks on the feature flag with checks on the cluster feature (so it'll be enabled only if every node has the feature).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed in the meeting, I've removed the node feature and added a capability on the PUT /_cluster/settings
API instead (with a comment mentioning that we plan to replace this when we remove the feature gate).
"processors": [ | ||
{ | ||
"fail": { | ||
"message" : "pipeline go boom" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<3
.../yamlRestTest/resources/rest-api-spec/test/data_stream/220_failure_store_cluster_setting.yml
Outdated
Show resolved
Hide resolved
.../yamlRestTest/resources/rest-api-spec/test/data_stream/220_failure_store_cluster_setting.yml
Outdated
Show resolved
Hide resolved
Thanks again @gmarouli , all comments responded to now. |
server/src/main/java/org/elasticsearch/cluster/metadata/DataStreamFailureStoreSettings.java
Show resolved
Hide resolved
I've just pushed a commit to do date math resolution on the name before matching it against the patterns in the cluster setting, as discussed on Slack. As part of this, I removed the test in |
Just rounding up where I think we are on this: There is a thread on BWC, mixed cluster support, node features, and the like here: #118662 (comment). That's the only open issue that I'm aware of. But we haven't heard from @jbaiera yet... |
This reverts commit 2bddabe. Following discussion, we're going to use a capability on the `PUT /_cluster/settings` API instead.
Okay, I've pushed changes for the above, so I think I've resolved all the issues I was aware of and this PR is ready for review again. 😮💨 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think everything LGTM pending final approval from @gmarouli. Sorry for jumping in on this late but great work on the discussions that have already happened so far!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, great work and great collaboration @PeteGillinElastic 🚀
💔 Backport failed
You can use sqren/backport to manually backport by running |
💚 All backports created successfully
Questions ?Please refer to the Backport tool documentation |
This setting enables or disables the failure store for data streams based on matching the data stream name against a list of patterns. It acts as a default, and is overridden if the failure store is explicitly enabled or disabled either in a component template or using the data stream options API. (See the PR for explanations of some of the changes here.) (cherry picked from commit 97e6bb6) # Conflicts: # server/src/main/java/org/elasticsearch/action/bulk/BulkOperation.java # server/src/main/java/org/elasticsearch/node/NodeConstruction.java
This setting enables or disables the failure store for data streams based on matching the data stream name against a list of patterns. It acts as a default, and is overridden if the failure store is explicitly enabled or disabled either in a component template or using the data stream options API. (See the PR for explanations of some of the changes here.) (cherry picked from commit 97e6bb6) # Conflicts: # server/src/main/java/org/elasticsearch/action/bulk/BulkOperation.java # server/src/main/java/org/elasticsearch/node/NodeConstruction.java
This setting enables or disables the failure store for data streams based on matching the data stream name against a list of patterns. It acts as a default, and is overridden if the failure store is explicitly enabled or disabled either in a component template or using the data stream options API.
(See the PR for explanations of some of the changes here.)