
Add cluster setting to enable failure store #118662

Merged

Conversation

PeteGillinElastic
Member

This setting enables or disables the failure store for data streams based on matching the data stream name against a list of patterns. It acts as a default, and is overridden if the failure store is explicitly enabled or disabled either in a component template or using the data stream options API.

(See the PR for explanations of some of the changes here.)
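For readers skimming, here is a minimal sketch of the pattern-based default described above. The pattern values are invented and this is not the PR's exact code path; Regex.simpleMatch is the existing Elasticsearch wildcard-matching utility.

import org.elasticsearch.common.regex.Regex;

// Sketch: the cluster setting is assumed to hold a list of data stream name
// patterns. A data stream whose name matches gets the failure store by default;
// an explicit value from a component template or the data stream options API
// overrides this default either way.
public class FailureStoreDefaultPatternSketch {
    public static void main(String[] args) {
        String[] patterns = { "logs-*", "metrics-*" };
        System.out.println(Regex.simpleMatch(patterns, "logs-app-default"));   // true: enabled by default
        System.out.println(Regex.simpleMatch(patterns, "traces-app-default")); // false: disabled by default
    }
}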

@PeteGillinElastic PeteGillinElastic force-pushed the failure-store-global-config branch 2 times, most recently from 15a8e99 to c932a09 Compare December 13, 2024 14:49
@PeteGillinElastic PeteGillinElastic force-pushed the failure-store-global-config branch from c932a09 to b06c9e8 Compare December 13, 2024 20:27
@@ -17,7 +17,6 @@
import org.elasticsearch.action.ActionRunnable;
import org.elasticsearch.action.ActionType;
import org.elasticsearch.action.DocWriteRequest;
import org.elasticsearch.action.DocWriteRequest.OpType;
Member Author

Another drive-by: We were being inconsistent before about whether we were writing DocWriteRequest.OpType or OpType.

if (options.ignoreUnavailable()) {
return false;
} else {
throw new FailureIndexNotSupportedException(index);
Member Author

The parentDataStream.isFailureStoreEnabled() check is removed here because we should treat failure store indexes as failure store indexes whether or not the failure store is enabled for the data stream at this moment in time.

The code looks a little different because I've also combined parentDataStream.isFailureStoreIndex(index.getName()) into the previous if condition, rather than having needlessly nested if blocks.

Contributor

Nice fix!

@@ -2486,7 +2486,6 @@ private static boolean assertContainsIndexIfDataStream(DataStream parent, IndexM
assert parent == null
|| parent.getIndices().stream().anyMatch(index -> indexMetadata.getIndex().getName().equals(index.getName()))
|| (DataStream.isFailureStoreFeatureFlagEnabled()
&& parent.isFailureStoreEnabled()
Member Author

The parent.isFailureStoreEnabled() check is removed here because we should treat failure store indexes as failure store indexes whether or not the failure store is enabled for the data stream at this moment in time.

@@ -2512,7 +2511,7 @@ private static void collectDataStreams(
for (Index i : dataStream.getIndices()) {
indexToDataStreamLookup.put(i.getName(), dataStream);
}
if (DataStream.isFailureStoreFeatureFlagEnabled() && dataStream.isFailureStoreEnabled()) {
Member Author

Same comment as above.

@@ -265,12 +266,13 @@ static ClusterState createDataStream(
? MetadataIndexTemplateService.resolveDataStreamOptions(template, systemDataStreamDescriptor.getComponentTemplates())
: MetadataIndexTemplateService.resolveDataStreamOptions(template, metadata.componentTemplates());
final DataStreamOptions dataStreamOptions = dataStreamOptionsTemplate.mapAndGet(DataStreamOptions.Template::toDataStreamOptions);
var isFailureStoreEnabled = dataStreamOptions != null && dataStreamOptions.isFailureStoreEnabled();
Member Author

The check on dataStreamOptions != null && dataStreamOptions.isFailureStoreEnabled() is removed in the two changes below. Instead, we rely on the initializeFailureStore parameter only being set if the failure store is enabled. This avoids checking whether the failure store is enabled twice during the same operation, which would (a) be inefficient, and (b) leave a small chance of bad behaviour if the setting changed between the two checks. This change adds comments everywhere this boolean value is tracked, to make its significance clear and to help demonstrate that this is safe.
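A tiny, self-contained sketch of the "check once, pass the boolean down" pattern described above (all names here are invented for illustration, not taken from the PR):

// The enabled state is resolved exactly once per operation, so a concurrent
// settings change cannot yield two different answers within that operation.
final class ResolveOnceSketch {
    static boolean resolveOnce(Boolean explicitOption, boolean enabledByClusterSetting) {
        return explicitOption != null ? explicitOption : enabledByClusterSetting;
    }

    // Downstream code trusts the flag it is handed instead of re-checking.
    static String createDataStream(String name, boolean initializeFailureStore) {
        return initializeFailureStore ? name + " (failure store initialized)" : name;
    }

    public static void main(String[] args) {
        boolean initializeFailureStore = resolveOnce(null, true);
        System.out.println(createDataStream("logs-app-default", initializeFailureStore));
    }
}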

failureStoreExplicitlyEnabledCounter++;
}
if (ds.isFailureStoreEffectivelyEnabled(dataStreamFailureStoreSettings)) {
failureStoreEffectivelyEnabledCounter++;
Member Author

For the x-pack usage API, we return counts for both explicitly and effectively enabled.

@@ -118,6 +119,7 @@ public static boolean isFailureStoreFeatureFlagEnabled() {
private final DataStreamIndices backingIndices;
private final DataStreamIndices failureIndices;

// visible for testing
Member Author

This comment was in the wrong place before. This constructor is only used in tests...

@@ -151,7 +153,6 @@ public DataStream(
);
}

// visible for testing
Member Author

...and this one is used in production code.

@PeteGillinElastic PeteGillinElastic added :Data Management/Data streams Data streams and their lifecycles >non-issue auto-backport Automatically create backport pull requests when merged v8.18.0 labels Dec 13, 2024
@PeteGillinElastic PeteGillinElastic marked this pull request as ready for review December 14, 2024 11:20
@PeteGillinElastic PeteGillinElastic requested a review from a team as a code owner December 14, 2024 11:20
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Dec 14, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@gmarouli gmarouli self-requested a review December 16, 2024 09:00
Contributor

@gmarouli gmarouli left a comment

Great work @PeteGillinElastic! It looks really good. I added some minor comments and opened one discussion about when to check isInternal. Curious to hear your thoughts.

if (options.ignoreUnavailable()) {
return false;
} else {
throw new FailureIndexNotSupportedException(index);
Contributor

Nice fix!

assertThat(dataStreamFailureStoreEnabled.isFailureStoreExplicitlyEnabled(), is(true));
}

public void testIsFailureStoreEffectivelyEnabled_instanceMethod() {
Contributor

Should we test the internal data stream case here?

Member Author

Ah, go on, why not!

Comment on lines 56 to 65
/**
* Returns whether the settings indicate that the failure store should be enabled by the cluster settings for the given name.
*
* @param name The data stream name
* @param isInternal Whether this is an internal (system or dot-prefix) data stream: if true, this method will always return false
*/
public boolean failureStoreEnabledForDataStreamName(String name, boolean isInternal) {
assert DataStream.isFailureStoreFeatureFlagEnabled() : "Testing whether failure store is enabled should be behind by feature flag";
return isInternal == false && failureStoreEnabledByName.test(name);
}
Contributor

I was a bit surprised to see the isInternal flag here. It feels to me like we are leaking information. I can walk you through how I was separating the concerns in my head and you can tell me if you agree or not.

The DataStreamFailureStoreSettings only holds the failure store's default from the cluster setting, and since that's a list of regexes I think the predicate captures it very well. I don't think we should bother with internal data streams here.

The logic that this setting doesn't apply to internal data streams belongs, I think, with the resolution of the effective failure store enabled flag. There we pose the questions (sketched below):

  • "Is the flag explicitly set?" If not,
  • "Is the data stream internal?" If not,
  • then we check with the settings. Does this make sense?
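A minimal sketch of the separation being proposed here (class and method shapes are illustrative assumptions, not the PR's code): the settings object stays a pure name predicate, and the explicit-flag and internal-data-stream checks live in the effective-enabled resolution.

import java.util.function.Predicate;

final class EffectiveEnabledSketch {
    // Plays the role of DataStreamFailureStoreSettings: just a name predicate.
    record FailureStoreSettings(Predicate<String> enabledByName) {
        boolean failureStoreEnabledForDataStreamName(String name) {
            return enabledByName.test(name);
        }
    }

    static boolean isFailureStoreEffectivelyEnabled(
        Boolean explicitlyEnabled,      // from the template / data stream options, may be unset
        boolean isInternal,             // system or dot-prefixed data stream
        String name,
        FailureStoreSettings settings
    ) {
        if (explicitlyEnabled != null) {
            return explicitlyEnabled;   // 1. an explicit flag always wins
        }
        if (isInternal) {
            return false;               // 2. the cluster-setting default never applies to internal data streams
        }
        return settings.failureStoreEnabledForDataStreamName(name); // 3. fall back to the cluster setting
    }
}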

Member Author

Yeah, I totally see what you mean.

The reason I did it this way is that it would be a mistake to ever call settings.failureStoreEnabledForDataStreamName(name) without also doing the isInternal check. By putting the check in this method, we force the caller to think about it. Without it, it would be possible for the caller to forget that part of the requirements.

However, when I originally went through that thought process, I thought that we were going to be calling this method in multiple places. I've actually managed to common up the methods in DataStream so that it's only called from one place. So now my argument doesn't really hold water, as we have exactly one place we have to remember it either way 🙂.

So I think I'll take your advice and change it to the more intuitive thing.

@PeteGillinElastic
Member Author

Thanks for the comments @gmarouli. I think I've addressed everything.

…order more likely to be optimal"

This reverts commit e63698c.

My thinking here made no sense, sorry for the noise!
Contributor

@gmarouli gmarouli left a comment

Awesome work! For me it's almost ready. I left some nits and opened a (hopefully short) discussion about the YAML test node feature.

assertThat(
response.getDataStreams().stream().map(GetDataStreamAction.Response.DataStreamInfo::isFailureStoreEffectivelyEnabled).toList(),
contains(false)
);
Contributor

Nit: if I understand correctly, this returns a single data stream, right? I find it easier to read when we just check that one data stream instead of checking that a list contains false.

Personally, I like more explicit checks such as:

GetDataStreamAction.Response.DataStreamInfo dataStream = response.getDataStreams().getFirst();
assertThat(dataStream.getDataStream().getName(), equalTo("data-stream-1"));
assertThat(dataStream.isFailureStoreEffectivelyEnabled(), equalTo(false));

Otherwise, if you prefer the list contains matching, the other tests are also using this helper method:

assertThat(response.getDataStreams(), transformedItemsMatch(GetDataStreamAction.Response.DataStreamInfo::isFailureStoreEffectivelyEnabled, contains(false)));

Member Author

Yeah, the Hamcrest contains matcher is actually really confusing: This looks like it's checking that at least one element in the list is false, but it's actually checking that the list contains exactly one element which is false. (I kind of hate Hamcrest, by the way. More carefully designed libraries like AssertJ and Truth use containsExactly for this assertion.)

So, while I think what I've written is doing logically the right thing, I agree that it's not readable code. My excuse is that I was following the pattern from elsewhere in this class (see e.g. the assertion in testGetTimeSeriesDataStream). But there's no reason not to improve things! I'll push something more explicit.

Contributor

Thank you, I think there is something similar to containsExactly. I can look it up if you want.

Member Author

My point was that Hamcrest's contains actually does do 'contains exactly'. Here's the javadoc:

Creates a matcher for Iterables that matches when a single pass over the examined Iterable yields a series of items, each logically equal to the corresponding item in the specified items. For a positive match, the examined iterable must be of the same length as the number of specified items.

It's just misleadingly named. Almost everyone would assume that assertThat(someCollection, contains(someValue)) would be equivalent to asserting that someCollection.contains(someValue) is true, when actually it's a much stronger assertion.
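To make that concrete, here is a tiny standalone illustration of the behaviour (standard Hamcrest, not code from this PR):

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.contains;

import java.util.List;

public class HamcrestContainsDemo {
    public static void main(String[] args) {
        // Passes: the iterable has exactly one element, and it is false.
        assertThat(List.of(false), contains(false));

        // Throws AssertionError: although the list "contains" false colloquially,
        // contains() requires the whole iterable to match the given sequence.
        assertThat(List.of(false, true), contains(false));
    }
}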

Comment on lines +2 to +11
- requires:
reason: "Data stream options was added in 8.18+"
test_runner_features: [ capabilities, allowed_warnings, contains ]
capabilities:
- method: POST
path: /{index}/_doc
capabilities: [ 'failure_store_status' ]
- method: POST
path: /_index_template/{template}
capabilities: [ 'failure_store_in_template' ]
Contributor

Thinking out loud: theoretically, we need something here to signal that this test can only run against versions in which the setting is present (I guess this should be a node feature?).

I think this works now because it is piggy-backing on the API capability failure_store_in_template, since they have a chance of being in the same version. However, that is not a given :). So I would like to start a discussion about whether we want to introduce a different flag or rely on this.

Member Author

@PeteGillinElastic PeteGillinElastic Dec 17, 2024

So, yeah, you're right, I was being a bit lazy and piggy-backing on the failure_store_in_template capability. You only added that last week, so I do think they have a good chance of appearing in the same version... But it's not certain. I've pushed a commit which adds a node feature for this... Let me know what you think.

I do have a couple of follow-up questions:

  1. Should we do assumeTrue(clusterHasFeature(...)) in the Java REST test DataStreamOptionsIT? It would be easy enough, but we don't seem to have an equivalent check for the API capability right now, so I'm not sure.
  2. I'm not quite sure why we have this requires block in the YAML test. Do we ever run tests from HEAD against an older server or a server without all feature flags enabled? (From my limited experimentation, it seems like the effect of the requires block would be to make the test fail-fast if we did, anyway.)

Contributor

@gmarouli gmarouli Dec 17, 2024

One of the differences between YAML tests and Java REST tests is that YAML tests also run in BWC suites, meaning they run against older versions of Elasticsearch. This is why we use these "requirements" to ensure that the tests only run against versions that support the feature.

This is also why you do not need this in the Java REST test.

Now, going back to the node feature: I think we also need to consider whether we need this node feature in production code. If not, we can reconsider whether it's useful to add it.

Do we ever run tests from HEAD against an older server or a server without all feature flags enabled?

In the tests we try to enable the feature flags; we can probably find the code that enables the failure store in these tests as well.

Member Author

But BWC is about running an old test suite against a new cluster, right?

Feature flags are enabled (and can't be disabled) in all snapshot builds. So long as you're running the tests against a snapshot build rather than a release build, they should all be on. Unless I've misunderstood... These tests are certainly running locally for me!

Member Author

Now going back to the node feature. I think we need to think as well, do we need this node feature in the production code? If not, we can rethink if it's useful to add it or not.

I don't really understand the node features framework well enough to answer this question. I'll defer to you, or we can go ask someone who knows this stuff better.

Member Author

(I can easily revert that commit if we don't need it.)

Contributor

But BWC is about running an old test suite against a new cluster, right?

No, if I understand it correctly, we run the new test suite against previous versions, including mixed clusters.

Feature flags are enabled (and can't be disabled) in all snapshot builds:

That is true, but we also run the test suites against release builds, so we explicitly enable the feature flags because we might end up with mixed versions.

I don't really understand the node features framework well enough to answer this question. I'll defer to you, or we can go ask someone who knows this stuff better.

There are certain features that need to be supported by all nodes before they can take effect. For example, hypothetically, we might want to allow users to use the failure store only when all nodes support it. Otherwise, you might have one node adding a failure store while the bulk request goes to a different node that doesn't know about the failure store. This can be ensured via a node feature.

A node feature can also be used in YAML tests to check whether the feature is supported.

So, I was wondering here, considering that:

  • We are still behind a feature flag.
  • I think, without being 100% sure, that the capabilities of these APIs will remain there and will be in the same version as this.

Maybe it depends on whether we are planning to introduce a general node feature for the failure store. @jbaiera & @dakrone, any thoughts?

Could we bundle this under a failure store node feature when we remove the feature flag?

Member Author

Yes, having slept on it, I think this:

Could we bundle this under a failure store node feature when we remove the feature flag?

is the conclusion I'm leaning towards as well.

Reasoning:

  1. We will not remove the feature flag until the implementation is complete.
  2. Features are not, as I understand it, considered supported while they're behind a feature flag (the user would have to set the property or use a snapshot build, and figure out how to use it without documentation).
  3. Therefore we don't need to support the case of a mixed cluster in which the feature is enabled on some/all nodes and some/all nodes only have a partial implementation.
  4. Therefore there is no need for a node feature for the cluster setting bit itself, and I should revert the commit which added it.
  5. However we might want to support the case of a mixed cluster in which the feature is fully-implemented and enabled by default on some nodes and disabled on others (e.g. if we finish the implementation and remove the feature flag in 8.18, and the cluster has a mix of 8.17 and 8.18).
  6. Therefore we might need to use the node feature mechanism for the whole failure store feature.
  7. But that shouldn't be part of this PR — perhaps it should be part of the PR which removes the feature flag, just replacing checks on the feature flag with checks on the cluster feature (so it'll be enabled only if every node has the feature).

Member Author

As discussed in the meeting, I've removed the node feature and added a capability on the PUT /_cluster/settings API instead (with a comment mentioning that we plan to replace this when we remove the feature gate).

"processors": [
{
"fail": {
"message" : "pipeline go boom"
Contributor

<3

@PeteGillinElastic
Member Author

Thanks again @gmarouli, all comments responded to now.

@PeteGillinElastic
Member Author

I've just pushed a commit to do date math resolution on the name before matching it against the patterns in the cluster setting, as discussed on Slack. As part of this, I removed the test in TransportBulkActionTests which was directly calling resolveFailureStoreFromTemplate, since I don't think it added any more coverage than we already have from the test which calls resolveFailureInternal, and it's better to test via something closer to the public API. Let me know if you disagree with that.
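For illustration, a rough sketch of the date-math step described above (resolveDateMathExpression and Regex.simpleMatch are existing Elasticsearch utilities, but this is not the PR's exact wiring):

import org.elasticsearch.cluster.metadata.IndexNameExpressionResolver;
import org.elasticsearch.common.regex.Regex;

final class DateMathMatchSketch {
    static boolean matchesSettingPatterns(String requestedName, String[] settingPatterns) {
        // e.g. "<logs-foo-{now/d}>" resolves to a concrete name like "logs-foo-2024.12.19"
        String concreteName = IndexNameExpressionResolver.resolveDateMathExpression(requestedName);
        // The resolved name, not the raw expression, is matched against the cluster setting's patterns.
        return Regex.simpleMatch(settingPatterns, concreteName);
    }
}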

@PeteGillinElastic
Member Author

Just rounding up where I think we are on this:

There is a thread on BWC, mixed cluster support, node features, and the like here: #118662 (comment).

That's the only open issue that I'm aware of.

But we haven't heard from @jbaiera yet...

@PeteGillinElastic
Member Author

Just rounding up where I think we are on this:

There is a thread on BWC, mixed cluster support, node features, and the like here: #118662 (comment).

That's the only open issue that I'm aware of.

But we haven't heard from @jbaiera yet...

Okay, I've pushed changes for the above, so I think I've resolved all the issues I was aware of and this PR is ready for review again. 😮‍💨

@jbaiera jbaiera self-requested a review December 18, 2024 22:40
Member

@jbaiera jbaiera left a comment

I think everything LGTM pending final approval from @gmarouli. Sorry for jumping in on this late but great work on the discussions that have already happened so far!

Contributor

@gmarouli gmarouli left a comment

LGTM, great work and great collaboration @PeteGillinElastic 🚀

@PeteGillinElastic PeteGillinElastic enabled auto-merge (squash) December 19, 2024 15:21
@PeteGillinElastic PeteGillinElastic merged commit 97e6bb6 into elastic:main Dec 19, 2024
16 checks passed
@elasticsearchmachine
Collaborator

💔 Backport failed

Branch 8.x: Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 118662

@PeteGillinElastic
Member Author

💚 All backports created successfully

Branch 8.x: backport created successfully

Questions?

Please refer to the Backport tool documentation

PeteGillinElastic added a commit to PeteGillinElastic/elasticsearch that referenced this pull request Dec 19, 2024
This setting enables or disables the failure store for data streams
based on matching the data stream name against a list of patterns. It
acts as a default, and is overridden if the failure store is
explicitly enabled or disabled either in a component template or using
the data stream options API.

(See the PR for explanations of some of the changes here.)

(cherry picked from commit 97e6bb6)

# Conflicts:
#	server/src/main/java/org/elasticsearch/action/bulk/BulkOperation.java
#	server/src/main/java/org/elasticsearch/node/NodeConstruction.java
PeteGillinElastic added a commit that referenced this pull request Dec 23, 2024
This setting enables or disables the failure store for data streams
based on matching the data stream name against a list of patterns. It
acts as a default, and is overridden if the failure store is
explicitly enabled or disabled either in a component template or using
the data stream options API.

(See the PR for explanations of some of the changes here.)

(cherry picked from commit 97e6bb6)

# Conflicts:
#	server/src/main/java/org/elasticsearch/action/bulk/BulkOperation.java
#	server/src/main/java/org/elasticsearch/node/NodeConstruction.java
Labels
auto-backport, :Data Management/Data streams, >non-issue, Team:Data Management, v8.18.0, v9.0.0