
Create new feature interface for querying features on a node/cluster #100330

Closed
wants to merge 27 commits

Conversation

@thecoop (Member) commented Oct 5, 2023

Expose features on DiscoveryNode, and at node startup publish regular features and handle historical features.

FeatureService is the main class handling this. It is largely static, as the feature specs exposed by a node are a static property of the code, and getting a FeatureService implementation into every DiscoveryNode instance is not feasible.

As part of IT tests, feature specifications are threaded through the internal clusters to MockNodes, where they are added to each node's features without requiring SPI.

@thecoop thecoop added >enhancement :Core/Infra/Core Core issues without another label labels Oct 5, 2023
@elasticsearchmachine (Collaborator) commented

Hi @thecoop, I've created a changelog YAML for you.

@williamrandolph (Contributor) commented

While this has probably been discussed in planning, we're just about at the last stage where we can still discuss naming. The term "feature" is already used in Elasticsearch to describe a group of system indices and data streams: we have a "get features" API, a feature migration API, and users can specify features to include in snapshot operations.

I haven't brought this up before because the two concepts seem distant enough to coexist under the same name in our codebase, and I have no alternative to suggest. But I think it's worth mentioning in case any reviewers spot a possibility for confusion.

@ldematte (Contributor) left a comment

Looks good!
The only concern I have about the statics is testability (overriding/mocking features in tests).
Suppose I want a specific set of features in my tests; how would I mock this? Do I need to specify a FeatureSpecification? Since this loads statically, would it interfere with other tests? Can we get around this somehow (e.g. via specific classloaders for ServiceLoader, or something similar)?
Or do you see other ways, like overriding DiscoveryNode.hasFeature, or something else?

@@ -154,7 +161,23 @@ public DiscoveryNode build() {
             versionInfo = new VersionInformation(version, minIndexVersion, maxIndexVersion);
         }

-        return new DiscoveryNode(name, id, ephemeralId, hostName, hostAddress, address, attributes, roles, versionInfo, externalId);
+        if (features == null) {
+            features = FeatureService.readFeatures();
Contributor:

Not sure about this, or whether it would be better to have Set.of().

Member Author (@thecoop):

If you're not specifying any features, you want the node to have the features of the local codebase, so that your shiny new feature is automatically enabled.

@ldematte (Contributor) commented Oct 6, 2023

getting a FeatureService implementation into every DiscoveryNode instance is not feasible.

Why not?
What about passing neither a Set nor a FeatureService, but a simple interface:

interface FeatureProvider extends Writeable {
    boolean hasFeature(NodeFeature feature);
    // plus FeatureProvider readFrom(in), or others, if needed
}

one closing over both the Set and the FeatureService (if needed)?
That would make testing and mocking easier.

@thecoop thecoop added the :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. label Oct 6, 2023
@thecoop (Member Author) commented Oct 6, 2023

I've added a reset method to ESTestCase that clears the specs before every test, which will stop test specs from leaking between tests.
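
For illustration, a minimal sketch of what such a per-test reset could look like; FeatureService.resetSpecs() is a hypothetical helper name assumed here, not necessarily the actual method:

// In ESTestCase (sketch); requires org.junit.Before.
// FeatureService.resetSpecs() is a hypothetical helper that clears the
// statically-registered feature specifications.
@Before
public void resetFeatureSpecs() {
    FeatureService.resetSpecs();
}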

@thecoop thecoop marked this pull request as ready for review October 6, 2023 10:30
@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels Oct 6, 2023
@elasticsearchmachine (Collaborator) commented

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine (Collaborator) commented

Pinging @elastic/es-distributed (Team:Distributed)

@thecoop (Member Author) commented Oct 6, 2023

There are two main reasons I've used statics for FeatureService:

  1. Expediency now. DiscoveryNode is used in a lot of places (hundreds and hundreds), and adding a parameter to the constructor specifying the features to expose will lead to an awful lot of churn
  2. Ease of use later. Currently, everything uses Version, which is either inferred or specified once. Over time, we will be migrating from Version to features. If we add a parameter for features to be specified manually, they will need to be added everywhere, and devs testing with DiscoveryNode will need to remember to add the feature to the list to get their new code to work at all. Here, if they declare an SPI for their feature specs, the feature will be picked up by DiscoveryNode and internal test nodes automatically, so it will Just Work (see the sketch below).

That said, I don't like the use of statics here, but it's a tradeoff with ease-of-use for the majority use case.
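
To make the SPI route concrete, a hedged sketch of what declaring feature specs could look like; the class, package, and feature names here are invented for illustration:

// A hypothetical FeatureSpecification implementation, discovered via
// ServiceLoader by listing its class name in a META-INF/services resource
// file named after the FeatureSpecification interface.
import java.util.Set;

public class MyPluginFeatures implements FeatureSpecification {
    public static final NodeFeature SHINY_NEW_FEATURE = new NodeFeature("my_plugin.shiny_new_feature");

    @Override
    public Set<NodeFeature> getFeatures() {
        return Set.of(SHINY_NEW_FEATURE);
    }
}

With something like this in place, DiscoveryNode and internal test nodes would pick the feature up automatically, as described above.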

@thecoop (Member Author) commented Oct 6, 2023

Is there a potential problem here with upgrades across the version boundary...

  1. All nodes on 8.11
  2. All nodes (except for master) upgraded to 8.12 with several features declared, cluster state does not have any features due to serialization.
  3. Master upgraded to 8.12. Another node becomes master.

At step 3, are all the DiscoveryNodes re-registered with the new master, and so send the master their features, or does it continue to use the cluster state created by the 8.11 master, so it won't have any features in it until each node re-joins the cluster at some point later?

@thecoop thecoop requested a review from ldematte October 6, 2023 10:45
@thecoop (Member Author) commented Oct 9, 2023

Why not? What about passing neither a Set nor a FeatureService, but a simple interface closing over both the Set and the FeatureService (if needed)? That would make testing and mocking easier.

I've had a look at this, and the difficulty comes from deserialization - DiscoveryNode needs to get an instance of this from nothing within the deserialization constructor. Maybe when the serialization is consolidated in DiscoveryNodes there'll be a way to pull an instance from the parent DiscoveryNodes, but that's not currently possible.

@ldematte (Contributor) commented

DiscoveryNode needs to get an instance of this from nothing within the deser constructor

Yeah, that's definitely the tricky bit; it clashes with the goal of minimizing what we serialize/deserialize. Of course you could complicate the logic in the injected FeatureProvider to keep the historical set separate and reconstruct its internal set during deserialization (FeatureProvider readFrom(in)), but then how do you inject it in the first place? We'd be back to square one, with added complexity. What we have now is better.

@ldematte (Contributor) left a comment

LGTM

@rjernst (Member) left a comment

Thanks for getting this together so quickly! I have a few thoughts.

-        final Function<Settings, PluginsService> pluginServiceCtor,
         Environment initialEnvironment,
+        Function<Settings, PluginsService> pluginServiceCtor,
+        Collection<? extends FeatureSpecification> additionalFeatureSpecs,
Member:

Instead of adjusting the ctor of Node, can we make this available for test overriding like other capabilities here? Have a method which loads the features (or feature service), and then have MockNode use a special version of that if a marker plugin exists.

Member Author (@thecoop):

I tried that, but there's an issue with subclass-superclass construction - FeatureService needs to be created in the Node constructor, but if we override it in MockNode, the field storing the feature specs in MockNode won't be assigned when the superclass Node constructor runs. All the other overrides don't access fields in MockNode.

Member:

MockNode doesn't need to contain any additional members, so there shouldn't be any field access necessary. The pattern used throughout Node is to have a protected, overridable method, e.g. newFeatureService(). The MockNode ctor wouldn't be affected, only the overridable method, which would be called from the Node ctor. See for example createBigArrays or newSearchService (we are inconsistent about naming...).
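
A hedged sketch of the pattern being suggested; newFeatureService() is the proposed hook, not existing API, and the bodies are illustrative only:

// In Node (sketch): a protected factory method, called from the Node ctor.
protected FeatureService newFeatureService() {
    return new FeatureService(/* specs loaded via SPI */);
}

// In MockNode (sketch): override without touching MockNode fields, since
// the Node ctor runs before any MockNode fields are assigned.
@Override
protected FeatureService newFeatureService() {
    return new FeatureService(/* test specs, e.g. derived from a marker plugin */);
}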

@thecoop (Member Author) commented Oct 12, 2023

Could you give an example of the change you're thinking of? Particularly how the node-specific feature specs get into the Node instance constructor.

Member Author (@thecoop):

The refactor in #100768 will make it a lot easier to introduce this to mock nodes

@@ -473,6 +476,9 @@ protected Node(
         SettingsExtension.load().forEach(e -> additionalSettings.addAll(e.getSettings()));
         client = new NodeClient(settings, threadPool);

+        FeatureService.registerSpecificationsFrom(pluginsService, additionalFeatureSpecs);
Member:

Instead of having a static method here, couldn't this be passed into the ctor on the next line?

Member Author (@thecoop):

I wanted to make it clear what is stored statically and what is in the service instance. Specifying it in the constructor hides the fact that the specs are actually stored statically.

Member:

While the specs may be stored statically, it is better for tests if they are non-static (so, e.g., each node can be independent). Is there a reason the specs couldn't be loaded in the ctor altogether?

Member Author (@thecoop):

See the discussion #100330 (comment) and #100330 (comment) - DiscoveryNode needs to get the feature specs from nothing, which means they need to be static.

Member Author (@thecoop):

The features can be overridden per-node by implementing ESIntegTestCase.nodeFeatureSpecifications, but I'm not sure if there's a better way to do it - see #100330 (comment)

/**
 * Returns information on historical features that should be added to all nodes at or above the {@link Version} specified.
 */
default Map<NodeFeature, Version> getHistoricalFeatures() {
    return Map.of();
}
Member:

Can we avoid using Version here? Internally to compare these historic versions we need a very reduced concept of Version, so I wonder if we could hardcode that there, so as not to hold up overall reduction of uses of Version.

@thecoop (Member Author) commented Oct 11, 2023

Historical features need to be applied based on the release version nodes are running - their Version. Similar to the issues the ML team are having with converting MlConfigVersion, we need some way to identify previous release versions consistently.

We will probably replace Version with something else later on, and change this interface accordingly, but for the moment, Version is what we've got.
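
For illustration, a hedged sketch of a historical feature declaration; the feature id and version shown are invented:

// Hypothetical spec: any node at or above 8.1.0 is taken to have this feature.
import java.util.Map;

public class ExampleHistoricalFeatures implements FeatureSpecification {
    private static final NodeFeature EXAMPLE_FEATURE = new NodeFeature("example.some_behaviour");

    @Override
    public Map<NodeFeature, Version> getHistoricalFeatures() {
        return Map.of(EXAMPLE_FEATURE, Version.V_8_1_0);
    }
}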

private static volatile Set<String> NODE_FEATURES;
private static volatile NavigableMap<Version, Set<String>> HISTORICAL_FEATURES;

private static void readSpecs() {
Member:

nit: the terminology here is confusing - these really should be server features, i.e. non-plugin ones?

Member Author (@thecoop):

This loads the FeatureSpecification implementations, not the features themselves

Member:

It's still specs from server, vs those from plugins, right? I think we should have some distinction, otherwise it seems like these are all the specs, when they're only some.

@thecoop (Member Author) commented Oct 12, 2023

Not necessarily; in unit tests this will read all the specs, as the modules under test are loaded by the main classloader.

@thecoop (Member Author) commented Oct 13, 2023

Is there a potential problem here with upgrades across the version boundary...

  1. All nodes on 8.11
  2. All nodes (except for master) upgraded to 8.12 with several features declared, cluster state does not have any features due to serialization.
  3. Master upgraded to 8.12. Another node becomes master.

At step 3, are all the DiscoveryNodes re-registered with the new master, and so send the master their features, or does it continue to use the cluster state created by the 8.11 master, so it won't have any features in it until each node re-joins the cluster at some point later?

We do indeed need to worry about this, with a fixup listener similar to TransportVersionsFixupListener. This will also require adding features to node-info responses.
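
For illustration only, a rough sketch of the shape such a fixup could take, modelled loosely on TransportVersionsFixupListener; the class name and the steps in the comments are assumptions, not the eventual implementation:

// Hypothetical listener that back-fills features into the cluster state when
// an upgraded node becomes master after an 8.11 master wrote a state with no features.
class NodeFeaturesFixupListener implements ClusterStateListener {
    @Override
    public void clusterChanged(ClusterChangedEvent event) {
        if (event.localNodeMaster() == false) {
            return; // only the elected master performs the fixup
        }
        // For any node whose recorded feature set is empty: fetch its features
        // via a node-info request, then submit a cluster state update task
        // that fills in the missing sets.
    }
}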

@thecoop (Member Author) commented Oct 13, 2023

We can ensure this works for serverless by making sure the feature API gets pushed into all serverless deployments with no features declared (so the default of the empty set is actually correct). We can then happily add features to serverless. This is only really an issue for release deployments, where we will be going from no feature set at all to a populated feature set.

@DaveCTurner (Contributor) left a comment

This may sound like déjà vu, but I'm not convinced we want to add this to DiscoveryNode directly. I worry about subtle BwC problems related to applying a cluster state from nodes that don't have features in the DiscoveryNode object. I know we said we could add a post-upgrade fix-up thing, but that won't fix up any DiscoveryNode objects that already exist and have been stored somewhere else (e.g. the keys of the map in org.elasticsearch.transport.ClusterConnectionManager#connectedNodes).

I expect these features to be cluster-wide things - either the cluster knows about feature X or it doesn't, and the fact that this computation is based on the cluster membership is kind of an implementation detail. So I'd encourage thinking about making these things separate from the DiscoveryNode objects shared in the transport handshake, rather like we did with TransportVersion.

We also pass DiscoveryNode objects over the wire in various places where we want to just identify a node. They're already rather overweight for that purpose unfortunately (see e.g. #99744) but I reckon this set of features is going to grow pretty fast and make that problem a whole lot worse.

Do we need some kind of join barrier for features? Once we've reported that a cluster supports feature X, it seems that it might be unsafe to then go back into a state where it doesn't.

@thecoop (Member Author) commented Oct 16, 2023

@DaveCTurner Yes, the plan is to use features for a join barrier, as a feature ratchet - a node won't be able to join the cluster if it is missing one or more features that are present in every other node in the cluster (with a few exceptions).

I would say this does belong alongside Version on DiscoveryNode, as it is supplanting the concept and use cases of node version. I agree that DiscoveryNode is becoming too big, but that larger refactor should be done separately from this piece of work. This PR is blocking lots of other significant work, so we should get this in soon.
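
To make the "feature ratchet" concrete, a hedged sketch of the check a join validator could perform; the method and its wiring are invented for illustration:

// Hypothetical join validation: reject a joining node that is missing a
// feature which every current cluster member supports.
static void validateJoin(Set<String> clusterWideFeatures, Set<String> joiningNodeFeatures) {
    for (String feature : clusterWideFeatures) {
        if (joiningNodeFeatures.contains(feature) == false) {
            throw new IllegalStateException("node cannot join: missing feature [" + feature + "]");
        }
    }
}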

@DaveCTurner (Contributor) commented

This PR is blocking lots of other significant work, so we should get this in soon.

That is not a good reason to compromise on the design IMO. These BwC concerns are very difficult to resolve once released, and I'd rather not repeat the mistakes we made with pluggable node roles in the 7.x series.

@thecoop (Member Author) commented Oct 16, 2023

After discussing with @DaveCTurner and @rjernst, we're going to take a slightly different approach to this:

  • Feature information will only be accessible consolidated across the whole cluster, not on individual DiscoveryNode objects.
  • Feature sets will be sent across on initial connection with the master (and updated on master re-joins) and consolidated in DiscoveryNodes or ClusterState itself
  • Any code that is checking the features of individual nodes (via Version) should not be doing that, and in time needs to be refactored to only check across the whole cluster
  • There might be a use case for checking features across a subset of the cluster (eg partition by role), we will need to see if it is needed as we move more code over to features.

I'll do an initial pass over code that is checking the version of individual nodes and see if there are any blockers to changing those to cluster-wide checks (sketched below).
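
As a hedged illustration of the cluster-wide check this approach implies (a sketch under assumed types, not the API of the follow-up PR):

// Hypothetical consolidated check: a feature counts as present only when
// every node currently in the cluster reports it.
static boolean clusterHasFeature(Map<String, Set<String>> featuresByNodeId, String featureId) {
    return featuresByNodeId.values().stream().allMatch(features -> features.contains(featureId));
}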

@thecoop thecoop requested a review from ldematte October 16, 2023 16:11
@thecoop (Member Author) commented Oct 16, 2023

I'm closing this PR & will open a new one for the new implementation, to help clear the floor a bit

@thecoop thecoop closed this Oct 16, 2023
@ldematte (Contributor) commented

Any code that is checking the features of individual nodes should not be doing that, and in time needs to be refactored to only check across the whole cluster

Sounds right. I only found two instances of this in tests so far; one was easily rewritten as a check against the whole cluster. The second was about checking different things (warning headers, IIRC) depending on whether the cluster was entirely or only partially on a version that had the feature. But I suspect that could also be rewritten with a different assertion.
