
Create new feature interface for querying features on a node/cluster #100330

Closed
wants to merge 27 commits

Conversation

@thecoop (Member) commented Oct 5, 2023

Expose features on DiscoveryNode, and at node startup publish regular features and handle historical features.

FeatureService is the main class handling this. It is largely static, as the feature specs exposed by a node are a static property of the code, and getting a FeatureService implementation into every DiscoveryNode instance is not feasible.

As part of IT tests, feature specifications are threaded through the internal clusters to MockNodes, where they are added to each node's features without requiring SPI.

@thecoop thecoop added >enhancement :Core/Infra/Core Core issues without another label labels Oct 5, 2023
@elasticsearchmachine (Collaborator) commented

Hi @thecoop, I've created a changelog YAML for you.

@williamrandolph (Contributor) commented

While this has probably been discussed in planning, we're just about at the last stage where we can still discuss naming. The term "feature" is already used in Elasticsearch to describe a group of system indices and data streams: we have a "get features" API, a feature migration API, and users can specify features to include in snapshot operations.

I haven't brought this up before because the two concepts seem distant enough to coexist under the same name in our codebase, and I have no alternative to suggest. But I think it's worth mentioning in case any reviewers spot a possibility for confusion.

@ldematte (Contributor) left a comment

Looks good!
The only concern I have about the statics is testability (overriding/mocking features in tests).
Suppose I want a specific set of features in my tests; how would I mock this? Do I need to specify a FeatureSpecification? Since this loads statically, would it interfere with other tests? Can we get around this somehow (e.g. via specific classloaders for ServiceLoader, or something similar)?
Or do you see other ways, like overriding DiscoveryNode.hasFeature, or something else?

@@ -154,7 +161,23 @@ public DiscoveryNode build() {
             versionInfo = new VersionInformation(version, minIndexVersion, maxIndexVersion);
         }

-        return new DiscoveryNode(name, id, ephemeralId, hostName, hostAddress, address, attributes, roles, versionInfo, externalId);
+        if (features == null) {
+            features = FeatureService.readFeatures();
Contributor:

Not sure about this, or whether it would be better to have Set.of().

Member Author (@thecoop):

If you're not specifying any features, you want the node to have the features of the local codebase, so that your shiny new feature is automatically enabled.

@ldematte (Contributor) commented Oct 6, 2023

getting a FeatureService implementation into every DiscoveryNode instance is not feasible.

Why not?
What about passing neither a Set nor a FeatureService, but a simple interface:

interface FeatureProvider extends Writeable {
    boolean hasFeature(NodeFeature feature);
    // plus FeatureProvider readFrom(in), or others, if needed
}

one closing over both the Set and the FeatureService (if needed)?
That would make testing and mocking easier.

@thecoop thecoop added the :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. label Oct 6, 2023
@thecoop (Member Author) commented Oct 6, 2023

I've added a reset method to ESTestCase that clears the specs before every test, which will stop test specs from leaking between tests.
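
For illustration, a minimal sketch of what such a per-test reset could look like; FeatureService.resetSpecs() is a hypothetical helper name assumed here, not necessarily the actual method:

// In ESTestCase (sketch); requires org.junit.Before.
// FeatureService.resetSpecs() is a hypothetical helper that clears the
// statically-registered feature specifications.
@Before
public void resetFeatureSpecs() {
    FeatureService.resetSpecs();
}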

@thecoop thecoop marked this pull request as ready for review October 6, 2023 10:30
@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels Oct 6, 2023
@elasticsearchmachine (Collaborator) commented

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine (Collaborator) commented

Pinging @elastic/es-distributed (Team:Distributed)

@thecoop (Member Author) commented Oct 6, 2023

There are two main reasons I've used statics for FeatureService:

  1. Expediency now. DiscoveryNode is used in a lot of places (hundreds and hundreds), and adding a parameter to the constructor specifying the features to expose will lead to an awful lot of churn
  2. Ease of use later. Currently, everything uses Version, which is either inferred or specified once. Over time, we will be migrating from Version to features. If we add a parameter for features to be specified manually, they will need to be added everywhere, and devs testing with DiscoveryNode will need to remember to add the feature to the list to get their new code to work at all. Here, if they declare an SPI for their feature specs, the feature will be picked up by DiscoveryNode and internal test nodes automatically, so it will Just Work (see the sketch below).

That said, I don't like the use of statics here, but it's a tradeoff with ease-of-use for the majority use case.
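
To make the SPI route concrete, a hedged sketch of what declaring feature specs could look like; the class, package, and feature names here are invented for illustration:

// A hypothetical FeatureSpecification implementation, discovered via
// ServiceLoader by listing its class name in a META-INF/services resource
// file named after the FeatureSpecification interface.
import java.util.Set;

public class MyPluginFeatures implements FeatureSpecification {
    public static final NodeFeature SHINY_NEW_FEATURE = new NodeFeature("my_plugin.shiny_new_feature");

    @Override
    public Set<NodeFeature> getFeatures() {
        return Set.of(SHINY_NEW_FEATURE);
    }
}

With something like this in place, DiscoveryNode and internal test nodes would pick the feature up automatically, as described above.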

@thecoop (Member Author) commented Oct 6, 2023

Is there a potential problem here with upgrades across the version boundary...

  1. All nodes on 8.11
  2. All nodes (except for master) upgraded to 8.12 with several features declared, cluster state does not have any features due to serialization.
  3. Master upgraded to 8.12. Another node becomes master.

At step 3, are all the DiscoveryNodes re-registered with the new master, and so send the master their features, or does it continue to use the cluster state created by the 8.11 master, so it won't have any features in it until each node re-joins the cluster at some point later?

@thecoop thecoop requested a review from ldematte October 6, 2023 10:45
@thecoop (Member Author) commented Oct 9, 2023

Why not? What about passing neither a Set nor a FeatureService, but a simple interface closing over both the Set and the FeatureService (if needed)? That would make testing and mocking easier.

I've had a look at this, and the difficulty comes from deserialization - DiscoveryNode needs to get an instance of this from nothing within the deserialization constructor. Maybe when the serialization is consolidated in DiscoveryNodes there'll be a way to pull an instance from the parent DiscoveryNodes, but that's not currently possible.

@ldematte (Contributor) commented

DiscoveryNode needs to get an instance of this from nothing within the deser constructor

Yeah, that's definitely the tricky bit; it clashes with the goal of minimizing what we serialize/deserialize. Of course you could complicate the logic in the injected FeatureProvider to keep the historical set separate and reconstruct its internal set during deserialization (FeatureProvider readFrom(in)), but then how do you inject it in the first place? We'd be back to square one, with added complexity. What we have now is better.

@ldematte (Contributor) left a comment

LGTM

@rjernst (Member) left a comment

Thanks for getting this together so quickly! I have a few thoughts.

-        final Function<Settings, PluginsService> pluginServiceCtor,
         Environment initialEnvironment,
+        Function<Settings, PluginsService> pluginServiceCtor,
+        Collection<? extends FeatureSpecification> additionalFeatureSpecs,
Member:

Instead of adjusting the ctor of Node, can we make this available for test overriding like other capabilities here? Have a method which loads the features (or feature service), and then have MockNode use a special version of that if a marker plugin exists.

Member Author (@thecoop):

I tried that, but there's an issue with subclass-superclass construction - FeatureService needs to be created in the Node constructor, but if we override it in MockNode, the field storing the feature specs in MockNode won't be assigned when the superclass Node constructor runs. All the other overrides don't access fields in MockNode.

Member:

MockNode doesn't need to contain any additional members, so there shouldn't be any field access necessary. The pattern used throughout Node is to have a protected, overridable method, e.g. newFeatureService(). The MockNode ctor wouldn't be affected, only the overridable method, which would be called from the Node ctor. See for example createBigArrays or newSearchService (we are inconsistent about naming...).
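
A hedged sketch of the pattern being suggested; newFeatureService() is the proposed hook, not existing API, and the bodies are illustrative only:

// In Node (sketch): a protected factory method, called from the Node ctor.
protected FeatureService newFeatureService() {
    return new FeatureService(/* specs loaded via SPI */);
}

// In MockNode (sketch): override without touching MockNode fields, since
// the Node ctor runs before any MockNode fields are assigned.
@Override
protected FeatureService newFeatureService() {
    return new FeatureService(/* test specs, e.g. derived from a marker plugin */);
}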

@thecoop (Member Author) commented Oct 12, 2023

Could you give an example of the change you're thinking of? Particularly how the node-specific feature specs get into the Node instance constructor.

Member Author (@thecoop):

The refactor in #100768 will make it a lot easier to introduce this to mock nodes

@@ -473,6 +476,9 @@ protected Node(
         SettingsExtension.load().forEach(e -> additionalSettings.addAll(e.getSettings()));
         client = new NodeClient(settings, threadPool);

+        FeatureService.registerSpecificationsFrom(pluginsService, additionalFeatureSpecs);
Member:

Instead of having a static method here, couldn't this be passed into the ctor on the next line?

Member Author (@thecoop):

I wanted to make it clear what is stored statically and what is in the service instance. Specifying it in the constructor hides the fact that the specs are actually stored statically.

Member:

While the specs may be stored statically, it is better for tests if they are non-static (so, e.g., each node can be independent). Is there a reason the specs couldn't be loaded in the ctor altogether?

Member Author (@thecoop):

See the discussion #100330 (comment) and #100330 (comment) - DiscoveryNode needs to get the feature specs from nothing, which means they need to be static.

Member Author (@thecoop):

The features can be overridden per-node by implementing ESIntegTestCase.nodeFeatureSpecifications, but I'm not sure if there's a better way to do it - see #100330 (comment)

/**
 * Returns information on historical features that should be added to all nodes at or above the {@link Version} specified.
 */
default Map<NodeFeature, Version> getHistoricalFeatures() {
    return Map.of();
}
Member:

Can we avoid using Version here? Internally to compare these historic versions we need a very reduced concept of Version, so I wonder if we could hardcode that there, so as not to hold up overall reduction of uses of Version.

@thecoop (Member Author) commented Oct 11, 2023

Historical features need to be applied based on the release version nodes are running - their Version. Similar to the issues the ML team are having with converting MlConfigVersion, we need some way to identify previous release versions consistently.

We will probably replace Version with something else later on, and change this interface accordingly, but for the moment, Version is what we've got.
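
For illustration, a hedged sketch of a historical feature declaration; the feature id and version shown are invented:

// Hypothetical spec: any node at or above 8.1.0 is taken to have this feature.
import java.util.Map;

public class ExampleHistoricalFeatures implements FeatureSpecification {
    private static final NodeFeature EXAMPLE_FEATURE = new NodeFeature("example.some_behaviour");

    @Override
    public Map<NodeFeature, Version> getHistoricalFeatures() {
        return Map.of(EXAMPLE_FEATURE, Version.V_8_1_0);
    }
}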

private static volatile Set<String> NODE_FEATURES;
private static volatile NavigableMap<Version, Set<String>> HISTORICAL_FEATURES;

private static void readSpecs() {
Member:

nit: the terminology here is confusing - these really should be server features, i.e. non-plugin ones?

Member Author (@thecoop):

This loads the FeatureSpecification implementations, not the features themselves

Member:

It's still specs from server, vs those from plugins, right? I think we should have some distinction, otherwise it seems like these are all the specs, when they're only some.

@thecoop (Member Author) commented Oct 12, 2023

Not necessarily; in unit tests this will read all the specs, as the modules under test are loaded by the main classloader.

@thecoop (Member Author) commented Oct 13, 2023

Is there a potential problem here with upgrades across the version boundary...

  1. All nodes on 8.11
  2. All nodes (except for master) upgraded to 8.12 with several features declared, cluster state does not have any features due to serialization.
  3. Master upgraded to 8.12. Another node becomes master.

At step 3, are all the DiscoveryNodes re-registered with the new master, and so send the master their features, or does it continue to use the cluster state created by the 8.11 master, so it won't have any features in it until each node re-joins the cluster at some point later?

We do indeed need to worry about this, with a fixup listener similar to TransportVersionsFixupListener. This will also require adding features to node-info responses.
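
For illustration only, a rough sketch of the shape such a fixup could take, modelled loosely on TransportVersionsFixupListener; the class name and the steps in the comments are assumptions, not the eventual implementation:

// Hypothetical listener that back-fills features into the cluster state when
// an upgraded node becomes master after an 8.11 master wrote a state with no features.
class NodeFeaturesFixupListener implements ClusterStateListener {
    @Override
    public void clusterChanged(ClusterChangedEvent event) {
        if (event.localNodeMaster() == false) {
            return; // only the elected master performs the fixup
        }
        // For any node whose recorded feature set is empty: fetch its features
        // via a node-info request, then submit a cluster state update task
        // that fills in the missing sets.
    }
}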

@thecoop (Member Author) commented Oct 13, 2023

We can ensure this works for serverless by making sure the feature API gets pushed into all serverless deployments with no features declared (so the default of the empty set is actually correct). We can then happily add features to serverless. This is only really an issue for release deployments, where we will be going from no feature set at all to a populated feature set.

@DaveCTurner (Contributor) left a comment

This may sound like déjà vu, but I'm not convinced we want to add this to DiscoveryNode directly. I worry about subtle BwC problems related to applying a cluster state from nodes that don't have features in the DiscoveryNode object. I know we said we could add a post-upgrade fix-up thing, but that won't fix up any DiscoveryNode objects that already exist and have been stored somewhere else (e.g. the keys of the map in org.elasticsearch.transport.ClusterConnectionManager#connectedNodes).

I expect these features to be cluster-wide things - either the cluster knows about feature X or it doesn't, and the fact that this computation is based on the cluster membership is kind of an implementation detail. So I'd encourage thinking about making these things separate from the DiscoveryNode objects shared in the transport handshake, rather like we did with TransportVersion.

We also pass DiscoveryNode objects over the wire in various places where we want to just identify a node. They're already rather overweight for that purpose unfortunately (see e.g. #99744) but I reckon this set of features is going to grow pretty fast and make that problem a whole lot worse.

Do we need some kind of join barrier for features? Once we've reported that a cluster supports feature X, it seems that it might be unsafe to then go back into a state where it doesn't.

@thecoop (Member Author) commented Oct 16, 2023

@DaveCTurner Yes, the plan is to use features for a join barrier, as a feature ratchet - a node won't be able to join the cluster if it is missing one or more features that are present in every other node in the cluster (with a few exceptions).

I would say this does belong alongside Version on DiscoveryNode, as it is supplanting the concept and use cases of node version. I agree that DiscoveryNode is becoming too big, but that larger refactor should be done separately from this piece of work. This PR is blocking lots of other significant work, so we should get this in soon.
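
To make the "feature ratchet" concrete, a hedged sketch of the check a join validator could perform; the method and its wiring are invented for illustration:

// Hypothetical join validation: reject a joining node that is missing a
// feature which every current cluster member supports.
static void validateJoin(Set<String> clusterWideFeatures, Set<String> joiningNodeFeatures) {
    for (String feature : clusterWideFeatures) {
        if (joiningNodeFeatures.contains(feature) == false) {
            throw new IllegalStateException("node cannot join: missing feature [" + feature + "]");
        }
    }
}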

@DaveCTurner (Contributor) commented

This PR is blocking lots of other significant work, so we should get this in soon.

That is not a good reason to compromise on the design IMO. These BwC concerns are very difficult to resolve once released, and I'd rather not repeat the mistakes we made with pluggable node roles in the 7.x series.

@thecoop (Member Author) commented Oct 16, 2023

After discussing with @DaveCTurner and @rjernst, we're going to take a slightly different approach to this:

  • Feature information will only be accessible consolidated across the whole cluster, not on individual DiscoveryNode objects.
  • Feature sets will be sent across on initial connection with the master (and updated on master re-joins) and consolidated in DiscoveryNodes or ClusterState itself
  • Any code that is checking the features of individual nodes (via Version) should not be doing that, and in time needs to be refactored to only check across the whole cluster
  • There might be a use case for checking features across a subset of the cluster (eg partition by role), we will need to see if it is needed as we move more code over to features.

I'll do an initial pass over code that is checking the version of individual nodes and see if there are any blockers to changing those to cluster-wide checks (sketched below).
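
As a hedged illustration of the cluster-wide check this approach implies (a sketch under assumed types, not the API of the follow-up PR):

// Hypothetical consolidated check: a feature counts as present only when
// every node currently in the cluster reports it.
static boolean clusterHasFeature(Map<String, Set<String>> featuresByNodeId, String featureId) {
    return featuresByNodeId.values().stream().allMatch(features -> features.contains(featureId));
}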

@thecoop thecoop requested a review from ldematte October 16, 2023 16:11
@thecoop (Member Author) commented Oct 16, 2023

I'm closing this PR & will open a new one for the new implementation, to help clear the floor a bit

@thecoop thecoop closed this Oct 16, 2023
@ldematte (Contributor) commented

Any code that is checking the features of individual nodes should not be doing that, and in time needs to be refactored to only check across the whole cluster

Sounds right. I only found two instances of this in tests so far; one was easily rewritten as a check against the whole cluster. The second was about checking different things (warning headers, IIRC) depending on whether the cluster was entirely or only partially on a version that had the feature. But I suspect that could also be rewritten with a different assertion.
