Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Extensions] Add ClusterStateRequest parameter to cluster state transport request #7066

Closed

Conversation

dbwiddis
Copy link
Member

Companion PR on SDK: opensearch-project/opensearch-sdk-java#668

Description

Adds the ability to send a ClusterStateRequest parameter when requesting Cluster State, to limit the values returned (and reduce transport bandwidth). To get the old behavior, use new ClusterStateRequest().all() as the parameter.

Issues Resolved

Fixes SDK opensearch-project/OpenSearch#354

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
    Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@shwetathareja
Copy link
Member

@dbwiddis : Looking for more context on the change here.
Different API requests support filtering of response depending on the request param. ClusterStateRequest was one which you handled explicitly. To take an example, NodeStatsRequest is another such request.

So with extensions, each request has to be explicitly added in the ExtensionsTransportHandler. What is the thought process here?

@dbwiddis
Copy link
Member Author

dbwiddis commented Apr 10, 2023

@dbwiddis : Looking for more context on the change here. Different API requests support filtering of response depending on the request param. ClusterStateRequest was one which you handled explicitly. To take an example, NodeStatsRequest is another such request.

So with extensions, each request has to be explicitly added in the ExtensionsTransportHandler. What is the thought process here?

Great question, @shwetathareja, and there may be a more general approach needed eventually for "transport actions" in general, but I'm not sure it's clear yet what that approach is.

In Extensions, our first choice is to use existing clients for these requests. Most API requests are in the specification and automatically generate the appropriate Request/Response classes in the various clients, such as opensearch-java. However, this is not one such request:

  • It is not listed in the Cluster APIs which are the set of APIs currently handled by the clients.
  • There is a REST API for it, the RestClusterStateAction but the routes starting with /_cluster/state are not handled by any existing clients.
    • Further, it's really really hard to even find documentation for this API on the OpenSearch website. Entering "/_cluster/state" into the search box on the OpenSearch website yields many pages, none of which list that API. I had to resort to a Google Search which doesn't point to any OpenSearch documentation which includes that API. The one OpenSearch page linked doesn't include it; there are multiple third-party sites and one for AOS listing it as a common command but not telling how to use it.
    • The clients do provide a performRequest() functionality to send an arbitrary REST request, and I tried to do this. See Why not to get cluster state with ClusterStateRequest using RestClient opensearch-sdk-java#667 for the result of a few hours of my effort prior to submitting this PR. It was possible to recreate the needed /_cluster/state API to send the request. The problem was that the result of that request is just JSON that needed to be parsed back into a ClusterStateResponse object.
      • And this is simply not practical (again see that draft PR for how far I got and where I gave up). There are plenty of toXContent methods converting that response into XContent, but very few fromXContent methods for parsing the Response JSON generated in performRequest() into an actual ClusterStateResponse object. I won't say it's impossible, but I will say that the effort is just not worth it. The correct answer if we want to allow deserialization of the JSON is to implement fromXContent() parsing in the ClusterStateResponse and all its subordinate objects (including any which implement the Custom interface which would mean adding another interface method there requiring such parsing be implemented on all existing Custom implementations... which would be a breaking change)
      • It would be nice if the clients implemented this, which I assume requires adding this API to the Cluster API spec. I don't know why it's not there already. This is probably the right answer. But doesn't unblock our plugin migration and extension development efforts.
  • So this takes us back to the only option really being using the Transport APIs. The ClusterStateRequest and ClusterStateResponse implement Writeable and it's easy to convert them back and forth into the byte streams needed to send them to a handler. This opens up two current possibilities:
    • Directly send them as they are in this PR, and handle them explicitly as individual exceptions to our general rule of using clients, since (a) there's no client implementation and (b) the manual parsing is untenable.
    • Send them using the Extension ProxyAction capability (implemented as ExtensionTransportAction for plugins and RemoteExtensionTransportAction for extensions) which is currently designed to serialize requests/responses like this for execution on another extension. It's not too much work to extend that code to execute OpenSearch actions, and this is a possibility. I made a few draft attempts at integrating with that code before submitting this PR, but they started to get overly complex.
  • I don't think this is going to be the last such request, and I do think there's probably room for a more generic handling of multiple TransportActions since all the involved classes (ActionRequest/ActionResponse) are serializable.
    • There's some middle ground between single use cases here, and overly broad "we accept anything you can serialize" like the ExtensionTransportAction setup. I do expect that as we find more use cases we'll end up combining them at some reasonable point. I don't want to design that now until I see how big the problem is going to be.

I hope this explains my thought process. I'm sure @saratvemulapalli probably has some comments on this as he's trying to integrate the transport handling with protobuf.

@dbwiddis dbwiddis force-pushed the cluster-state-request branch from 50f0569 to ff4380c Compare April 10, 2023 23:01
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@codecov-commenter
Copy link

Codecov Report

Merging #7066 (ff4380c) into main (3ba333b) will decrease coverage by 0.15%.
The diff coverage is 53.40%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@             Coverage Diff              @@
##               main    opensearch-project/OpenSearch#7066      +/-   ##
============================================
- Coverage     70.77%   70.63%   -0.15%     
- Complexity    59367    59447      +80     
============================================
  Files          4825     4852      +27     
  Lines        284369   285165     +796     
  Branches      41021    41112      +91     
============================================
+ Hits         201263   201423     +160     
- Misses        66603    67182     +579     
- Partials      16503    16560      +57     
Impacted Files Coverage Δ
...search/client/SearchPipelineRequestConverters.java 0.00% <0.00%> (ø)
...eline/common/SearchPipelineCommonModulePlugin.java 0.00% <0.00%> (ø)
...ion/admin/cluster/node/info/NodesInfoResponse.java 3.03% <0.00%> (-0.10%) ⬇️
...search/action/search/GetSearchPipelineRequest.java 0.00% <0.00%> (ø)
...earch/action/search/GetSearchPipelineResponse.java 0.00% <0.00%> (ø)
.../org/opensearch/client/support/AbstractClient.java 31.10% <0.00%> (-0.66%) ⬇️
...search/cluster/service/ClusterManagerTaskKeys.java 0.00% <ø> (ø)
...pensearch/common/settings/FeatureFlagSettings.java 50.00% <ø> (ø)
...a/org/opensearch/extensions/ExtensionsManager.java 46.83% <0.00%> (-0.19%) ⬇️
...a/org/opensearch/plugins/SearchPipelinePlugin.java 0.00% <0.00%> (ø)
... and 39 more

... and 469 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@shwetathareja
Copy link
Member

@dbwiddis Thanks for all the details. It helped me understand better.

The correct answer if we want to allow deserialization of the JSON is to implement fromXContent() parsing in the ClusterStateResponse and all its subordinate objects

Yes, if we want to expose the whole of cluster state to users, then implementing it via fromXContent() is the right way forward.

But, I wonder if we want to expose all _cluster/state API constructs to clients. I foresee lot of challenges:

  1. To start with "custom" that you mentioned above. Plugins can add any custom data in cluster state and it would be breaking change if fromXContent is enforced. Also, plugins can choose not to send certain data in API response and mark it private.
  2. Today, these internal data structures are updated just thinking about server side backward compatibility. Now changing anything in those structures could potentially break clients and would result in too much overhead.
  3. If you look at the toXContent method of some of the classes you will notice, not all fields are serialized, only a subset which might not be sufficient to deserialize the object back to current ClusterState data structures. e.g. anywhere Index is serialized in routing information, only its name is serialized not Index UUID. Metadata classes have fromXContent already implemented as it is used to serde to and from disk as well.

I see this API more of helping in debugging and providing it as official Cluster APIs in client could cause lot of maintenance overhead. Until community feels the need with real use cases, we should refrain from exposing it.

I also want to warn that the response of this API could be huge in the order of MBs (for large cluster it could easily be > 100MB depending on their mappings). This could cause significant overhead in terms of cpu, memory and keeping transport threads busy if it is being accessed a lot over the transport across extensions. Today, plugins can access the direct in-memory object from ClusterService as opposed to making transport calls. Also if each extension end up storing the de-serialized cluster state at its end, then overall it could be a significant overhead in terms of memory. I would like to brainstorm/ discuss more on access of cluster state across extensions @dbwiddis / @saratvemulapalli

There's some middle ground between single use cases here, and overly broad "we accept anything you can serialize" like the ExtensionTransportAction setup. I do expect that as we find more use cases we'll end up combining them at some reasonable point. I don't want to design that now until I see how big the problem is going to be.

Yes, i would also like to understand the use cases better here before we design a generic solution here for exposing any TransportAction.

@dbwiddis
Copy link
Member Author

dbwiddis commented Apr 11, 2023

Yes, i would also like to understand the use cases better here before we design a generic solution here for exposing any TransportAction.

Thanks for all the great insight, @shwetathareja ... some of that belongs on the bug opensearch-project/documentation-website#3784 I filed as well.

We had previously (see the diff) returned the state() from the ClusterService as you mentioned.

As for the specific use case of this request, we are migrating the Anomaly Detection plugin to an extension and have encountered this code in the existing plugin:

        ClusterStateRequest clusterStateRequest = new ClusterStateRequest()
            .clear()
            .indices(AnomalyDetectionIndices.ALL_AD_RESULTS_INDEX_PATTERN)
            .metadata(true)
            .local(true)
            .indicesOptions(IndicesOptions.strictExpand());

        adminClient.cluster().state(clusterStateRequest, ActionListener.wrap(clusterStateResponse -> {

The ClusterStateResponse from the ClusterService does not allow filtering by indices, and as you indicated, the response can be quite large for large clusters with many mappings.

How would you suggest filtering the ClusterService response to implement this use case?

@shwetathareja
Copy link
Member

Thanks @dbwiddis for sharing the AnomalyDetection Plugin use case.
So, it looks like it was always making local transport call to the node to fetch the AD indices metadata. With your current change, for AD plugin it will be status quo. It will not add any extra overhead.

How would you suggest filtering the ClusterService response to implement this use case?

For now, for AD we can keep the logic as is.

The bigger question for later is that the plugins which had direct access to ClusterService and were using ClusterState, now would need to resort to transport calls.

@dblock
Copy link
Member

dblock commented Apr 12, 2023

It looks like we are trying to expose cluster state for an extension to know whether an index exists. This seems like a valid concern. Extensions should be able to create/update/delete/check for existence of indices. It should be a first class API.

Cluster state is an internal construct of a cluster-based system. Cluster state or nodes wouldn't exist in a serverless environment, for example, which tells me that is not an API that should not need to be exposed to extensions.

@dbwiddis
Copy link
Member Author

Good point, @dblock ... we should probably just use the Index API for this particular use case. I was trying to migrate existing code (Minimizing the diff helps migration) but this looks like a case where we're digging too deep into the internals and we should use a better (and supported) API for it.

@shwetathareja
Copy link
Member

shwetathareja commented Apr 13, 2023

@dbwiddis : If you are planning to use TransportAction for get index API (use local=true for that as well), this API doesn't return system indices by default. AD index would be a system index. There is a way around using headers, you should first check (could be tricky).

Also, how is AD using the response as of today. There could be difference in cluster state API response and get index API for an index.

@dblock This is an interesting discussion how system indices should be accessed by extensions. Should it be same as any other customer indices. Ideally no, otherwise customer can mess it up, are we relying on access control alone?
Also is it only indices that extension would access or any other metadata from cluster state?

@dbwiddis
Copy link
Member Author

Based on feedback about sending cluster state over transport, I'm closing this PR in favor of targeted API requests for the information we need.

@dbwiddis dbwiddis closed this Apr 13, 2023
@dbwiddis
Copy link
Member Author

Also, how is AD using the response as of today. There could be difference in cluster state API response and get index API.

@shwetathareja there's two uses in AD extension.

The one cited in this thread which this PR was intended to implement was a user-provided index.

The system index use case is used via the cluster service state() method (that this PR was intended to replace!) which probably gives way too much information but apparently does do system indices. I'm sure we can figure out a way to make those calls more efficient. See opensearch-project/opensearch-sdk-java#674

@saratvemulapalli
Copy link
Member

saratvemulapalli commented Apr 13, 2023

hope this explains my thought process. I'm sure @saratvemulapalli probably has some comments on this as he's trying to integrate the transport handling with protobuf.

Cluster state is used by AD to find index information. +1 to @dblock an API to get an index should solve the problem.
If it is a system index @peternied has a proposal to expose access to system indices for extensions[1].

I would like to brainstorm/ discuss more on access of cluster state across extensions @dbwiddis / @saratvemulapalli

I am sure there could be other cases where extensions would need to access cluster state, but we haven't seen anything other than AD for now. Until that day comes with all the feedback here, exposing it as an API is not worth it.

[1]opensearch-project/security#2530

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Add ability to pass ClusterStateRequest object when requesting Cluster State
5 participants