
[Feature Request][RFC] Multi-tenancy as a construct in OpenSearch #13341

Open
msfroh opened this issue Apr 23, 2024 · 20 comments
Labels
enhancement Enhancement or improvement to existing feature or request RFC Issues requesting major changes Roadmap:Stability/Availability/Resiliency Project-wide roadmap label Search Search query, autocomplete ...etc v2.15.0 Issues and PRs related to version 2.15.0

Comments

@msfroh
Collaborator

msfroh commented Apr 23, 2024

Is your feature request related to a problem? Please describe

I've been involved with multiple projects and issues recently that try to deal with the notion of "multi-tenancy" in OpenSearch. That is, they are concerned with identifying, categorizing, managing, and analyzing subsets of traffic hitting an OpenSearch cluster -- usually based on some information about the source of the traffic (a particular application, a user, a class of users, a particular workload).

Examples include:

  1. [PROPOSAL][Query Sandboxing] Query Sandboxing high level approach. #11173 -- Query sandboxing aims to prevent some subset(s) of traffic from overwhelming the cluster or individual nodes, by imposing resource limits.
  2. [RFC] Query insights framework  #11429 -- Query insights can already capture the top N most expensive queries, but it will be more helpful once we can associate queries with a source. (This may also become an input to sandboxing decisions down the line.)
  3. [Search Pipelines] Add a processor that provides fine-grained control over what queries are allowed #10938 -- Some OpenSearch administrators have asked for a feature that would let them restrict what kinds of queries specific users/groups are allowed to send. In order to do that, we first need to identify the users/groups.
  4. [RFC] User Behavior Insights #12084 -- User behavior logging is a little different, because we expect it to deal with a large number of users who probably aren't hitting the cluster directly. This is aimed at users of a search application, where the search application hits the cluster.
  5. Slow logs -- While we don't have an open issue for it (I think), it would be worthwhile to include information about the source of a query in slow logs.

Across these different areas, we've proposed various slightly different ways of identifying the source of traffic to feed the proposed features.

I would like to propose that we first solve the problem of associating traffic with a specific user/workload (collectively "source").

Describe the solution you'd like

We've discussed various approaches to labeling the source of search traffic.

Let the client do it

In user behavior logging, the traffic is coming from a search application, which can presumably identify a user that has logged in to the search application. The application can provide identifying information in the body of a search request. This would also work for any other workload where the administrator has control over the clients that call OpenSearch.

Pros:

  • Very easy to implement in OpenSearch. We just need to add a new property to SearchRequest (or more likely SearchSourceBuilder, since it probably belongs in the body). In the simplest case, this property could just be a string. For more flexibility (e.g. to support the full suite of attributes in the UBI proposal), the property could be an object sent as JSON. Of course, as these labeling properties grow more complex, it also becomes harder for downstream consumers (like query insights) to know which object fields are relevant for categorization.
  • For application builders, it provides a lot of flexibility. Application builders know how they want to categorize workloads for later analysis.

Cons:

  • Doesn't work in an environment where administrators have granted direct cluster access to many users, since they can't assume that users will provide accurate identifying labels.
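To make the later "closed schema" discussion concrete, here is a small illustrative sketch (in Python, not OpenSearch code) of how a downstream consumer might validate a client-supplied label object before trusting its shape; the key names and size limits are hypothetical assumptions, not part of any actual API:

```python
# Illustrative sketch only: validating a client-supplied "labels" object
# against a closed schema. ALLOWED_LABEL_KEYS and MAX_LABEL_VALUE_LEN are
# hypothetical, not real OpenSearch settings.

ALLOWED_LABEL_KEYS = {"application", "user_id", "workload"}
MAX_LABEL_VALUE_LEN = 256

def validate_labels(labels: dict) -> dict:
    """Reject unknown keys and oversized values before accepting labels."""
    if not isinstance(labels, dict):
        raise ValueError("labels must be an object")
    unknown = set(labels) - ALLOWED_LABEL_KEYS
    if unknown:
        raise ValueError(f"unknown label keys: {sorted(unknown)}")
    for key, value in labels.items():
        if not isinstance(value, str) or len(value) > MAX_LABEL_VALUE_LEN:
            raise ValueError(f"label {key!r} must be a short string")
    return labels

# A well-formed label object, as a client might attach to a search request body.
labels = validate_labels({"application": "storefront", "user_id": "u-123"})
```

A closed schema like this keeps downstream consumers (query insights, slow logs) able to rely on a fixed set of fields, at the cost of client flexibility.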

Rule-based labeling

This is the approach that @kaushalmahi12 proposed in his query sandboxing RFC. A component running on the cluster will inspect the incoming search request and assign a label. (Okay, in that proposal, it would assign a sandbox, but it's the same idea targeted to that specific feature.)

Pros:

  • Does not require changes to application code.
  • Does not rely on trusting clients of the cluster to do the right thing.
  • We could provide some sensible defaults. The coordinator node could tag the request with the source IP, for example. It would be great if we could take user identity information from the security plugin, but @peternied keeps telling me that's harder than it sounds (since the identity information might be a monstrous certificate).

Cons:

  • Developing a rule engine (however simple) is more complicated than just adding a property to a search request.
  • Need to worry about rule precedence.
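As a sketch of how simple such a rule engine could start out, here is an illustrative first-match-wins design (an assumption for illustration, not the proposed implementation), where explicit priorities give one straightforward answer to the precedence question:

```python
# Illustrative rule engine sketch: rules are evaluated in priority order and
# the first match wins. The rules shown are hypothetical examples.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    priority: int                    # lower number = higher precedence
    matches: Callable[[dict], bool]  # predicate over request attributes
    label: str

def assign_label(request_attrs: dict, rules: list, default: str = "unlabeled") -> str:
    """Return the label of the highest-priority matching rule."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.matches(request_attrs):
            return rule.label
    return default

rules = [
    Rule(priority=10,
         matches=lambda r: r.get("source_ip", "").startswith("10.0."),
         label="internal"),
    Rule(priority=20,
         matches=lambda r: r.get("index") == "logs-*",
         label="log-analytics"),
]
```

Even this minimal shape surfaces the design questions named above: who maintains the rule list, and what happens when two rules at the same priority both match.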

Custom endpoints for different workloads

This is an evolution of what @peternied proposed in his RFC on views. Over on opensearch-project/security#4069, I linked a Google doc with my proposal for an entity that combines authorization (to the entity), access-control (to a set of indices or index patterns), document-level security (via a filter query), query restrictions, sandbox association, and more.

Pros:

  • Easy to administer. For a given tenant, everything is defined in one place.
  • No ambiguity. No conflicts. When you search via the given endpoint, the specified behavior is exactly what runs.
  • Reduces authorization load for the security plugin. No need to check every single index/alias when you're defining access control on the endpoint.

Cons:

  • Requires an endpoint per differentiated workload. Obviously won't work for something like UBI, where we're trying to learn from a larger user base. Gives a lot of power to administrators, but also a lot of responsibility. Assuming we store these endpoint definitions in cluster state (or any structure available to all coordinator nodes), you probably can't configure more than a few dozen or hundreds of them.
  • Administrators may need to worry about index patterns "accidentally" picking up access to new indices.

What do I recommend?

All of it! Or at least all of it in a phased approach, where we learn as we go. The above proposals are not mutually exclusive, and I can easily imagine scenarios where each is the best option. In particular, if we deliver the "Let the client do it" solution, we immediately unblock all the downstream projects, since all of the proposed options essentially boil down to reacting to labels attached to the SearchRequest (or more likely SearchSourceBuilder).

I think we should start with the first one (Let the client do it), since it's easy to implement. The rule-based approach can coexist, since it runs server-side and can override any client-provided information (or fail the request if the client is trying to be sneaky). I would recommend that as a fast-follow.

The last option is (IMO) nice to have, but limited to a somewhat niche set of installations. It's probably overkill for a small cluster with a few different sources of traffic, but it would be helpful for enterprise use-cases, where it's important to know exactly how a given tenant workload will behave.

Related component

Search

Describe alternatives you've considered

The above discussion covers three alternatives and suggests doing all three. If anyone else has suggestions for other alternatives, please comment!

What about indexing?

I only covered searches above, but there may be some value in applying the same logic to indexing, to identify workloads that are putting undue load on the cluster by sending too many and/or excessively large documents. My preferred approach to avoiding load from indexing is flipping the model from push-based to pull-based (so indexers manage their own load), but that's probably not going to happen any time soon. Also, a pull-based approach means that excessive traffic leads to indexing delays instead of indexers collapsing under load -- you still want to find out who is causing the delays.

@Bukhtawar, you're our resident indexing expert. Do you think we might be able to apply any of the approaches above to indexing? Ideally whatever we define for search would have a clear mirror implementation on the indexing side to provide a consistent user experience.

@msfroh msfroh added enhancement Enhancement or improvement to existing feature or request untriaged RFC Issues requesting major changes labels Apr 23, 2024
@github-actions github-actions bot added the Search Search query, autocomplete ...etc label Apr 23, 2024
@ansjcy
Member

ansjcy commented Apr 23, 2024

Great proposal! From the search visibility point of view, currently the closest thing we can get to the "tenancy" of a search request relies on the thread context injected by the security plugin. The challenge here (and potentially the challenge with implementing the "default rules" mentioned in the RFC) is that the user has to use the security plugin (or we need to implement methods to get "user" info for all identity plugins), and additionally, in VPC domains, the client IP obtained from the thread context is typically the IP of the load balancer (rather than the real client IP). We also briefly discussed some of those challenges in #12740 and #12529.

We really need a way to define multi-tenancy as a first-class feature in OpenSearch, which would allow us to define "tenancy" across various layers. We can start with multi-tenancy for search queries to solve the issues mentioned in query sandboxing and query insights. And we should design the solution to be flexible enough that it can be reused in indexing and plugins as well -- the labeling mechanism outlined in the RFC can be a great first step toward achieving full multi-tenancy in OpenSearch :)

Edit: Adding a very simple workflow diagram for client-side and rule-based labeling solutions, based on my understanding of this RFC:
[image: workflow diagram for client-side and rule-based labeling]

Also, after chatting with @Jon-AtAWS about this topic, I wanted to add several things to keep in mind for tenant labeling:

  • We should limit the number of labels to avoid excessive propagation throughout the search phase. Setting an upper limit for labels may be necessary.
  • Ensure optimal performance for users opting not to use labels ("zero label"). Performance should not be compromised for users who choose not to leverage this feature.
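A minimal sketch of those two safeguards, assuming a hypothetical per-request limit (the constant and function names are illustrative, not actual OpenSearch settings):

```python
# Illustrative sketch: cap the number of labels per request, and make the
# zero-label path a no-op so users who opt out pay no cost.
# MAX_LABELS_PER_REQUEST is a hypothetical limit.
from typing import Optional

MAX_LABELS_PER_REQUEST = 10

def attach_labels(request_context: dict, labels: Optional[dict]) -> dict:
    """Attach labels to a request context, enforcing an upper bound."""
    if not labels:
        # Zero-label fast path: return the context untouched.
        return request_context
    if len(labels) > MAX_LABELS_PER_REQUEST:
        raise ValueError(
            f"too many labels: {len(labels)} > {MAX_LABELS_PER_REQUEST}")
    updated = dict(request_context)
    updated["labels"] = dict(labels)
    return updated
```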

@ansjcy
Member

ansjcy commented Apr 23, 2024

Also @msfroh, how is

Custom endpoints for different workloads

really related to defining the tenancy info (or associating traffic with a specific user/workload)? Please correct me if I'm wrong -- I think it's more of a new way of managing a group of configurations to achieve "authorization, access-control, document-level security, query restrictions and more" for multi-tenant use cases?

@msfroh
Collaborator Author

msfroh commented Apr 23, 2024

Also @msfroh, how is

Custom endpoints for different workloads

really related to defining the tenancy info (or associating traffic with a specific user/workload)?

It's a way of explicitly defining the tenancy via the endpoint. If every tenant gets their own endpoint, it's server-side (so we don't need to trust the clients), and the resolution is easy/well-defined (versus applying rules).

We could just use the endpoints to apply tenant labels, but having the "all-in-one" configuration is a bonus, IMO.
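As an illustration of why resolution is "easy/well-defined" here: tenancy becomes a direct lookup on the request path, with no rule evaluation and no client trust. The URL shape and config fields below are hypothetical sketches, not a proposed API:

```python
# Illustrative sketch: endpoint-based tenant resolution. The path shape
# ("/_tenant/{name}/...") and per-tenant config fields are hypothetical.

TENANT_ENDPOINTS = {
    "team-a": {"indices": ["team-a-*"], "sandbox": "low-priority"},
    "team-b": {"indices": ["team-b-logs"], "sandbox": "default"},
}

def resolve_tenant(path: str):
    """Map a request path like '/_tenant/team-a/_search' to (tenant, config)."""
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2 or parts[0] != "_tenant":
        raise ValueError("not a tenant endpoint")
    tenant = parts[1]
    if tenant not in TENANT_ENDPOINTS:
        raise KeyError(f"unknown tenant: {tenant}")
    return tenant, TENANT_ENDPOINTS[tenant]
```

The trade-off named in the cons shows up directly: every tenant needs an entry in this table, which lives in shared state on every coordinator node.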

@dblock
Member

dblock commented Apr 23, 2024

Is there a variation of option 1 where the client is involved with establishing a session like typical web servers (requests without a session get a cookie from the server which is then passed back/around)?

@Bukhtawar
Collaborator

Thanks @msfroh, I like that the proposal says the tags/identity etc. aren't mutually exclusive. We can define ways to specify an identity, a user label, or for that matter any other attribute associated with a request, while keeping some of it auto-created -- like identity if the user is using a security plugin, or a tag if the client is passing certain attributes.
For features like query sandboxing, we could define a sandbox rule and associate it either with a tag or an identity, to provide the right search experience to end customers while restricting/throttling a limited set of bad identities.

From an indexing perspective, I would try to see if we could use something like #12683, or the likes of data streams/index templates, to define the tenancy at an index level.
Apart from just indexing and search, we need to think about a few isolation boundaries. Here I feel we also need to consider access permissions (like separate index-level encryption keys), and then decisions like pods/cells within a cluster, i.e. how data (indices for a given tenant) can be placed across these pods for logical isolation.

@msfroh
Collaborator Author

msfroh commented Apr 23, 2024

Is there a variation of option 1 where the client is involved with establishing a session like typical web servers (requests without a session get a cookie from the server which is then passed back/around)?

That would be doable. I'm not immediately seeing how it helps for the multi-tenancy case -- as far as I can tell, it would help track "client that sent request A also sent requests B, C, D, and E". I'm not sure how the downstream components (query insights, slow logs, etc.) would use that information.

@peternied
Member

Custom endpoints for different workloads

@msfroh Thanks for the great write up. A couple of comments on the third form of this construct:

  • Scalability: This is an interesting consideration, but it seems in line with the maximum limits on the number of indices in a cluster. For a true enterprise 'multi-tenant' solution, I think there will need to be considerable effort to decouple and rescale OpenSearch. I think this would be a good area to defer :D
  • "Gives a lot of power to administrators, but also a lot of responsibility": food for thought: admins largely have this power today, but it isn't easy to comprehend. Being more transparent about what dials and knobs are available for the service is a big win IMO with little downside.
  • Viability: Views was built as a direct mapping from the view name onto a fixed set of targets for a search request. It would not take much effort to rename and modify the views system to allow for a named set of targets, reshaping the existing feature into _tenant/{my-tenant}/_target/{what-previously-was-the-view-name}/_search. I suspect we'd want to do some additional research into what customers think about the trade-offs of additional endpoints vs. configurability.

@kaushalmahi12
Contributor

kaushalmahi12 commented Apr 23, 2024

Thanks for taking the time to write this up @msfroh!

Let the client do it

I am guessing the new object in the SearchRequest will still be a closed-schema object (I mean there wouldn't be arbitrary fields coming in; e.g., it shouldn't be a Map<String, String>), since this could potentially increase the memory footprint in the cluster for search-heavy workloads.

Rule-based labeling

Regarding this, I don't think it has to have affinity towards the sandboxing feature, since rules will be an entity of their own and can govern other actions as well, such as deciding whether to assign a label for features like sandboxing, query insights, or slow logs.

@peternied
Member

[Triage - attendees 1 2 3 4 5 6 7]
@msfroh Thanks for creating this RFC, looking forward to see where this goes.

@sohami
Collaborator

sohami commented Apr 24, 2024

@msfroh Thanks for the discussion and proposal. As you called out, approaches 1 & 2 are not mutually exclusive, and I think we will need both. I think option 1 can work as an override for certain attribute (or tag) types, but we should not allow all attribute types to be updated with this option, especially attributes around users/roles or other sensitive ones. For example: if an application developer is accessing the cluster as userA, they should not be able to provide a user tag with value userB. The user-related attributes should probably be set on the server side only, using the rule-based mechanism (ignoring for now the complexity, which folks have called out, in obtaining them). Based on this example, it seems to me there are at minimum two categories of attributes: 1) those which can be a random key=value pair, and 2) those derived from the request context and not allowed to be overridden. I think the client-side mechanism can be useful for category 1. We will also probably need to limit the attributes which clients can set via some cluster-defined settings, to avoid an explosion of different tags (this may not be needed right away, but it keeps user applications from sending irrelevant tags and makes them play nice).

Does not require changes to application code.
Does not rely on trusting clients of the cluster to do the right thing.

As you have called out in the pros of option 2, I think it will provide more control to the cluster administrator to begin with and enforce certain tags which can later be used in a meaningful way (whether for insights, logging, or sandboxing). With option 1, the administrator will need to rely on application developers honoring the tagging mechanism. So I am more inclined towards option 2, which could see more adoption than option 1.
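The two attribute categories described above can be sketched as a merge where server-derived (protected) keys always win, so a client accessing the cluster as userA cannot label itself as userB. This is an illustrative sketch with hypothetical key names, not the proposed implementation:

```python
# Illustrative sketch: merge client-supplied labels with server-derived ones.
# PROTECTED_KEYS (category 2) can only come from the server-side context;
# any client attempt to set them is dropped. Key names are hypothetical.

PROTECTED_KEYS = {"user", "roles", "source_ip"}

def merge_labels(client_labels: dict, server_labels: dict) -> dict:
    """Overlay server-derived labels on client labels; protected keys win."""
    merged = {k: v for k, v in client_labels.items() if k not in PROTECTED_KEYS}
    merged.update(server_labels)
    return merged

# The client-supplied "user" is discarded; the server-derived value wins.
result = merge_labels({"workload": "reports", "user": "userB"},
                      {"user": "userA"})
```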

@ansjcy
Member

ansjcy commented Apr 25, 2024

Here's some simple POC code for the customized and rule-based labeling we have been discussing :)
#13374
I added the ability to attach customized labels to a search query, along with a bare-minimum rule-based labeling service to attach default user-related information from the security plugin. I also validated the use case of having the top N queries service in the query insights plugin read those tags.

In the POC the labels are stored as a map, but as @kaushalmahi12 mentioned, we should consider using a closed-schema object instead (possibly a JSON object, as mentioned in the RFC). We should also limit the number of labels we can have. Additionally, this POC only implements the workflow for search; ideally it should be generic enough to extend to other workflows.


@jainankitk
Collaborator

@msfroh @ansjcy - I have similar concerns around security/validation as @sohami regarding Approach 1. Do we need to augment our existing security model to prevent identity spoofing?

@msfroh
Collaborator Author

msfroh commented Jun 3, 2024

@msfroh @ansjcy - I have similar concerns around security/validation as @sohami regarding Approach 1. Do we need to augment our existing security model to prevent identity spoofing?

My thinking is that as soon as you apply approach 2 or 3, it should override any identity passed in the request.

If we're accepting identities passed in the search request, we obviously can't trust them. That's explicitly called out in the "cons" for the approach.

@getsaurabh02 getsaurabh02 added the v2.15.0 Issues and PRs related to version 2.15.0 label Jun 3, 2024
@deshsidd
Contributor

deshsidd commented Jun 3, 2024

My thinking is that as soon as you apply approach 2 or 3, it should override any identity passed in the request.

Agreed with this approach.

Approach 1 will still be required for the use-cases where the traffic is coming from a search application and the application can provide the identifying information as mentioned in the RFC above.

@ansjcy
Member

ansjcy commented Jun 3, 2024

In an ideal world, tenancy labeling should be calculated and provided by a centralized identity system (regardless of which auth method you are using), but unfortunately we don't have one yet. For users who just want to know "who sent what requests", the first approach should be good enough. As long as we don't use the labels for authentication/authorization, the security impact of this approach should be minimal. We can also have rules override all the customized labels, or have a setting to disable customized labeling (or limit the usage of certain important labels).

Ideally, after the related work on the security side to provide an authoritative way to infer user/tenancy information for any type of authentication system, we can then add "the rule" to override/attach the labels based on that.

@kaushalmahi12
Contributor

These labels may be harmless, to some extent, for some of the use cases such as query insights (as long as the cardinality of the labels is low, or QI has safeguards to prevent memory hogs). But if these labels are doing something intrusive, like deciding access to resource distribution (for example, QueryGroup-based resource allocation), then it becomes indispensable to have a safeguard mechanism to avoid these scenarios.

But if we think of it in the long term, do we really want to have authN/authZ for these labels in all the consuming features? Probably not, since all of these features will be providing access to users based on authN/authZ credentials or, let's say, a rule-based technique.

I think these labels should be used purely for routing purposes in the consuming features.

@hdhalter

hdhalter commented Jun 6, 2024

Does this require documentation for 2.15? If so, please raise a doc PR by 6/10/24. Thanks.

@getsaurabh02
Member

@ansjcy Can we create a meta and list all the related enhancements together as milestones for this?

@ansjcy
Member

ansjcy commented Jun 25, 2024

Hi @getsaurabh02! We have this meta issue to track the multi-tenancy effort: #13516

Let me include all the ongoing work in this meta issue.

@ansjcy
Member

ansjcy commented Jun 25, 2024

We had some interesting discussions in this PR opensearch-project/security#4403 with the security plugin folks. @DarshitChanpura @cwperks maybe we can continue the discussions in this thread for better visibility :)

  • How will the rule-based recommendation work with the current authz/authn workflow?
  • Should we use ActionPlugin or have our own plugin interface (this PR explores this approach), instead of iterating through the components that were created to look for an instance of a Rule?

@getsaurabh02 getsaurabh02 moved this from 🆕 New to Later (6 months plus) in Search Project Board Aug 15, 2024