-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Query Sandboxing for search requests #11061
Comments
Thanks @kaushalmahi12 for proposing this. This is going to be super useful for maintaining the resiliency of clusters, especially the large ones with multiple tenants or users. I like the idea of It will then allow system/cluster admins to create/maintain multiple sandboxing configurations and map them to the group of internal users (or tenants). This can then further be integrated with the This will also provide common extension points for associating these user-account/tenant-id with Top-N queries which is focussed on the extended visibility of individual queries. While allowing admins to be able to map these rogue/slow queries back to users. |
I think this idea was discussed in #8879 |
@getsaurabh02 Thanks for your suggestions!
What I meant to convey is that these attributes should be available across all OpenSearch clusters since these attributes will be part of either authN/authZ or part of request attribute. Now within the cluster we can always define new entities and associate them to sandboxes. Does this make sense or I am missing something ?
I agree with it. Since the feature inherently limits the access to resources, |
@kaushalmahi12 Thanks for the proposal! As .getsaurabh02 mentioned, I think with the combination of Query Sanboxing and Top N Queries, OpenSearch admin users can potentially have both better control and better visibility into the rogue queries by tenents. Regarding getting the resource usage data from |
Author - Kaushal Kumar
Parent RFC - #8879
Introduction
In most of the information retrieval systems it is very common to see the performance impact on few tenants of the system caused by other tenants. As few tenants can take significant amount of system resources leaving the others deprived. Hence it becomes critically important for a IR system to minimize the performance impact for all tenants. What could be a better way than providing the user to define and configure the limits and priority for the tenants of the system.
(Node drops)/(Poor performance) due to bad behaving tenant queries are one of the common pain in the OpenSearch clusters. Creating the tenant based performance isolation in OpenSearch becomes critically important improve the resiliency and stability of the cluster with Cx value as bonus.
Tenant here I am assuming an User or Index but not limited to these
Problem
In the current OpenSearch there is no mechanism for tenant based performance isolation for search workloads. We want to enable the admin users of OpenSearch cluster to manage tenant based
Sandboxes
to enforce resource based limits on the tenant queries. Each Sandbox will have a priority which will determine the cancellation order in node duress situations.
Scope
Since we want to partition the resources amongst the tenants on a node, It makes more sense to confine this feature to node level so that node level resiliency is achieved.
Use cases
Proposal
We are proposing to introduce a reactive mechanism to actively track the resources for tenants and cancel in case of oversubscription of system resources for tenant/s. This will help us in identifying and cancelling the rogue queries reactively and help us maintain the node stability.
As part of this proposal we will introduce new software constructs called
Sandbox
which will be attribute based and admin users of OpenSearch cluster can manage (CRUD ability) at node level. . The attributes we are selecting will be generic across all users(Domains/cluster). For the System resources we will track thejvmAllocations
(due to jdk api limitation for thread level current jvm usage) andcpuUtilisation
.Other system resources like network IO, Disk IO we are not considering because of multiple reasons
/proc
but this data is loaded from kernel data structures which stores this info in binary. To access this information per thread stats for IO is 3 sys calls(open, read, close),A Sandbox will track the resources for all the requests associated with it and will try to enforce the resource usage limits per sandbox. We will cancel the queries from low priority sandboxes in node duress scenarios. Since tracking the resource usage could be an overhead for too many sandboxes in the system, we can limit this with cluster level setting to enforce node level count of sandboxes.
We are planning to start with reactive mechanism i,e; track and cancel in case of contention or threshold breaches. But going forward we want to build a robust search query cost estimation framework to cancel majority of the search queries upfront.
Future Improvements
The text was updated successfully, but these errors were encountered: