Improve efficiency of capabilities.resolveCapabilities
#146881
Pinging @elastic/kibana-core (Team:Core)
FWIW our current capabilities switchers:
kibana/x-pack/plugins/enterprise_search/server/plugin.ts Lines 145 to 149 in 463007f
kibana/x-pack/plugins/file_upload/server/capabilities.ts Lines 23 to 26 in 3730dd0
kibana/x-pack/plugins/security/server/authorization/authorization_service.tsx Lines 151 to 152 in ac15ee4
As a first step to understanding the impact we should consider adding an APM span for "resolving capabilities" so it's clear from the APM UI why these calls are being made.
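The span idea above could be sketched as a small timing wrapper around the resolve call. The `startSpan` helper below is a stand-in for an APM span API (the real integration would go through Kibana's APM instrumentation via elastic-apm-node), so the names and shapes here are assumptions for illustration:

```typescript
interface Span {
  end(): void;
}

// Stand-in for an APM tracer: records wall-clock duration and logs it on end.
function startSpan(name: string): Span {
  const startedAt = Date.now();
  return {
    end: () => console.log(`[span] ${name}: ${Date.now() - startedAt}ms`),
  };
}

// Wrap an async operation (e.g. the internal capabilities resolving) in a
// span so its duration is recorded even when the operation throws.
async function withSpan<T>(name: string, fn: () => Promise<T>): Promise<T> {
  const span = startSpan(name);
  try {
    return await fn();
  } finally {
    span.end();
  }
}
```

With something like this around the switcher chain, each `resolveCapabilities` call would show up as a named span in the APM UI.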
Sorry, what kind of bulk operations are you referring to that would be calling capabilities resolving per object in the bulk?
It's just normal ES search calls, let's say if you have to do many in sequence or in parallel. It ends up calling
Hum, I'm probably missing something obvious here, but we're not implicitly resolving capabilities when using the ES client APIs directly. Are you referring to 'bulk' operations using the SO client maybe?
## Summary

Part of #146881

Co-authored-by: Kibana Machine <[email protected]>
@pgayvallet @rudolf is there a timeline yet? Looking at APM data from the overview cluster the p95 for _has_privileges is 80ms and the p50 is 10ms. For simplicity's sake I assume that calls to get a single doc by id have similar performance characteristics. IMHO this could be a big win in improving how snappy Kibana feels.
There isn't atm. |
From what I can tell switchers are independent, but it's hard to be sure. Having a dependency tree feels like much more complexity than if switchers were independent. Would it be sufficient if we extended
Then
Could be more powerful if consumers could supply a filter function instead of a static list, but that's maybe over-engineering this.
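A minimal sketch of what a static-list registration could look like. The names here (`registerSwitcher`'s `capabilityPath` option, `relevantSwitchers`, the wildcard matching) are illustrative assumptions, not the actual Core API:

```typescript
type Capabilities = Record<string, Record<string, boolean>>;
type Switcher = (caps: Capabilities) => Partial<Capabilities>;

interface Registration {
  capabilityPath: string;
  switcher: Switcher;
}

const registrations: Registration[] = [];

// Each switcher declares which capability path it can affect.
function registerSwitcher(switcher: Switcher, opts: { capabilityPath: string }) {
  registrations.push({ capabilityPath: opts.capabilityPath, switcher });
}

// 'ml.*' matches 'ml.canGetJobs'; '*' matches everything.
function pathMatches(pattern: string, path: string): boolean {
  if (pattern === '*') return true;
  if (pattern.endsWith('.*')) return path.startsWith(pattern.slice(0, -1));
  return pattern === path;
}

// Only the switchers whose declared path intersects the requested paths run.
function relevantSwitchers(requested: string[]): Switcher[] {
  return registrations
    .filter(({ capabilityPath }) =>
      requested.some(
        (r) => pathMatches(capabilityPath, r) || pathMatches(r, capabilityPath)
      )
    )
    .map(({ switcher }) => switcher);
}
```

A consumer asking only for `ml.*` would then skip unrelated switchers entirely.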
An alternative might be to use something like our request context to lazily load switchers only when accessed (see #129896). Perhaps this would be the best architecture here, but it feels like a larger refactor.
I'm slightly concerned that with such an approach we might end up with insecure code. I'm wondering if it's possible to design an API in such a way that every
Yeah, this would be similar to how the request context works. It would require that capabilities be in a namespace like ... The other problem the request context already solves, IIRC, is that it caches resolved values. So if there are several resolveCapabilities calls it would only resolve the spaces capabilities once.
We could eventually think of keeping a cache of the resolved capabilities internally within our service. Note that I don't think it would be the proper way to address the problem, but it may be a pragmatic workaround (way less impactful in terms of changes compared to the other alternatives) if most of the perf cost is caused by the fact that we're calling ...

We could potentially even go further in terms of caching if we were able to properly extract (and use) the user's principal instead of the request as the cache's keys. That way, we could re-use capabilities between requests of a given user (and then probably use a time-based - or LRU? - eviction cache). But we lack the notion of a principal.

@azasypkin @rudolf Do you think that internal cache thing could make sense, though? Like, could it be good enough? Or do we want to focus more on the proper approach instead?
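The internal cache idea could look roughly like this, keyed on the request object itself so entries are garbage-collected together with the request. `doResolve` is a stand-in for the real switcher chain, not Kibana code:

```typescript
type Capabilities = Record<string, Record<string, boolean>>;

let resolveCount = 0;

// Stand-in for the real (expensive) capabilities resolution.
function doResolve(_request: object): Promise<Capabilities> {
  resolveCount += 1;
  return Promise.resolve({ ml: { canGetJobs: true } });
}

// WeakMap keyed on the request object: entries cannot outlive the request,
// so no explicit eviction is needed for a per-request cache.
const capabilitiesCache = new WeakMap<object, Promise<Capabilities>>();

function resolveCapabilities(request: object): Promise<Capabilities> {
  let cached = capabilitiesCache.get(request);
  if (!cached) {
    // Cache the promise (not the value) so concurrent callers share one
    // in-flight resolution.
    cached = doResolve(request);
    capabilitiesCache.set(request, cached);
  }
  return cached;
}
```

Caching the promise rather than the resolved value means N concurrent calls within one request still trigger a single resolution.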
That's my understanding: this, and the fact that we invoke all resolvers, even those that might not be relevant for the specific context (but that would likely be a smaller concern if we had a cache of some sort).
Yeah, we've been talking about introducing a notion of user or principal in the core that security can populate for quite some time already (instead of an opaque request).
In theory, I think it could work assuming the cache is very short-lived (user privileges can change at any time, but if cache TTL is just a few seconds it shouldn't be a problem). If it's feasible to have a quick PoC to measure the real impact, then it'd be much easier to decide how reasonable this trade-off is though.
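A short-TTL cache keyed by a user principal, as discussed above, might be sketched like this. The `TtlCache` class and its shape are assumptions for illustration, not an existing Kibana utility:

```typescript
// Minimal TTL cache: entries expire ttlMs after being set, so stale
// privileges are bounded by the (short) TTL.
class TtlCache<T> {
  private readonly entries = new Map<string, { value: T; expiresAt: number }>();

  constructor(private readonly ttlMs: number) {}

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt < Date.now()) {
      // Lazily evict expired entries on access.
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```

Keyed by a principal (e.g. username plus realm), a few-second TTL would bound how long a revoked privilege could still appear granted.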
At least part of the problem here is that ML calls ... I don't really understand how this code works...
For this specific example we (ML) are only calling ... We reduced our calls to ...
I think both solutions should coexist:
Re caching per request vs. per user... I'd vote per request, because users' permissions might change and we'd need to react to those changes to evict the cache. We don't have hooks to do that (especially in large deployments with N instances of Kibana).
The above was when testing on an 8.6.2 distribution build. But when I looked at ... I hacked in spans for each resolver, which then looks like this:
I don't think it would make a huge difference here, but getting the current space is a fairly frequent operation. So having that cached per request feels like it could be a worthwhile overall performance improvement. In this case it would save one getCurrentSpace roundtrip.
It may be a bit tricky because the global security switcher can alter any entry in the capability map. But we may be able to work around that by defining a priority (disabling a capability prevails over enabling it when we merge the return of all switchers). So, yeah, it could be a quick win indeed. EDIT: looking at the code, I'm not even sure we need that kind of priority logic
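The priority idea above (disabling prevails over enabling) could be expressed as a small merge helper over flattened switcher results. This is a sketch, not the actual merging code; the flattened `path -> boolean` shape is an assumption:

```typescript
// Flattened switcher results: capability path -> enabled flag.
type FlatCapabilities = Record<string, boolean>;

// "Disable wins": when two switcher results disagree on a capability,
// false prevails, so no switcher can re-enable what another disabled.
function mergeDisableWins(results: FlatCapabilities[]): FlatCapabilities {
  const merged: FlatCapabilities = {};
  for (const result of results) {
    for (const [path, value] of Object.entries(result)) {
      merged[path] = path in merged ? merged[path] && value : value;
    }
  }
  return merged;
}
```

Boolean AND over all opinions on a path is exactly the "disabling prevails" rule, and it makes the merge order-independent.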
I opened #152784 that implements said optimization.
So, yeah... it's not that easy. Restricting providers to only being allowed to toggle features from ...
@azasypkin Would it be possible to refactor the security switchers to ensure they can be run in parallel and safely merged?
## Summary

Related to #146881

I tried to fix the issue, but couldn't, so I kept the tiny optimizations I made along the way:

- optimize reduce blocks to re-use the memo instead of spreading into a new object
- update most of the existing resolvers to only return the list of changes rather than the whole capability objects (which was how it was supposed to work)
- minor perf improvement when merging the results of the resolvers
@elastic/kibana-security @elastic/enterprise-search-frontend I think I'm gonna need your help on this one, to try to understand how the switchers interact.

In #152982, I try to optimize the way the switchers are run/applied by calculating the delta of the returned capabilities and applying it to the capabilities afterwards. That allows the switchers to run in parallel. It does, though, have a prerequisite that I thought we had: that switchers are allowed to toggle features, but not to force features to remain the same.

Basically, everything works fine unless two switchers return conflicting switches (e.g. switcher A asks for the feature to be enabled and switcher B asks for the feature to be disabled). Or, to schematize: this scenario shouldn't occur. And I naively thought the assumption was reasonable.

However, looking at the failures of #152982 (like this one), it seems my assumption was false, and that this scenario occurs between the security and enterprise search switchers.

So would you mind helping me to understand:
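To illustrate the prerequisite: if each switcher returns a flattened delta of the capability paths it wants to change, the merge can detect exactly the conflicting scenario described above. This is a sketch with made-up shapes, not the code from #152982:

```typescript
// Flattened delta: capability path -> value the switcher wants to set.
type Delta = Record<string, boolean>;

// Merge deltas from parallel switchers, failing loudly when two switchers
// disagree on the same path (switcher A enables, switcher B disables).
function mergeDeltas(deltas: Delta[]): Delta {
  const merged: Delta = {};
  for (const delta of deltas) {
    for (const [path, value] of Object.entries(delta)) {
      if (path in merged && merged[path] !== value) {
        throw new Error(`conflicting switches for "${path}"`);
      }
      merged[path] = value;
    }
  }
  return merged;
}
```

Under the stated assumption (switchers only toggle, never re-assert), this merge is order-independent; the CI failures suggest the assumption doesn't hold for the security and enterprise search pair.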
The switcher we're talking about: kibana/x-pack/plugins/enterprise_search/server/plugin.ts Lines 169 to 197 in 8705a6a
Hey @pgayvallet, thanks for tagging @elastic/enterprise-search-frontend. I'll do my best to answer your questions.
I'm going to cc @cee-chen here, because she worked on the initial implementation and might have some insight into whether there's a specific reason why the security plugin is setting things to false that we then toggle back on.
@sphilipse @cee-chen FWICT, at least the three calls (...)
Yeah, we can definitely work on making the checks more efficient if that's helpful.
It's been a hot second, but from my vague recollection, adding those capability checks was a requirement from the security team - I think I worked with @legrego pretty closely on that three or so years ago. I'm honestly not a huge fan of the code or how it works or anything - if the requirements have changed, I definitely think the code and its behavior/performance should change as well. FWIW all of Sander's answers were spot on. For question 2, I agree that the security plugin/switcher probably should not be involved in the first place(?) - except if EntSearch is disabled entirely/explicitly via spaces or admin-level configuration. Once it's enabled, however, EntSearch has its own separate set of capabilities outside Kibana that needs to take precedence.
We (@elastic/kibana-security) are looking into the details and history here. Fortunately, I will see @legrego next week and hopefully he can shed some light on this. Before we do anything to change the behavior of the security code, I'd like to get buy-in from the team.
I suspect there is a bug in the security plugin's capability switcher. We should be ignoring features which have opted out of our security controls, but we don't appear to be doing that. I can work with someone on the team next week to take a closer look.
As Larry suspected, the security plugin's capabilities switcher is not making considerations for features that are not governed by our security model. We reviewed how we're handling disabling features, and we should be able to address the issue and resolve the conflict with enterprise search. 👍
PR in progress to unblock #154098
I don't know if it is related to this, but I noticed that the capabilities switcher is not being called when Kibana endpoints are called.
Is that really how it's supposed to work, @elastic/kibana-security?
No. Generally, the ...
…170454)

## Summary

Fix #146881

Introduce the concept of "capability path" to Core's capabilities API, and rely on it to perform various performance optimizations during capabilities resolving.

### API Changes

#### CapabilitiesSetup.registerSwitcher

A new mandatory `capabilityPath` option was added to the API signature. Plugins registering capability switchers must now define the path(s) of capabilities the switcher will impact.

E.g. a live example with the `ml` capabilities switcher that was only mutating `ml.{something}` capabilities:

*Before:*

```ts
coreSetup.capabilities.registerSwitcher(getSwitcher(license$, logger, enabledFeatures));
```

*After:*

```ts
coreSetup.capabilities.registerSwitcher(getSwitcher(license$, logger, enabledFeatures), {
  capabilityPath: 'ml.*',
});
```

#### CapabilitiesStart.resolveCapabilities

The `resolveCapabilities` signature was also changed accordingly, forcing API consumers to specify the path(s) of capabilities they're planning to access.

E.g. for the `ml` plugin's capabilities resolving:

*Before:*

```ts
const capabilities = await this.capabilities.resolveCapabilities(request);
return capabilities.ml as MlCapabilities;
```

*After:*

```ts
const capabilities = await this.capabilities.resolveCapabilities(request, {
  capabilityPath: 'ml.*',
});
return capabilities.ml as MlCapabilities;
```

### Performance optimizations

Knowing which capability path(s) the switchers impact and which capability path(s) the resolver wants to use allows us to optimize the way we chain the resolvers in the `resolveCapabilities` internal implementation:

#### 1. We only apply switchers that may impact the paths the resolver requested

E.g. when requesting the ml capabilities, we now only apply the `ml`, `security` and `spaces` switchers.

#### 2. We can call non-intersecting switchers in parallel

Before, all switchers were executed in sequence. With these changes, we now group non-intersecting switchers to resolve them in parallel.

E.g. the `ml` (`ml.*`) and `fileUpload` (`fileUpload.*`) switchers can be executed and applied in parallel because they're not touching the same set of capabilities.

---------

Co-authored-by: Kibana Machine <[email protected]>
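The grouping of non-intersecting switchers described above could be sketched as a greedy partition over declared capability paths. Names and the prefix-based intersection test are simplifications of my own, not the actual Kibana implementation:

```typescript
interface SwitcherEntry {
  id: string;
  paths: string[]; // e.g. ['ml.*'] or ['*'] for a global switcher
}

// 'ml.*' -> 'ml'; '*' -> '' (the empty prefix intersects everything).
function prefix(pattern: string): string {
  return pattern === '*' ? '' : pattern.replace(/\.\*$/, '');
}

// Two switchers intersect when any of their path prefixes nest in each other.
function intersects(a: SwitcherEntry, b: SwitcherEntry): boolean {
  return a.paths.some((pa) =>
    b.paths.some((pb) => {
      const x = prefix(pa);
      const y = prefix(pb);
      return x.startsWith(y) || y.startsWith(x);
    })
  );
}

// Greedy partition: each batch holds mutually non-intersecting switchers,
// so every batch could be executed with Promise.all while batches run in
// sequence.
function groupNonIntersecting(entries: SwitcherEntry[]): SwitcherEntry[][] {
  const groups: SwitcherEntry[][] = [];
  for (const entry of entries) {
    const fit = groups.find((group) => group.every((m) => !intersects(m, entry)));
    if (fit) fit.push(entry);
    else groups.push([entry]);
  }
  return groups;
}
```

With `ml.*`, `fileUpload.*` and a global `*` switcher, this yields two batches: the first two together, the global one alone.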
Currently, `capabilities.resolveCapabilities` executes all capabilities switchers sequentially. In some cases this is wasteful, as e.g. the ML plugin needs to wait until switchers from Enterprise Search and the file upload plugin have modified the capabilities, even though they don't touch ML capabilities. This can add up, especially on clusters where requests from Kibana to ES take up more than 10ms. E.g. these are three parallel calls to `resolveCapabilities`:

We should consider improving this, e.g. by requiring capabilities switchers to register their dependencies, and asking consumers of `resolveCapabilities` to tell the framework what capabilities they're interested in, so switchers are executed only when necessary.