
Improve efficiency of capabilities.resolveCapabilities #146881

Closed
dgieselaar opened this issue Dec 2, 2022 · 37 comments · Fixed by #170454
Assignees
Labels
performance Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:Security Team focused on: Auth, Users, Roles, Spaces, Audit Logging, and more!

Comments

@dgieselaar
Member

Currently, capabilities.resolveCapabilities executes all capabilities switchers sequentially. In some cases this is wasteful: e.g. the ML plugin needs to wait until switchers from Enterprise Search and the file upload plugin have modified the capabilities, even though they don't touch ML capabilities. This can add up, especially on clusters where requests from Kibana to ES take more than 10ms. E.g. these are three parallel calls to resolveCapabilities:

[screenshot: APM trace of three parallel resolveCapabilities calls]

We should consider improving this, e.g. by requiring capabilities switchers to register their dependencies, and asking consumers of resolveCapabilities to tell the framework what capabilities they're interested in, so switchers are executed only when necessary.

@dgieselaar dgieselaar added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Dec 2, 2022
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@pgayvallet
Contributor

FWIW, our current capabilities switchers:

  • enterprise_search:

```ts
/**
 * Register user access to the Enterprise Search plugins
 */
capabilities.registerSwitcher(async (request: KibanaRequest) => {
  const [, { spaces }] = await getStartServices();
```

  • fileUpload:

```ts
core.capabilities.registerSwitcher(async (request, capabilities, useDefaultCapabilities) => {
  if (useDefaultCapabilities) {
    return capabilities;
  }
```

  • ml:

```ts
coreSetup.capabilities.registerSwitcher(getSwitcher(license$, logger));
```

  • spaces:

```ts
core.capabilities.registerSwitcher(setupCapabilitiesSwitcher(core, getSpacesService, logger));
```

  • security:

```ts
capabilities.registerSwitcher(
  async (request: KibanaRequest, uiCapabilities: UICapabilities) => {
```

@rudolf
Contributor

rudolf commented Dec 2, 2022

As a first step to understanding the impact we should consider adding an APM span for "resolving capabilities" so it's clear from the APM UI why these calls are being made.
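For illustration, with the elastic-apm-node agent, such a span could look roughly like this (a hedged sketch: the wrapper and span name are hypothetical, not Kibana's actual instrumentation):

```ts
import apm from 'elastic-apm-node';

// Hypothetical wrapper: records the whole capabilities resolution as a
// single APM span so it shows up in the trace waterfall.
async function withResolveCapabilitiesSpan<T>(fn: () => Promise<T>): Promise<T> {
  // startSpan returns null when there is no active transaction
  const span = apm.startSpan('resolveCapabilities');
  try {
    return await fn();
  } finally {
    span?.end();
  }
}
```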

@shahzad31
Contributor

We are facing the same issue in the synthetics plugin as well. Bulk operations become very expensive with these calls. I think there must be a way to optimise or cache calls to resolve capabilities with the same payload, example being:

[screenshot: repeated resolveCapabilities calls with identical payloads]

@pgayvallet
Contributor

> Bulk operations become very expensive with these calls.

Sorry, what kind of bulk operations are you referring to that would resolve capabilities per object in the bulk?

@shahzad31
Contributor

> Bulk operations become very expensive with these calls.
>
> Sorry, what kind of bulk operations are you referring to that would resolve capabilities per object in the bulk?

It's just normal ES search calls, let's say if you have to do many in sequence or in parallel. It ends up calling has_privileges before every call.

@pgayvallet
Contributor

> It's just normal ES search calls, let's say if you have to do many in sequence or in parallel. It ends up calling has_privileges before every call.

Hum, I'm probably missing something obvious here, but we're not implicitly resolving capabilities when using the ES client APIs directly. Are you referring to 'bulk' operations using the SO client maybe?

pgayvallet added a commit that referenced this issue Dec 12, 2022
## Summary

Part of #146881

Co-authored-by: Kibana Machine <[email protected]>
@dgieselaar
Member Author

@pgayvallet @rudolf is there a timeline yet? Looking at APM data from the overview cluster the p95 for _has_privileges is 80ms and the p50 is 10ms. For simplicity's sake I assume that calls to get a single doc by id have similar performance characteristics. IMHO this could be a big win in improving how snappy Kibana feels.

@pgayvallet
Contributor

> is there a timeline yet?

There isn't atm.

@rudolf
Contributor

rudolf commented Feb 16, 2023

> We should consider improving this, e.g. by requiring capabilities switchers to register their dependencies, and asking consumers of resolveCapabilities to tell the framework what capabilities they're interested in, so switchers are executed only when necessary.

From what I can tell, switchers are independent, but it's hard to be sure. Having a dependency tree feels like much more complexity than if switchers were independent.

Would it be sufficient if we extended CapabilitiesSetup.registerSwitcher to include a switcher name like:

```ts
registerSwitcher(switcherName: string, switcher: CapabilitiesSwitcher): void;
```

Then CapabilitiesStart.resolveCapabilities could filter the switchers to use:

```ts
resolveCapabilities(request: KibanaRequest, {switchersToUse?: string[]}): Promise<Capabilities>;
```

Could be more powerful if consumers could supply a filter function instead of a static list, but that's maybe over-engineering this.
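A minimal sketch of what that filtering could look like inside the capabilities service (the types and merge logic are simplified stand-ins, not the actual Core code):

```ts
type Capabilities = Record<string, Record<string, boolean>>;
type CapabilitiesSwitcher = (
  request: unknown,
  capabilities: Capabilities
) => Promise<Partial<Capabilities>>;

const switchers = new Map<string, CapabilitiesSwitcher>();

function registerSwitcher(switcherName: string, switcher: CapabilitiesSwitcher): void {
  switchers.set(switcherName, switcher);
}

async function resolveCapabilities(
  request: unknown,
  defaultCapabilities: Capabilities,
  { switchersToUse }: { switchersToUse?: string[] } = {}
): Promise<Capabilities> {
  let capabilities = defaultCapabilities;
  for (const [name, switcher] of switchers) {
    // Skip switchers the caller didn't ask for; run everything when unspecified.
    if (switchersToUse && !switchersToUse.includes(name)) continue;
    // Shallow merge for brevity; a real implementation would merge nested sections.
    capabilities = { ...capabilities, ...(await switcher(request, capabilities)) };
  }
  return capabilities;
}
```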

@rudolf
Contributor

rudolf commented Feb 16, 2023

An alternative might be to use something like our request context to lazily load switchers only when accessed (see #129896). Perhaps this would be the best architecture here, but it feels like a larger refactor.

@azasypkin
Member

> `resolveCapabilities(request: KibanaRequest, {switchersToUse?: string[]}): Promise<Capabilities>;`

I'm slightly concerned that with such an approach we might end up with insecure code if security switchers are omitted for whatever reason (unless we have non-skippable "kernel" and skippable "user-land" switchers, or something along these lines).

I'm wondering if it's possible to design an API in such a way that every registerSwitcher could define which exact capabilities it can influence, and resolveCapabilities could ask for specific caps too. Then we could "automatically" skip switchers that don't influence the requested capabilities (it would be caps: * for security switchers, though).

@rudolf
Contributor

rudolf commented Feb 16, 2023

> I'm wondering if it's possible to design an API in such a way that every registerSwitcher could define which exact capabilities it can influence, and resolveCapabilities could ask for specific caps too. Then we could "automatically" skip switchers that don't influence the requested capabilities (it would be caps: * for security switchers, though).

Yeah, this would be similar to how the request context works. It would require that capabilities be in a namespace like ml.* and that retrieving them is an async operation.

The other problem the request context already solves, IIRC, is that it caches resolved values. So if there are several resolveCapabilities calls, it would only resolve the spaces capabilities once.

@pgayvallet
Contributor

> The other problem the request context already solves, IIRC, is that it caches resolved values. So if there are several resolveCapabilities calls, it would only resolve the spaces capabilities once.

We could eventually think of keeping a cache of the resolved capabilities internally within our service. Note that I don't think it would be the proper way to address the problem, but it may be a pragmatic workaround (way less impactful in terms of changes compared to the other alternatives) if most of the perf cost is caused by the fact that we're calling resolveCapabilities multiple times for some of our handlers. It would require the WeakMap trick we're already doing for other request->stuff internal caching, but I don't think it would technically be problematic?
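For illustration, that WeakMap trick could look something like this (a sketch, not the actual service code; caching the promise also de-duplicates concurrent calls for the same request):

```ts
type Capabilities = Record<string, Record<string, boolean>>;

// Keyed by the request object itself: entries are garbage-collected
// together with the request, so no explicit eviction is needed.
const perRequestCache = new WeakMap<object, Promise<Capabilities>>();

function resolveCapabilitiesCached(
  request: object,
  doResolve: (request: object) => Promise<Capabilities>
): Promise<Capabilities> {
  let pending = perRequestCache.get(request);
  if (!pending) {
    pending = doResolve(request);
    perRequestCache.set(request, pending);
  }
  return pending;
}
```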

We could potentially even go further in terms of caching if we were able to properly extract (and use) the user's principal instead of the request as the cache key. That way, we could re-use capabilities between requests of a given user (and then probably use a time-based, or LRU, eviction cache). But we probably lack the security<->core interactions to do this right now.

@azasypkin @rudolf Do you think that internal cache thing could make sense, though? Like, could it be good enough? Or do we want to focus more on the proper approach around the resolveCapabilities(request, { pathsToResolve? }) API change?

@azasypkin
Member

> if most of the perf cost is caused by the fact that we're calling resolveCapabilities multiple times for some of our handlers

That's my understanding: this, plus the fact that we invoke all resolvers, even those that might not be relevant for the specific context (though that would likely be a smaller concern if we had a cache of some sort).

> But we probably lack the security<->core interactions to do this right now.

Yeah, we've been talking about introducing a notion of user or principal in core that security can populate (instead of the opaque authState) for quite some time already; we might just need to decide and do it at some point :)

> @azasypkin @rudolf Do you think that internal cache thing could make sense, though? Like, could it be good enough?

In theory, I think it could work, assuming the cache is very short-lived (user privileges can change at any time, but if the cache TTL is just a few seconds it shouldn't be a problem). If it's feasible to have a quick PoC to measure the real impact, it'd be much easier to decide how reasonable this trade-off is.
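To make the trade-off concrete, a short-TTL cache keyed by the user's principal could be as simple as the following sketch (the key shape, TTL value, and lack of size-bounded eviction are all assumptions):

```ts
type Capabilities = Record<string, Record<string, boolean>>;

const TTL_MS = 5_000; // a few seconds, so stale privileges are short-lived
const byPrincipal = new Map<string, { value: Promise<Capabilities>; expiresAt: number }>();

function resolveForPrincipal(
  principal: string,
  doResolve: () => Promise<Capabilities>
): Promise<Capabilities> {
  const entry = byPrincipal.get(principal);
  if (entry && entry.expiresAt > Date.now()) {
    return entry.value;
  }
  const value = doResolve();
  // A real implementation would also need size-bounded (e.g. LRU) eviction.
  byPrincipal.set(principal, { value, expiresAt: Date.now() + TTL_MS });
  return value;
}
```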

@rudolf
Contributor

rudolf commented Mar 1, 2023

At least part of the problem here is that ML calls resolveCapabilities three times. Seems like once for each of the calls in:

https://github.com/elastic/kibana/blob/main/x-pack/plugins/apm/server/lib/anomaly_detection/get_ml_jobs_with_apm_group.ts#L32-L42

I don't really understand how this code works...
https://github.com/elastic/kibana/blob/main/x-pack/plugins/ml/server/shared_services/providers/anomaly_detectors.ts#L37-L82

@jgowdyelastic
Member

jgowdyelastic commented Mar 1, 2023

> At least part of the problem here is that ML calls resolveCapabilities three times. Seems like once for each of the calls in:
>
> https://github.com/elastic/kibana/blob/main/x-pack/plugins/apm/server/lib/anomaly_detection/get_ml_jobs_with_apm_group.ts#L32-L42

For this specific example we (ML) are only calling resolveCapabilities once now. We cache the response of resolveCapabilities for the length of the request and these three calls in APM all use the same request object.

We reduced our calls to resolveCapabilities in #146155.

@afharo
Member

afharo commented Mar 2, 2023

> @azasypkin @rudolf Do you think that internal cache thing could make sense, though? Like, could it be good enough? Or do we want to focus more on the proper approach around the resolveCapabilities(request, { pathsToResolve? }) API change?

I think both solutions should coexist:

  • Paths to resolve would avoid making useless calls to resolve capabilities that are not required in that context.
  • However, as mentioned earlier, security applies to all cap resolutions, so caching the resolutions for the duration of the request might be beneficial.

Re caching per request vs. per user... I'd vote per request, because users' permissions might change and we'd need to react to those changes to evict the cache. We don't have hooks to do that (especially in large deployments with N instances of Kibana).

@rudolf
Contributor

rudolf commented Mar 2, 2023

The above was when testing on an 8.6.2 distribution build. But when I looked at main, it seems like it's better and there's just one resolveCapabilities call. (UPDATE: didn't see @jgowdyelastic's comment, glad to see it was solved in #146155 ❤️)

I hacked in spans for each resolver which then looks like this:
[screenshot: APM trace with a span for each capabilities switcher]

So unless I need to add some data to cause the three resolveCapabilities calls, it seems like caching won't help for this particular API. One quick win is to resolve in parallel instead of in series (assuming they are completely independent). This would still cause unnecessary load, but at least it won't impact the response time of the route when the server isn't under stress.

@rudolf
Contributor

rudolf commented Mar 2, 2023

I don't think it would make a huge difference here, but getting the current space is a fairly frequent operation. So having that cached per request feels like it could be a worthwhile overall performance improvement. In this case it would save one getCurrentSpace roundtrip.

@pgayvallet
Contributor

pgayvallet commented Mar 7, 2023

> One quick win is to resolve in parallel instead of in series (assuming they are completely independent)

It may be a bit tricky, because the global security switcher can alter any entry in the capability map. But we may be able to work around that by defining a priority (disabling a capability prevails over enabling it, so when we merge the returns of all switchers, false values always override true ones).

So, yeah, it could be a quick win indeed.

EDIT: looking at the code, I'm not even sure we need that kind of priority logic.
EDIT2: actually we do, because some switchers (security and spaces) return the whole capabilities object, so true values can override the ones set to false by another switcher.
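To make the priority idea concrete, here is a sketch of a merge where disabling always wins, assuming every switcher returns only the entries it changed (illustrative only; as the following comments show, the switchers didn't all satisfy that assumption at the time):

```ts
type CapabilityDelta = Record<string, Record<string, boolean>>;

// Order-independent merge: once any switcher sets a capability to false,
// no other switcher's true value can re-enable it.
function mergeDeltas(deltas: CapabilityDelta[]): CapabilityDelta {
  const merged: CapabilityDelta = {};
  for (const delta of deltas) {
    for (const [section, caps] of Object.entries(delta)) {
      merged[section] = merged[section] ?? {};
      for (const [name, enabled] of Object.entries(caps)) {
        merged[section][name] = merged[section][name] === false ? false : enabled;
      }
    }
  }
  return merged;
}
```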

@pgayvallet
Contributor

I opened #152784, which implements the optimization described above.

@pgayvallet
Contributor

pgayvallet commented Mar 7, 2023

So, yeah... it's not that easy.

Restricting switchers to only toggling features from on to off doesn't work, because some of them apparently do it the other way around. And allowing toggling in both directions doesn't work if switchers run in parallel, as the merge of their outputs has no way of knowing which order to apply them in (e.g. switcher A disables a feature, then security returns the whole map with the feature enabled, so the disabling is ignored at the end of the merge...).

@rudolf
Contributor

rudolf commented Mar 8, 2023

@azasypkin Would it be possible to refactor the security switchers to ensure they can be run in parallel and safely merged?

pgayvallet added a commit that referenced this issue Mar 9, 2023
## Summary

Related to #146881

I tried to fix the issue, but couldn't, so I kept the tiny optimizations I made along the way:
- optimize reduce blocks to re-use the memo instead of spreading into a new object
- update most of the existing resolvers to only return the list of changes rather than the whole capabilities object (which was how it was supposed to work) - a minor perf improvement when merging the results of the resolvers
@pgayvallet
Contributor

pgayvallet commented Mar 9, 2023

@elastic/kibana-security @elastic/enterprise-search-frontend I think I'm going to need your help on this one, to try to understand how the enterprise_search capabilities switcher works.

In #152982, I try to optimize the way the switchers are run/applied by calculating the delta of the returned capabilities and applying it afterwards. This allows running the switchers in parallel.

It does, though, have a prerequisite I thought we had: switchers are allowed to toggle features, but not to force features to remain the same. Basically, everything works fine unless two switchers return conflicting switches (e.g. switcher A asks for the feature to be enabled and switcher B asks for it to be disabled).

Or, to schematize: this scenario shouldn't occur:

[diagram: two switchers returning conflicting values for the same capability]

And I naively thought the assumption was reasonable.

However, looking at the failures of #152982 (like this one), it seems my assumption was false, and that this scenario occurs between the security switcher and the enterpriseSearch one.

So would you mind helping me to understand:

  1. Why does enterprise search need some specific logic to enable/disable the navLinks and catalog entries, when most (all for navlinks?) other applications are just delegating to the security and spaces global switchers?

  2. Why is there a conflict between the returns of the enterprise search and security switchers? From my understanding, the security plugin is disabling these navLinks, but then the enterprise search switcher is re-enabling them. If the toggling of these navLinks should only be handled by the enterpriseSearch switcher, then why is the security one toggling them off in the first place?

The switcher we're talking about:

```ts
capabilities.registerSwitcher(async (request: KibanaRequest) => {
  const [, { spaces }] = await getStartServices();
  const dependencies = { config, security, spaces, request, log, ml };
  const { hasAppSearchAccess, hasWorkplaceSearchAccess } = await checkAccess(dependencies);
  const showEnterpriseSearch = hasAppSearchAccess || hasWorkplaceSearchAccess;
  return {
    navLinks: {
      enterpriseSearch: showEnterpriseSearch,
      enterpriseSearchContent: showEnterpriseSearch,
      enterpriseSearchAnalytics: showEnterpriseSearch,
      elasticsearch: showEnterpriseSearch,
      appSearch: hasAppSearchAccess,
      workplaceSearch: hasWorkplaceSearchAccess,
      searchExperiences: showEnterpriseSearch,
    },
    catalogue: {
      enterpriseSearch: showEnterpriseSearch,
      enterpriseSearchContent: showEnterpriseSearch,
      enterpriseSearchAnalytics: showEnterpriseSearch,
      elasticsearch: showEnterpriseSearch,
      appSearch: hasAppSearchAccess,
      workplaceSearch: hasWorkplaceSearchAccess,
      searchExperiences: showEnterpriseSearch,
    },
  };
});
```

@sphilipse
Member

Hey @pgayvallet, thanks for tagging @elastic/enterprise-search-frontend. I'll do my best to answer your questions.

  1. Has a simple answer: we need to call the Enterprise Search application that runs separately from Kibana and Elasticsearch to determine whether we have access to its specific capabilities (and which capabilities). We can't delegate that to the security plugin, because the security plugin doesn't (and probably shouldn't) have access to that external application.
  2. I don't have a good answer here! I imagine that setting all of these to false was a good default, in case the Enterprise Search plugin failed for some reason. But the security plugin probably shouldn't be touching these navLinks in the first place. Is only setting those navLinks in the Enterprise Search switcher and not in the security switcher a viable path for you? I would want to do a little bit of testing, but I don't expect that to break anything for Enterprise Search.

I'm going to cc @cee-chen here, because she worked on the initial implementation and might have some insight into whether there's a specific reason why the security plugin is setting things to false that we then toggle back on.

@dgieselaar
Member Author

@sphilipse @cee-chen FWICT, at least the three calls (callEnterpriseSearchConfigAPI, isSuperUser and getActiveSpace) in the Enterprise Search switcher can be parallelised.
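For instance, something along these lines (a sketch only; the real signatures of the three checks in the Enterprise Search plugin may differ):

```ts
// Hypothetical signatures for the three independent checks named above:
declare function callEnterpriseSearchConfigAPI(deps: unknown): Promise<unknown>;
declare function isSuperUser(security: unknown, request: unknown): Promise<boolean>;
declare function getActiveSpace(spaces: unknown, request: unknown): Promise<unknown>;
declare const dependencies: unknown, security: unknown, spaces: unknown, request: unknown;

async function checkAccessInParallel() {
  // The three checks don't depend on each other's results,
  // so they can run concurrently instead of one after another.
  const [configData, superUser, activeSpace] = await Promise.all([
    callEnterpriseSearchConfigAPI(dependencies),
    isSuperUser(security, request),
    getActiveSpace(spaces, request),
  ]);
  return { configData, superUser, activeSpace };
}
```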

@sphilipse
Member

Yeah, we can definitely work on making the checks more efficient if that's helpful.

@cee-chen
Contributor

cee-chen commented Mar 9, 2023

> I'm going to cc @cee-chen here, because she worked on the initial implementation and might have some insight into whether there's a specific reason why the security plugin is setting things to false that we then toggle back on.

It's been a hot second, but from my vague recollection, adding those capability checks was a requirement from the security team - I think I worked with @legrego pretty closely on that three or so years ago. I'm honestly not a huge fan of the code or how it works - if the requirements have changed, I definitely think the code and its behavior/performance should change as well.

FWIW, all of Sander's answers were spot on. For question 2, I agree that the security plugin/switcher probably should not be involved in the first place(?) - except if EntSearch is disabled entirely/explicitly via spaces or admin-level configuration. Once it's enabled, however, EntSearch has its own separate set of capabilities outside Kibana that needs to take precedence.

@jeramysoucy
Contributor

We (@elastic/kibana-security ) are looking into the details and history here. Fortunately, I will see @legrego next week and hopefully he can shed some light on this. Before we do anything to change the behavior of the security code, I'd like to get buy-in from the team.

bmorelli25 pushed a commit to bmorelli25/kibana that referenced this issue Mar 10, 2023
@legrego
Member

legrego commented Mar 17, 2023

> Why is there a conflict between the returns of the enterprise search and security switchers? From my understanding, the security plugin is disabling these navLinks, but then the enterprise search switcher is re-enabling them. If the toggling of these navLinks should only be handled by the enterpriseSearch switcher, then why is the security one toggling them off in the first place?

I suspect there is a bug in the security plugin's capability switcher. We should be ignoring features which have opted out of our security controls, but we don't appear to be doing that. I can work with someone on the team next week to take a closer look.

@rudolf rudolf added the Team:Security Team focused on: Auth, Users, Roles, Spaces, Audit Logging, and more! label Mar 21, 2023
@jeramysoucy
Contributor

As Larry suspected, the security plugin's capabilities switcher does not account for features that are not governed by our security model. We reviewed how we're handling disabling features, and we should be able to address the issue and resolve the conflict with enterprise search. 👍

@jeramysoucy
Contributor

PR in progress to unblock: #154098

@pgayvallet
Contributor

I rebased #152982 now that #154098 has been merged, and we're still getting failures on the ML FTR capabilities tests. We'll need to investigate more to go further.

I'll unassign myself from the issue given we're not planning on working on it right now.

@pgayvallet pgayvallet removed their assignment Sep 19, 2023
@jgowdyelastic
Member

I don't know if it's related to this, but I've noticed that the capabilities switcher is not being called when Kibana endpoints are called.
ML's routes all have access tags on them, which should be performing capabilities checks.
If I call an endpoint from an external REST client that should be blocked because of logic in the switcher, I'm still able to access the route, because the switcher isn't being used.

@pgayvallet
Contributor

> ML's routes all have access tags on them, which should be performing capabilities checks.

Is that really how it's supposed to work @elastic/kibana-security ?

@legrego
Member

legrego commented Sep 19, 2023

> ML's routes all have access tags on them, which should be performing capabilities checks.
>
> Is that really how it's supposed to work @elastic/kibana-security ?

No. Generally, the access tags will automatically perform authorization checks, but that does not require us to resolve capabilities.

@pgayvallet pgayvallet self-assigned this Nov 2, 2023
pgayvallet added a commit that referenced this issue Nov 9, 2023
…170454)

## Summary

Fix #146881

Introduce the concept of "capability path" to Core's capabilities API,
and rely on it to perform various performance optimization during
capabilities resolving.

### API Changes

#### CapabilitiesSetup.registerSwitcher

A new mandatory `capabilityPath` option was added to the API signature. 

Plugins registering capability switchers must now define the path(s) of
capabilities the switcher will impact.

E.g. a live example with the `ml` capabilities switcher, which was only mutating `ml.{something}` capabilities:

*Before:*

```ts
coreSetup.capabilities.registerSwitcher(getSwitcher(license$, logger, enabledFeatures));
```

*After:*

```ts
  coreSetup.capabilities.registerSwitcher(getSwitcher(license$, logger, enabledFeatures), {
    capabilityPath: 'ml.*',
  });
```

#### CapabilitiesStart.resolveCapabilities

The `resolveCapabilities` API was also changed accordingly, forcing API consumers to specify the path(s) of capabilities they're planning to access.
access.

E.g. for the `ml` plugin's capabilities resolving:

*Before:*

```ts
const capabilities = await this.capabilities.resolveCapabilities(request);
return capabilities.ml as MlCapabilities;
```

*After:*

```ts
const capabilities = await this.capabilities.resolveCapabilities(request, {
   capabilityPath: 'ml.*',
});
return capabilities.ml as MlCapabilities;
```

### Performance optimizations

Knowing which capability path(s) the switchers are impacting and which capability path(s) the resolver wants to use allows us to optimize the way we're chaining the resolvers during the `resolveCapabilities` internal implementation:

#### 1. We only apply switchers that may impact the paths the resolver requested

E.g. when requesting the ml capabilities, we now only apply the `ml`, `security` and `spaces` switchers.

#### 2. We can call non-intersecting switchers in parallel 

Before, all switchers were executed in sequence. With these changes, we
now group non-intersecting switchers to resolve them in parallel.

E.g. the `ml` (`ml.*`) and `fileUpload` (`fileUpload.*`) switchers can be executed and applied in parallel because they're not touching the same set of capabilities.
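To illustrate the kind of path matching this enables, a simplified sketch (not the actual Core implementation):

```ts
// Does a switcher registered for `switcherPath` potentially affect a
// requested `capabilityPath`? A '*' segment matches everything below it.
function pathsIntersect(switcherPath: string, capabilityPath: string): boolean {
  const a = switcherPath.split('.');
  const b = capabilityPath.split('.');
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    if (a[i] === '*' || b[i] === '*') return true;
    if (a[i] !== b[i]) return false;
  }
  return true;
}

pathsIntersect('ml.*', 'ml.*');         // true  -> run the ml switcher
pathsIntersect('fileUpload.*', 'ml.*'); // false -> skip fileUpload when resolving ml
```

Switchers whose registered paths don't intersect can then be grouped and resolved in parallel.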

---------

Co-authored-by: Kibana Machine <[email protected]>