
Authorization for Internal Requests like Stats or Rollover is Slow in Large Clusters #79632

Closed
Tracked by #77466
original-brownbear opened this issue Oct 21, 2021 · 13 comments
Labels
>bug · :Security/Authorization (Roles, Privileges, DLS/FLS, RBAC/ABAC) · Team:Security (Meta label for security team)

Comments


original-brownbear commented Oct 21, 2021

This is a bit of a sub-issue of #67987, but I wonder if we can take similar shortcuts here to the ones we took for, e.g., local field caps requests.

For stats we have a lot of slow logging like this:

[2021-10-21T13:47:35,324][INFO ][o.e.x.s.a.AuthorizationService] [elasticsearch-8] Took [51ms] to resolve [42493] indices for action [indices:monitor/stats] and user [esbench]
[2021-10-21T13:47:35,956][INFO ][o.e.x.s.a.AuthorizationService] [elasticsearch-8] Took [52ms] to resolve [42493] indices for action [indices:monitor/stats] and user [esbench]
[2021-10-21T13:47:36,737][INFO ][o.e.x.s.a.AuthorizationService] [elasticsearch-8] Took [72ms] to resolve [42494] indices for action [indices:monitor/stats] and user [esbench]
[2021-10-21T13:47:37,199][INFO ][o.e.x.s.a.AuthorizationService] [elasticsearch-8] Took [81ms] to resolve [42494] indices for action [indices:monitor/stats] and user [esbench]
[2021-10-21T13:47:38,208][INFO ][o.e.x.s.a.AuthorizationService] [elasticsearch-8] Took [56ms] to resolve [42494] indices for action [indices:monitor/stats] and user [esbench]

on the master node once monitoring asks for stats. We do not have similar logging for field caps requests, so something must be different here in how the fan-out works. Could we short-circuit it here as well for a quick win and more stability on transport threads?

We also see the same thing for settings updates (these are node-local requests from ILM):

[2021-10-21T10:45:23,537][INFO ][o.e.x.s.a.AuthorizationService] [elasticsearch-8] Took [96ms] to resolve [41923] indices for action [indices:admin/settings/update] and user [esbench]

Also, we see non-trivial CPU going into the internal data stream (DS) rollover requests sent by ILM, which are all node-local as well.

Can we possibly have more quick wins here for this class of requests @ywangd?

original-brownbear added the >bug and :Security/Authorization labels on Oct 21, 2021
elasticmachine added the Team:Security label on Oct 21, 2021
@elasticmachine

Pinging @elastic/es-security (Team:Security)


ywangd commented Oct 25, 2021

How were these index monitoring requests issued? Is it one request per index? The slow log is for loadAuthorizedIndices, whose cost mostly depends on the total number of indices in the cluster. So it takes the same time for requests targeting * or a_single_index. I suspect the multiple slow log messages you observed are because multiple requests were issued, each for a single index.
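
(To illustrate that point, here is a minimal sketch of why the cost tracks the total index count rather than the request. This is simplified pseudocode in the spirit of loadAuthorizedIndices, not the actual Elasticsearch implementation; the method and parameter names are made up for this example.)

    // Illustrative sketch only, not the real loadAuthorizedIndices: every index and
    // alias name known to the cluster is tested against the user's index privileges,
    // so the work is proportional to the ~42k indices in this cluster no matter how
    // many indices the incoming request actually names.
    List<String> loadAuthorizedIndicesSketch(Predicate<String> indexPrivilege, Set<String> allIndicesAndAliases) {
        List<String> authorized = new ArrayList<>();
        for (String name : allIndicesAndAliases) {
            if (indexPrivilege.test(name)) {
                authorized.add(name);
            }
        }
        return authorized; // same cost whether the request targeted "*" or a single index
    }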

The child action for indices:monitor/stats is indices:monitor/stats[n], which should already be short-circuited (e.g. it didn't appear in the slow logs). The parent indices:monitor/stats action seems to be triggered only by REST-level requests. So if there were indeed multiple REST-level requests, this is the REST-level congestion problem we discussed previously, while the short-circuit is mostly for transport-level requests. Before we try to figure out a remediation, it would be great if you could confirm (or dismiss) that this is indeed the REST-level congestion issue.

Also happy to chat live if it helps.


ywangd commented Oct 25, 2021

I'd like to clarify what I actually mean by "REST level": it does not have to be an actual REST API call. It can be a request sent at the transport level directly using the NodeClient. The point is that the request leads to an IndicesStatsRequest object being created and subsequently authorized, and there are many of them. I hope that makes sense.
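
(For concreteness, a hedged sketch of the per-index pattern being described. The surrounding names — managedIndices, onStats, logger, client — are hypothetical; prepareStats and ActionListener.wrap are assumed to be the usual client APIs.)

    // Hypothetical illustration (not the actual ILM code): one stats request per
    // managed index, each submitted through the NodeClient and therefore each
    // authorized independently by the security action filter.
    for (String index : managedIndices) {
        client.admin().indices().prepareStats(index).execute(ActionListener.wrap(
            response -> onStats(index, response),
            e -> logger.warn("stats request failed for [" + index + "]", e)
        ));
    }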


original-brownbear commented Oct 27, 2021

Sorry for the delay here @ywangd. That explanation makes perfect sense and agrees with the profiling, which shows the node client calls coming out of ILM.

[image: profiling screenshot showing the node client stats calls originating from ILM]

We can't short-circuit this code path somehow, can we? I suppose if not, the proper fix would be to move away from using the node client and invoke the actions directly via the transport service (or some other means that would obviously fix things, I guess :))?


ywangd commented Oct 29, 2021

We can't short-circuit this code path somehow, can we? I suppose if not, the proper fix would be to move away from using the node client and invoke the actions directly via the transport service (or some other means that would obviously fix things, I guess :))?

Unfortunately, I don't think using the transportService directly is going to make a difference here, because TransportIndicesStatsAction is a HandledTransportAction and its handler calls into TransportAction#execute, which in turn invokes the securityActionFilter. So the same loadAuthorizedIndices method will run. There is no easy win that I can think of here. Could you please provide a bit more detail on how I can replicate this issue? It does not have to be at the scale that you are experiencing; just a simple setup that lets me see the whole workflow and the complete sequence of calls. Maybe I can help identify something to work on after examining the complete context more closely.
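
(As a toy model of that point — simplified, not the actual Elasticsearch classes — the handler just delegates into execute(), and execute() wraps the action body in the registered action filters, the security filter among them, so skipping the NodeClient does not skip authorization.)

    // Toy model, not real Elasticsearch code: execute() runs every registered
    // action filter around the action body, so authorization happens regardless
    // of how the action was invoked.
    interface Filter { void apply(String action, Runnable proceed); }

    void execute(String action, List<Filter> filters, Runnable actionBody) {
        Runnable chain = actionBody;
        for (int i = filters.size() - 1; i >= 0; i--) {
            Filter filter = filters.get(i);
            Runnable next = chain;
            chain = () -> filter.apply(action, next); // security filter ends up outermost
        }
        chain.run();
    }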


ywangd commented Oct 31, 2021

One thing I am trying to find out is what indices are specified for these requests. Do they have wildcards? If they do not contain wildcards, we should technically be able to short-circuit the step of loading authorized indices (which is what the slow logs are about).
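
(A minimal sketch of the kind of check this implies — hypothetical, not what any particular PR does. It assumes Regex.isSimpleMatchPattern from org.elasticsearch.common.regex as the existing wildcard test.)

    // Hypothetical sketch: only fall back to the expensive authorized-indices
    // computation when the request could match more than the names it lists.
    private boolean containsWildcardOrAll(String[] requestedIndices) {
        if (requestedIndices == null || requestedIndices.length == 0) {
            return true; // empty conventionally means "all indices"
        }
        for (String index : requestedIndices) {
            if ("_all".equals(index) || Regex.isSimpleMatchPattern(index)) {
                return true;
            }
        }
        return false;
    }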

@original-brownbear

Sorry for the massive delay here @ywangd. The problem with these requests still remains in master: we execute one request per single index that ILM wants to get stats for. This causes thousands of individual stats requests that each run this expensive authorization for a single index.
I guess the problem is that these requests are all org.elasticsearch.action.IndicesRequest.Replaceable, which means they always take the wildcard-expansion path in this check:

    private boolean requiresWildcardExpansion(IndicesRequest indicesRequest) {
        // IndicesAliasesRequest requires special handling because it can have wildcards in request body
        if (indicesRequest instanceof IndicesAliasesRequest) {
            return true;
        }
        // Replaceable requests always require wildcard expansion
        if (indicesRequest instanceof IndicesRequest.Replaceable) {
            return true;
        }
        return false;
    }

even though in this specific case they are actually made for just a single concrete index. Is there any easy fix available to us here? It seems as though stats requests are effectively the last slow authz requests left in our benchmarking, but they hurt quite a bit due to ILM's extensive use of them.
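
(For what it's worth, a hedged sketch of the kind of refinement being asked about here — illustrative only, not the change that was ultimately made — reusing the hypothetical containsWildcardOrAll helper sketched a few comments up.)

    // Illustrative sketch only: keep the type-based special cases, but let a
    // Replaceable request that names only concrete indices skip the
    // wildcard-expansion (and thus authorized-indices) path.
    private boolean requiresWildcardExpansion(IndicesRequest indicesRequest) {
        // IndicesAliasesRequest requires special handling because it can have wildcards in request body
        if (indicesRequest instanceof IndicesAliasesRequest) {
            return true;
        }
        if (indicesRequest instanceof IndicesRequest.Replaceable) {
            return containsWildcardOrAll(indicesRequest.indices());
        }
        return false;
    }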

Thanks!


ywangd commented Jan 18, 2022

I have a draft PR, #81237, for avoiding loadAuthorizedIndices if the requested indices do not contain wildcards. The PR is promising and just needs some polish. Which version do you want this change to be part of: 8.1? 8.0? 7.17?

@original-brownbear

Thanks @ywangd, I think 8.1 is just fine here. We have other issues around the stats calls that gate them in spots unrelated to authz, and those fixes will probably only land in 8.1. If it's trivial to land this in 8.0 it's still a win even without those, but it's not worth rushing anything.


ywangd commented Jan 25, 2022

@original-brownbear Is it possible to have a benchmark for the in-progress PR #81237? It would be great to have concrete benchmark results to support its review and approval. Thanks!

@original-brownbear

I'm on it @ywangd! I will try to get you results today.


ywangd commented Jun 14, 2022

@original-brownbear Should this be closed since the PR (#81237) was merged a while back?

@original-brownbear

Right, thanks for the ping Yang!
