
Authorization for Internal Requests like Stats or Rollover is Slow in Large Clusters #79632

Closed
Tracked by #77466
original-brownbear opened this issue Oct 21, 2021 · 13 comments
Labels
>bug · :Security/Authorization (Roles, Privileges, DLS/FLS, RBAC/ABAC) · Team:Security (Meta label for security team)

Comments


original-brownbear commented Oct 21, 2021

This is a bit of a sub-issue of #67987, but I wonder if we can take similar shortcuts here to the ones we took for, e.g., local field caps requests.

For stats we have a lot of slow logging like this:

[2021-10-21T13:47:35,324][INFO ][o.e.x.s.a.AuthorizationService] [elasticsearch-8] Took [51ms] to resolve [42493] indices for action [indices:monitor/stats] and user [esbench]
[2021-10-21T13:47:35,956][INFO ][o.e.x.s.a.AuthorizationService] [elasticsearch-8] Took [52ms] to resolve [42493] indices for action [indices:monitor/stats] and user [esbench]
[2021-10-21T13:47:36,737][INFO ][o.e.x.s.a.AuthorizationService] [elasticsearch-8] Took [72ms] to resolve [42494] indices for action [indices:monitor/stats] and user [esbench]
[2021-10-21T13:47:37,199][INFO ][o.e.x.s.a.AuthorizationService] [elasticsearch-8] Took [81ms] to resolve [42494] indices for action [indices:monitor/stats] and user [esbench]
[2021-10-21T13:47:38,208][INFO ][o.e.x.s.a.AuthorizationService] [elasticsearch-8] Took [56ms] to resolve [42494] indices for action [indices:monitor/stats] and user [esbench]

on the master node once monitoring asks for stats. We do not have similar logging for field caps requests, so something must be different here in how the fan-out works. Could we short-circuit it here as well for a quick win and more stability on transport threads?

We also see the same thing for settings updates (these are node-local requests from ILM):

[2021-10-21T10:45:23,537][INFO ][o.e.x.s.a.AuthorizationService] [elasticsearch-8] Took [96ms] to resolve [41923] indices for action [indices:admin/settings/update] and user [esbench]

Also, we see non-trivial CPU going into the internal data stream (DS) rollover requests sent by ILM, which are all node-local as well.

Can we possibly have more quick wins here for this class of requests @ywangd?

original-brownbear added the >bug and :Security/Authorization labels on Oct 21, 2021
elasticmachine added the Team:Security label on Oct 21, 2021
@elasticmachine

Pinging @elastic/es-security (Team:Security)


ywangd commented Oct 25, 2021

How were these index monitoring requests issued? Is it one request per index? The slow log is for loadAuthorizedIndices, whose cost mostly depends on the total number of indices in the cluster. So it takes the same time for requests targeting * or a_single_index. I suspect the multiple slow log messages you observed are because multiple requests were issued, each for a single index.
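
(To illustrate that point, here is a minimal sketch of why the cost tracks the total index count rather than the request. This is simplified pseudocode in the spirit of loadAuthorizedIndices, not the actual Elasticsearch implementation; the method and parameter names are made up for this example.)

    // Illustrative sketch only, not the real loadAuthorizedIndices: every index and
    // alias name known to the cluster is tested against the user's index privileges,
    // so the work is proportional to the ~42k indices in this cluster no matter how
    // many indices the incoming request actually names.
    List<String> loadAuthorizedIndicesSketch(Predicate<String> indexPrivilege, Set<String> allIndicesAndAliases) {
        List<String> authorized = new ArrayList<>();
        for (String name : allIndicesAndAliases) {
            if (indexPrivilege.test(name)) {
                authorized.add(name);
            }
        }
        return authorized; // same cost whether the request targeted "*" or a single index
    }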

The child action for indices:monitor/stats is indices:monitor/stats[n], which should already be short-circuited (e.g. it didn't appear in the slow logs). The parent indices:monitor/stats action seems to be triggered only by REST-level requests. So if there were indeed multiple REST-level requests, this is the REST-level congestion problem we discussed previously, while the short-circuit is mostly for transport-level requests. Before we try to figure out a remediation, it would be great if you could confirm (or dismiss) that this is indeed the REST-level congestion issue.

Also happy to chat live if it helps.


ywangd commented Oct 25, 2021

I'd like to clarify what I actually mean by "REST level": it does not have to be an actual REST API call. It can be a request sent at the transport level directly using the NodeClient. The point is that the request leads to an IndicesStatsRequest object being created and subsequently authorized, and there are many of them. I hope that makes sense.
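
(For concreteness, a hedged sketch of the per-index pattern being described. The surrounding names — managedIndices, onStats, logger, client — are hypothetical; prepareStats and ActionListener.wrap are assumed to be the usual client APIs.)

    // Hypothetical illustration (not the actual ILM code): one stats request per
    // managed index, each submitted through the NodeClient and therefore each
    // authorized independently by the security action filter.
    for (String index : managedIndices) {
        client.admin().indices().prepareStats(index).execute(ActionListener.wrap(
            response -> onStats(index, response),
            e -> logger.warn("stats request failed for [" + index + "]", e)
        ));
    }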


original-brownbear commented Oct 27, 2021

Sorry for the delay here @ywangd. That explanation makes perfect sense and agrees with the profiling, which shows the node client calls coming out of ILM.

[image: profiling screenshot showing the node client stats calls originating from ILM]

We can't short-circuit this code path somehow, can we? I suppose if not, the proper fix would be to move away from using the node client and invoke the actions directly via the transport service (or some other means that would obviously fix things, I guess :))?


ywangd commented Oct 29, 2021

We can't short-circuit this code path somehow, can we? I suppose if not, the proper fix would be to move away from using the node client and invoke the actions directly via the transport service (or some other means that would obviously fix things, I guess :))?

Unfortunately, I don't think using the transportService directly is going to make a difference here, because TransportIndicesStatsAction is a HandledTransportAction and its handler calls into TransportAction#execute, which in turn invokes the securityActionFilter. So the same loadAuthorizedIndices method will run. There is no easy win that I can think of here. Could you please provide a bit more detail on how I can replicate this issue? It does not have to be at the scale that you are experiencing; just a simple setup that lets me see the whole workflow and the complete sequence of calls. Maybe I can help identify something to work on after examining the complete context more closely.
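
(As a toy model of that point — simplified, not the actual Elasticsearch classes — the handler just delegates into execute(), and execute() wraps the action body in the registered action filters, the security filter among them, so skipping the NodeClient does not skip authorization.)

    // Toy model, not real Elasticsearch code: execute() runs every registered
    // action filter around the action body, so authorization happens regardless
    // of how the action was invoked.
    interface Filter { void apply(String action, Runnable proceed); }

    void execute(String action, List<Filter> filters, Runnable actionBody) {
        Runnable chain = actionBody;
        for (int i = filters.size() - 1; i >= 0; i--) {
            Filter filter = filters.get(i);
            Runnable next = chain;
            chain = () -> filter.apply(action, next); // security filter ends up outermost
        }
        chain.run();
    }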


ywangd commented Oct 31, 2021

One thing I am trying to find out is what indices are specified for these requests. Do they have wildcards? If they do not contain wildcards, we should technically be able to short-circuit the step of loading authorized indices (which is what the slow logs are about).
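
(A minimal sketch of the kind of check this implies — hypothetical, not what any particular PR does. It assumes Regex.isSimpleMatchPattern from org.elasticsearch.common.regex as the existing wildcard test.)

    // Hypothetical sketch: only fall back to the expensive authorized-indices
    // computation when the request could match more than the names it lists.
    private boolean containsWildcardOrAll(String[] requestedIndices) {
        if (requestedIndices == null || requestedIndices.length == 0) {
            return true; // empty conventionally means "all indices"
        }
        for (String index : requestedIndices) {
            if ("_all".equals(index) || Regex.isSimpleMatchPattern(index)) {
                return true;
            }
        }
        return false;
    }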

@original-brownbear

Sorry for the massive delay here @ywangd. The problem with these requests still remains in master: we execute one request per single index that ILM wants to get stats for. This causes thousands of individual stats requests that each run this expensive authorization for a single index.
I guess the problem is that these requests are all org.elasticsearch.action.IndicesRequest.Replaceable, which means they always take the wildcard-expansion path in this check:

    private boolean requiresWildcardExpansion(IndicesRequest indicesRequest) {
        // IndicesAliasesRequest requires special handling because it can have wildcards in request body
        if (indicesRequest instanceof IndicesAliasesRequest) {
            return true;
        }
        // Replaceable requests always require wildcard expansion
        if (indicesRequest instanceof IndicesRequest.Replaceable) {
            return true;
        }
        return false;
    }

even though in this specific case they are actually made for just a single concrete index. Is there any easy fix available to us here? It seems as though stats requests are effectively the last slow authz requests left in our benchmarking, but they hurt quite a bit due to ILM's extensive use of them.
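
(For what it's worth, a hedged sketch of the kind of refinement being asked about here — illustrative only, not the change that was ultimately made — reusing the hypothetical containsWildcardOrAll helper sketched a few comments up.)

    // Illustrative sketch only: keep the type-based special cases, but let a
    // Replaceable request that names only concrete indices skip the
    // wildcard-expansion (and thus authorized-indices) path.
    private boolean requiresWildcardExpansion(IndicesRequest indicesRequest) {
        // IndicesAliasesRequest requires special handling because it can have wildcards in request body
        if (indicesRequest instanceof IndicesAliasesRequest) {
            return true;
        }
        if (indicesRequest instanceof IndicesRequest.Replaceable) {
            return containsWildcardOrAll(indicesRequest.indices());
        }
        return false;
    }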

Thanks!


ywangd commented Jan 18, 2022

I have a draft PR, #81237, for avoiding loadAuthorizedIndices if the requested indices do not contain wildcards. The PR is promising and just needs some polish. Which version do you want this change to be part of: 8.1? 8.0? 7.17?

@original-brownbear

Thanks @ywangd, I think 8.1 is just fine here. We have other issues around the stats calls that gate them in spots unrelated to authz, and those fixes will probably only land in 8.1. If it's trivial to land this in 8.0 it's still a win even without those, but it's not worth rushing anything.


ywangd commented Jan 25, 2022

@original-brownbear Is it possible to have a benchmark for the in-progress PR #81237? It would be great to have concrete benchmark results to support its review and approval. Thanks!

@original-brownbear

I'm on it @ywangd! I will try to get you results today.


ywangd commented Jun 14, 2022

@original-brownbear Should this be closed since the PR (#81237) was merged a while back?

@original-brownbear

Right, thanks for the ping Yang!
