Authorization for Internal Requests like Stats or Rollover is Slow in Large Clusters #79632
Comments
Pinging @elastic/es-security (Team:Security) |
How were these indices monitoring requests issued? Is it one request per index? The slow log is for The child action for Also happy to chat lively if it helps. |
I'd like to clarify what I actually mean by "REST level": It does not have to be an actual REST API call. It can be a request sent at the transport level directly using |
Sorry for the delay here @ywangd. That explanation makes perfect sense and agrees with the profiling, which shows the node client calls coming out of ILM. We can't short-circuit this code path somehow, can we? If not, I suppose the proper fix would be to move away from the node client and invoke the actions directly via the transport service (or some other means that would fix things, I guess :))? |
Unfortunately, I don't think using transportService directly is going to make a difference here because |
One thing I am trying to find out is what are the indices specified for these requests. Do they have wildcards? If they do not contain wildcards, we should technically be able to short circuit the step of loading authorized indices (which is what the slow logs are about). |
Sorry for the massive delay here @ywangd. The problem with these requests still remains in master: we execute one request per index that ILM wants stats for. As a result, thousands of individual stats requests each run this expensive authorization for a single index.

    private boolean requiresWildcardExpansion(IndicesRequest indicesRequest) {
        // IndicesAliasesRequest requires special handling because it can have wildcards in request body
        if (indicesRequest instanceof IndicesAliasesRequest) {
            return true;
        }
        // Replaceable requests always require wildcard expansion
        if (indicesRequest instanceof IndicesRequest.Replaceable) {
            return true;
        }
        return false;
    }

This check returns true for these requests even though, in this specific case, each of them is made for a single concrete index. Is there any easy fix available to us here? Stats requests seem to be the last slow authz requests left in our benchmarking in all situations, and they hurt quite a bit because ILM uses them so extensively. Thanks! |
I have a draft PR #81237 to avoid loading authorized indices if the requested indices do not contain wildcards. The PR is promising and just needs some polish. Which version do you want this change to be part of: 8.1, 8.0, or 7.17? |
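For readers following along, the general idea of skipping the expensive load when every requested name is concrete can be sketched roughly as below. This is not the code from #81237: the class `AuthzShortCircuitSketch`, both method names as used here, and the `Supplier` standing in for the authorized-indices lookup are all hypothetical, and the wildcard handling is deliberately simplified (it only looks at `*` and `_all`, ignoring the other expression forms real index resolution has to deal with).

```java
import java.util.Arrays;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Supplier;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration only; none of these names come from Elasticsearch.
class AuthzShortCircuitSketch {

    // Simplified check: an expression needs expansion if it is "_all" or contains '*'.
    static boolean isWildcard(String expression) {
        return expression.equals("_all") || expression.indexOf('*') >= 0;
    }

    // An empty request is treated as "all indices" here, so it also needs expansion.
    static boolean requiresWildcardExpansion(List<String> requested) {
        return requested.isEmpty() || requested.stream().anyMatch(AuthzShortCircuitSketch::isWildcard);
    }

    // Resolves the requested expressions. The (assumed expensive) supplier is
    // only invoked when at least one expression actually needs expansion.
    static List<String> resolve(List<String> requested, Supplier<Set<String>> authorizedIndices) {
        if (!requiresWildcardExpansion(requested)) {
            // Concrete names only: each name can be authorized on its own,
            // so the full authorized set never has to be computed.
            return requested;
        }
        Set<String> authorized = authorizedIndices.get();
        if (requested.isEmpty()) {
            return new ArrayList<>(new TreeSet<>(authorized));
        }
        Set<String> resolved = new TreeSet<>();
        for (String expression : requested) {
            if (expression.equals("_all")) {
                resolved.addAll(authorized);
            } else if (isWildcard(expression)) {
                // Turn the glob into a regex: literal pieces joined by ".*".
                Pattern pattern = Pattern.compile(
                    Arrays.stream(expression.split("\\*", -1))
                        .map(Pattern::quote)
                        .collect(Collectors.joining(".*")));
                for (String name : authorized) {
                    if (pattern.matcher(name).matches()) {
                        resolved.add(name);
                    }
                }
            } else {
                resolved.add(expression);
            }
        }
        return new ArrayList<>(resolved);
    }
}
```

With a per-index caller such as the ILM stats requests described above, every call takes the first branch, so the cost of building the authorized set is paid only by requests that really use wildcards.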
Thanks @ywangd , I think 8.1 is just fine here. We have other issues around the stats calls that gate it in spots not related to authz and the fixes will only land in 8.1 probably. If it's trivial to make it land in 8.0 it's still a win without those, but it's not worth rushing anything in any way. |
@original-brownbear Is it possible to have a benchmark for the in-progress PR #81237? It would be great to have concrete benchmark proof for its review and approval. Thanks! |
I'm on it @ywangd! Will try to get you results today |
@original-brownbear Should this be closed since the PR (#81237) was merged a while back? |
Right thanks for the ping Yang! |
This is a bit of a sub-issue of #67987, but I wonder if we can take shortcuts here similar to the ones we took for, e.g., local field caps requests.
For stats we have a lot of slow logging like this:
on the master node once monitoring asks for stats. We do not have similar logging for field caps requests, so something must be different here in how the fan-out works. Could we short-circuit it here as well for a quick-win and more stability on transport threads?
We also see the same thing for setting updates (these are node local requests from ILM):
Also, we do see non-trivial CPU going into the internal DS rollover requests sent by ILM that are all node-local as well.
Can we possibly have more quick wins here for this class of requests @ywangd?