Improve Authorization performance in clusters with a large number of indices #67987
Pinging @elastic/es-security (Team:Security)
Investigation notes:
Hey @albertzaharovits, I think we're hitting exactly this issue in a cluster with ~2500 indices. Elasticsearch running on Kubernetes, orchestrated by ECK, is exhibiting readiness probe failures that we think can be traced back to this.
Much more detail can be found in the following forum discussion: https://discuss.elastic.co/t/elasticsearch-readiness-probe-failures/270581 (edit for clarification: the issue seems to significantly limit indexing throughput in this setup). This kind of makes x-pack security unusable for large clusters (our current v5 production cluster is several times larger than this, and we're looking at upgrading now). Do you know if there is traction on this (and the related #68004)? Thank you. Cheers, János
Hi @jcsorvasi! I appreciate your input so far. We're currently investigating, but not committing to a fix just yet. I understand there are ~2500 indices in the cluster that you're ingesting into. I have a couple of follow-up questions to that:
If possible, can you create a JFR dump for the node, and share it with us? You need to restart the node with flight recording enabled in its JVM options, and then trigger a dump when the ingest node is hiccuping.
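A minimal sketch of what that can look like (assuming a JDK where Java Flight Recorder is available without a commercial license, e.g. JDK 11+; file names, sizes and the PID are placeholders):

```sh
# in the node's jvm.options (placeholder values):
#   -XX:StartFlightRecording=disk=true,maxsize=512m,dumponexit=true,filename=/tmp/es-node.jfr

# then, while the node is struggling, dump the in-flight recording:
jcmd <elasticsearch-pid> JFR.dump filename=/tmp/es-node-hiccup.jfr
```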
In addition to the stack traces, the recording would provide us with information on the memory load, to help evaluate the caching opportunities. I am thankful for whatever details you can provide to help us! Also, from the forum link https://discuss.elastic.co/t/elasticsearch-readiness-probe-failures/270581 it looks like you've got some CPU capacity left. In this case I think you could experiment with increasing the number of transport threads, so that, even if inefficient, authz can saturate more of the available CPU capacity, increasing throughput, i.e., try setting on the ingest node:
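(The exact setting name depends on the Elasticsearch version and transport implementation; for the default Netty4 transport in 7.x it would look something like the following in elasticsearch.yml, with the value sized against the spare CPU capacity.)

```yaml
# illustrative value only; the default is derived from the allocated processors
transport.netty.worker_count: 16
```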
Hey @albertzaharovits, Let me answer some of your questions quickly now; gathering Flight Recorder data is going to take some time, as we've destroyed the cluster already (see the note on that in my latest comment on the forum post). I'll hopefully have an opportunity to rebuild it sometime next week and get you the data you need, as well as do some comparisons between identical clusters receiving the same data, one with xpack security turned on and one with it off.
In short: aliases. Our index setup is basically per-app weekly indices with daily aliases (from before data streams were a thing). For example:
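(The names below are made up, purely to illustrate the shape of the setup.)

```
app-foo-2021.w07     # weekly index for one app
  aliases: app-foo-2021.02.15, app-foo-2021.02.16, ..., app-foo-2021.02.21
app-bar-2021.w07
  aliases: app-bar-2021.02.15, app-bar-2021.02.16, ..., app-bar-2021.02.21
```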
Indices and aliases are pre-created for the future by a management tool. Just to give you some numbers, since we rebuilt the (test) cluster on Tuesday and started ingesting production data into it, the cluster now has 1116 indices, 4852 aliases and 3846 active shards. Our current live (v5) production cluster has a rolling retention of 32 days, with 7646 indices, 43126 aliases and 30244 active shards. The indexer (logstash with the elasticsearch output) is configured with the following parameter:
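(Roughly along these lines; the host and the `app` field reference are placeholders, the relevant bit being that the `index` option resolves to the per-app daily alias.)

```
output {
  elasticsearch {
    hosts => ["https://elasticsearch:9200"]
    # each event is routed to its app's daily alias
    index => "%{[app]}-%{+YYYY.MM.dd}"
  }
}
```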
This way (almost) all index requests are sent to their respective alias for the current day, and so each bulk request has the potential to include up to 6000 index requests, possibly touching hundreds of indices or more.
I would say most of the indices. As I mentioned, there are per-app indices and how much an app logs varies wildly. Some of the chattiest apps sometimes produce tens of thousands of messages per second, some only a handful for the entire day.
We didn't go far into using all the security features; the setup is basically what ECK sets up. All apps around ES (logstash, kibana, etc.) use the built-in
Vanilla ECK setup, no extra users or roles created by us.
Once a day a management tool runs that pre-creates aliases (per app) for the future, ~1300 aliases created (roughly along the lines of the sketch below). Thanks for the tip on increasing the transport thread count; I'll try it out next week!
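(A guess at what a single pre-creation call could look like, using the standard _aliases API; the index and alias names are made up.)

```
POST _aliases
{
  "actions": [
    { "add": { "index": "app-foo-2021.w08", "alias": "app-foo-2021.02.22" } },
    { "add": { "index": "app-foo-2021.w08", "alias": "app-foo-2021.02.23" } }
  ]
}
```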
Hi @albertzaharovits: We are facing exactly the same performance issue with authorization. We have around 3K indices and 20K aliases. After upgrading to 7.9.0 with X-Pack authentication, all the ingest and search operations have slowed down drastically. Our ingest latency used to be around 20ms; now it's around 500ms. All the hot_threads outputs show transport_worker threads with loadAuthorizedIndices in the stack trace. We are also noticing that the CPU is not fully utilized in each pod; however, the CPU % from _cat/nodes shows 100% usage. We have created a user and assigned the superuser role. Is there any recommended tuning to improve the performance? Is there a way to disable index authorization?
Hi! We also have a large cluster impacted by this issue. We upgraded from ES 6.2.3 with search guard to ES 7.9.3 with xpack.
Our stacks from hot_threads are similar to those provided in the first post, and they are also similar in the Search Guard case. Also, the load on the masters is quite different:
Also, the interval of full cluster restart is different:
@ilavajuthy @orinciog Thank you for the new input on this one! TL;DR: I strongly recommend you upgrade to 7.16, which just came out, or at least I would be grateful if you could run the same benchmarks on it. There are many changes between the 6.2 and 7.9 releases, but honestly we didn't benchmark them comparatively. I suspect, though, that the performance issues might not be directly caused by the Security code changes. At the moment we are comfortable with the state of this in 7.16, and we're not planning any changes soon. Thank you again for taking the time to report it.
Hi @albertzaharovits! Right now it's a little bit tricky for us to update to 7.16 (we have a series of Kibana legacy plugins and we are stuck with Kibana 7.9.3, the latest version of Kibana that supports legacy plugins). I want to ask if you compared the performance of ES 7.x and 7.16 on large clusters (high number of indices, high ingest rate) and if you have some comparisons between them (with xpack and without xpack). From our numbers, on 7.9.3 with xpack enabled the performance is about 10x worse than without xpack (all other factors remaining the same: ingest rate, number of machines, etc.). If you have such comparisons and their results are OK, then we could try to migrate to the latest version. Thank you again for your effort and input,
Hi @albertzaharovits: Thanks for the details; good to know 7.16 has fixes to improve performance. Moving to 7.16 will take some time for us. In the meantime, is there any option to disable authorization and have only authentication for superuser roles?
Hi @orinciog! I can't reliably help you out with benchmarking, I'm sorry. If you go about benchmarking on your own, you can start from this recent blog post: https://www.elastic.co/blog/seven-tips-for-better-elasticsearch-benchmarks, and please do let us know about the test scenario (total number of indices/aliases/data streams, the number of indices/aliases/data streams that the user(s) are authorized for, the number of cluster nodes, whether you're benchmarking ingestion or searching, and, if searching, how many indices are searched in a single request, since search requests can contain wildcards). Hi @ilavajuthy! Another option would be to implement a plugin with a custom authorization engine, see:
@albertzaharovits
We run a massive 300+ server cluster, and we started seeing a drop in throughput as our number of indices increased. About 15 days after we went live in production, as the number of indices grew, we started seeing slowness in our ingestion pipelines. Upon further investigation, we figured out the RBAC engine was causing slowness for the overall cluster. We did a rolling upgrade from 7.10.2 to 7.17.8 and we are seeing a massive 2x-3x throughput improvement on the same hardware and with the same workload. On 7.10.2, for a 300 node cluster, we were earlier not able to go beyond 450k indexing operations per second.
`RBACEngine#resolveAuthorizedIndicesForRole` computes a list of all the indices that a role grants access to, given the request action. This list is subsequently used during the evaluation of the request's index name expression.

The issue is that the list can be large. Worse, the list creation, which is expensive in both time and memory (e.g. the repeated resizings), might take place on `transport_worker` threads that have to multiplex (select) reads between multiple TCP connections (channels, in netty/nio terminology), effectively delaying responses on the other connections (which might lead to cluster instability).

A list is also wasteful because the request's wildcard will generally match only a small subset of the full list, and that subset is the only data that needs to be stored in the re-written request. Ideally, the "authorized indices" full list should instead be an `Iterable` generator, which doesn't store all the names in memory (see the sketch at the end of this description). The generator refactoring can be further improved by another refactoring, in which we traverse the authorized indices only once.

Here's a very long list of very long stack traces from a hot threads output, that I've based my above analysis on:
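A rough sketch of the lazy/`Iterable` idea (illustrative only, not the actual Elasticsearch implementation; the class, interface and method names below are made up):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

// Contrasts eagerly materializing the authorized-indices list with exposing
// it as a lazy view that is only consumed on demand.
final class AuthorizedIndicesSketch {

    /** Hypothetical stand-in for the role's index-permission check. */
    interface Role {
        boolean allows(String action, String indexOrAlias);
    }

    /**
     * Eager approach (roughly what the issue describes): builds the full list
     * of authorized names up front, paying for allocations and resizings on
     * the transport_worker thread, even though the request's wildcard will
     * usually match only a small subset of it.
     */
    static List<String> eagerAuthorizedIndices(Role role, String action, Iterable<String> allNamesInCluster) {
        return StreamSupport.stream(allNamesInCluster.spliterator(), false)
                .filter(name -> role.allows(action, name))
                .collect(Collectors.toList());
    }

    /**
     * Lazy alternative: a generator over the cluster's index/alias names.
     * Nothing is copied; each name is checked only when the wildcard resolver
     * actually iterates, and the names can be traversed exactly once.
     */
    static Iterable<String> lazyAuthorizedIndices(Role role, String action, Iterable<String> allNamesInCluster) {
        return () -> StreamSupport.stream(allNamesInCluster.spliterator(), false)
                .filter(name -> role.allows(action, name))
                .iterator();
    }

    /**
     * For concrete (non-wildcard) names, a simple predicate avoids touching
     * the full name set at all.
     */
    static Predicate<String> isAuthorized(Role role, String action) {
        return name -> role.allows(action, name);
    }
}
```

With a view like this, the re-written request only stores the names that actually matched, instead of a copy of the full authorized list.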