Default terminate_after for autocomplete requests to unlimited #38073

Closed
flash1293 opened this issue Jun 5, 2019 · 14 comments

Labels: enhancement (New value added to drive a business result), Team:Visualizations (Visualization editors, elastic-charts and infrastructure)

@flash1293 (Contributor)

Currently, the ES queries that fetch terms for autocomplete suggestions in filter controls and the query bar have a hard-coded terminate_after limit to keep server load low.
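
For illustration, such a request could look like the sketch below. The terminate_after value of 100000 and the 1s timeout come from this discussion; the index, field, and typed-prefix regex are hypothetical, and the real query Kibana builds may differ:

```
POST /my-index/_search
{
  "size": 0,
  "timeout": "1s",            // soft per-request time budget (discussed below)
  "terminate_after": 100000,  // stop collecting after this many documents per shard
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "machine.os.keyword",  // hypothetical field being autocompleted
        "include": "win.*"              // hypothetical regex for the prefix the user typed
      }
    }
  }
}
```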

With #17388, a setting for this value is being introduced so that users can get better results in some circumstances.

IMHO it makes sense to default to no terminate_after limit for these requests but still provide a setting to cap it. The setting boils down to a tradeoff between accuracy and resource usage: with the default terminate_after of 100000, requests are very fast but often inaccurate, which is very annoying for users because they can't select terms they know are present in their dataset.

I'm always in favor of preventing a "Killbana" situation, but I think a more sensible default would be to be as accurate as possible and provide an option to regain speed in clusters where this becomes a problem. In addition to terminate_after there is also a timeout of 1 second, so even without terminate_after the request wouldn't run for 30s like requests from visualizations do. This change therefore doesn't seem like a cluster killer, and if it becomes one in certain situations, the warning label from #37643 pops up and hints at the problem.

Because this is a breaking behavior change, we should introduce it starting with 8.0 and label it as a breaking change.

There are some assumptions in this and I could be totally wrong about them. What do you think, @timroes @nreese @Bargs?

@elasticmachine (Contributor)

Pinging @elastic/kibana-app

@Bargs (Contributor) commented Jun 5, 2019

I'm wary of changing the default. We generally don't want to ship features that don't scale out of the box. If we make the setting configurable, I think the question we have to ask ourselves is what's worse:

A. Killing a cluster unexpectedly, so the admin has to figure out why, find this setting, change it, and get everything running again.

or

B. A user gets inaccurate results, complains to the admin and gives them the details of the warning message; the admin increases the setting carefully and makes their user happy.

> requests are very fast but often inaccurate, which is very annoying for users because they can't select terms they know are present in their dataset.

Do we have a lot of users actually complaining about this? I haven't heard of many, but maybe I'm out of the loop.

> The setting boils down to a tradeoff between accuracy and resource usage.

Not just resource usage but also usability. Autocomplete needs to be fast in order to be useful. If every update takes a full second, I think it will be annoying.

> there is also a timeout of 1 second

I might be wrong, or my knowledge might be out of date, but I think the timeout may not actually kill the task on the ES side.

> This change therefore doesn't seem like a cluster killer, and if it becomes one in certain situations, the warning label from #37643 pops up and hints at the problem.

If terminate_after is not set by default, what would trigger the warning label, and what would it say? I'm having a hard time imagining a warning that makes sense to the average user, since this would be more of an admin concern.

@flash1293 (Contributor, Author) commented Jun 6, 2019

Thanks for your input, @Bargs.

> We generally don't want to ship features that don't scale out of the box

Does the current default setting scale? Clusters with a lot of data won't provide accurate terms lists.

> I think the question we have to ask ourselves is what's worse:
> A. Killing a cluster unexpectedly, so the admin has to figure out why, find this setting, change it, and get everything running again.
> or
> B. A user gets inaccurate results, complains to the admin and gives them the details of the warning message; the admin increases the setting carefully and makes their user happy.

You are right, but we also have to consider how frequently these events happen. If 5000 people have to go through the effort of reconfiguring their cluster, it might be worth having one user experience a cluster crash that support can help figure out. This might be my lack of experience, but the chance of these requests killing the server, in contrast to big dashboards which fire lots of requests with 30s timeouts, seems pretty small. I'm aware there will be multiple autocomplete requests per user, but not multiple requests per second per user.

> Do we have a lot of users actually complaining about this? I haven't heard of many, but maybe I'm out of the loop.

I've heard of at least two support cases in the last two weeks, and issue #17388 has received some reactions and repeated questions.

> Not just resource usage but also usability. Autocomplete needs to be fast in order to be useful. If every update takes a full second, I think it will be annoying.

Not having complete results is also annoying. I might be wrong here, but I think users tolerate a slightly longer wait better than missing the option they are looking for. My point is: if we recommend that users increase the limit as soon as it gets hit, why bother setting it to this value at all? It just becomes an extra configuration step for them.

> I might be wrong, or my knowledge might be out of date, but I think the timeout may not actually kill the task on the ES side.

You are right, 1s is not a hard upper limit; the actual request might take a bit longer.

> If terminate_after is not set by default, what would trigger the warning label, and what would it say? I'm having a hard time imagining a warning that makes sense to the average user, since this would be more of an admin concern.

If the request times out instead of terminating early because of terminate_after, the user sees the same error message: "Terms list is incomplete. Adjust the autocomplete settings in kibana.yml for more results."

I totally see your point. What isn't clear to me, and what I might have underestimated, is how resource-intensive and/or slow these kinds of queries get on big clusters without the terminate_after setting. Will they run into the timeout in this case? If they regularly will on production systems, we should definitely keep our current default. Or are we optimizing too far here?

@Bargs (Contributor) commented Jun 6, 2019

> there will be multiple autocomplete requests per user, but not multiple requests per second per user

We have a 100ms debounce on the suggestions update, so technically there could be 10 requests per user per second.

> What isn't clear to me, and what I might have underestimated, is how resource-intensive and/or slow these kinds of queries get on big clusters without the terminate_after setting. Will they run into the timeout in this case?

This I can't answer for certain. @lukasolson and @jpountz helped with these optimizations so they might have some input on that.

@jpountz commented Jun 7, 2019

These requests can certainly be very resource-intensive on large datasets, so in my opinion we need some protection anyway. terminate_after was mostly used as a workaround for the fact that Elasticsearch doesn't honor the timeout parameter very well: it could take minutes before Elasticsearch figured out that a request had timed out and could stop processing it. In short, the issue is that we currently only check the elapsed execution time when moving on to a new segment. In the worst-case scenario of a very large force-merged shard (which is typically the case for warm shards), this means we never check the execution time, so the request never stops prematurely because of the timeout.

We recently improved this situation via elastic/elasticsearch#42291 by also checking the execution time while collecting documents within a segment, which is expected to land in 7.3. So we could look into removing the usage of terminate_after for this feature and relying only on timeout in Kibana 7.3+.
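
To make that direction concrete, dropping terminate_after and relying on the timeout alone would look roughly like this (a sketch with a hypothetical index and field, not the actual query Kibana sends):

```
POST /my-index/_search
{
  "size": 0,
  "timeout": "1s",  // no terminate_after; from ES 7.3 on, the timeout is also checked
                    // while collecting documents within a segment (elastic/elasticsearch#42291)
  "aggs": {
    "suggestions": {
      "terms": { "field": "machine.os.keyword" }  // hypothetical field
    }
  }
}
```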

@flash1293 (Contributor, Author)

Awesome, thanks for your input, @jpountz.
This change was planned for 8.0 anyway, but I think it makes sense to revisit it when we near 8.0 and see what makes sense then, because a lot will probably have changed by that point.

@0xtf commented May 19, 2020

I've been going through some issues to understand the current state of this one and others related to it. As a user who encounters the notice to adjust the autocomplete settings in Kibana in all my clusters, I'd love to be able to address it.

From my understanding, there is a 1s timeout for populating the fields that go into the control visualization. Is this still a hard-coded value, or can it be changed (as far as I can tell, the PR to allow this still hasn't made it to master)?

Sorry if this isn't the best way to talk about this.

@nreese (Contributor) commented May 19, 2020

These values can be configured via the kibana.yml settings kibana.autocompleteTimeout and kibana.autocompleteTerminateAfter. The ability to configure them was included in Kibana 7.3.
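
A minimal kibana.yml sketch of these settings (the values below are illustrative, and the units are assumptions: a timeout in milliseconds and a per-shard document count):

```yaml
# kibana.yml (Kibana 7.3+) — illustrative values, not recommendations
kibana.autocompleteTimeout: 1000            # assumed milliseconds
kibana.autocompleteTerminateAfter: 100000   # assumed max documents examined per shard
```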

@0xtf commented May 19, 2020

Guess I'm running into another "Not yet in Cloud" situation:

`'kibana.autocompleteTimeout': is not allowed`

Is there a way of seeing which options are available for cloud deployments?

@nreese (Contributor) commented May 19, 2020

cc @elastic/cloud-pms

@maggieghamry (Contributor)

I also had a similar issue; the only workaround was setting kibana.autocompleteTerminateAfter: 2000000.

@0xtf commented Mar 12, 2021

Was that in Cloud, @maggieghamry?

@maggieghamry (Contributor) commented Mar 12, 2021

@0xtf ECE 2.6 & Kibana 7.11.1

@markov00 (Member)

Closing, as it is now possible to configure this: kibana.autocompleteTerminateAfter has been replaced by unifiedSearch.autocomplete.valueSuggestions.terminateAfter and unifiedSearch.autocomplete.valueSuggestions.timeout.
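
For anyone landing here later, a sketch of the equivalent configuration under the newer setting names (names taken from the comment above; values and units are assumptions):

```yaml
# kibana.yml (newer Kibana) — illustrative values, not recommendations
unifiedSearch.autocomplete.valueSuggestions.terminateAfter: 100000  # assumed max docs per shard
unifiedSearch.autocomplete.valueSuggestions.timeout: 1000           # assumed milliseconds
```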
