-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve default config values #5077
Conversation
looping in @cyriltovena who had an opinion about the change to |
docs/sources/configuration/_index.md
Outdated
@@ -1009,7 +1009,7 @@ lifecycler: | |||
# Number of times to try and transfer chunks when leaving before | |||
# falling back to flushing to the store. Zero = no transfers are done. | |||
# CLI flag: -ingester.max-transfer-retries | |||
[max_transfer_retries: <int> | default = 10] | |||
[max_transfer_retries: <int> | default = 0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this one as it's covered in #4792
Before I do a review, please consider that a change to any default value should probably be itemized in the Upgrading section of the docs. There is a 2.4.0 section titled "Change of some default limits to common values" that you might model these default value changes after. I realize that these are changed for performance reasons, but it impacts Loki users who already have running clusters and are upgrading to the new version. They might wish to update their configuration due to the new defaults. |
Good point, I'll update the upgrading guide on Monday. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
better default for sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good to me, but I would like to warn about enabling align_queries_with_step
. When the steps are too large, it changes the query interval too much. A 2h
step could modify the start and end time by up to 2h
because we round them down, see https://github.com/cortexproject/cortex/blob/master/pkg/querier/queryrange/step_align.go#L20-L21
I would suggest removing the step align feature and always aligning queries by the split interval. However, we should not change the default value for now or hold onto merging this PR until we can discuss it in a Loki call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Btw, maybe it is worth it to add an entry in the upgrading guide with instructions on how to deal with the new default values?
edit: nvm Karen already mentioned this 😄
@@ -396,7 +396,7 @@ results_cache: | |||
# Perform query parallelisations based on storage sharding configuration and | |||
# query ASTs. This feature is supported only by the chunks storage engine. | |||
# CLI flag: -querier.parallelise-shardable-queries | |||
[parallelise_shardable_queries: <boolean> | default = false] | |||
[parallelise_shardable_queries: <boolean> | default = true] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a big thing to change as it increases parallelism by ~16x for current schema defaults. On larger Loki clusters, it'll definitely be an improvement, but it's likely to be slower for smaller ones, which is why we haven't changed this default before. I'm definitely worried for the effects of defaulting this to true
alongside the split_queries_by_interval
changes. Perhaps we just change split_queries_by_interval
and leave parallelise_shardable_queries=false
? WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm, yeah I did not consider that impact on smaller clusters. I think in general our defaults should be geared for smaller clusters, so I'm in favor with setting this back to false and just keeping the split_queries_by_interval
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trevorwhitney @owen-d
Some clusters have an issue with this default setting: #4613
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trevorwhitney I think your link to the pull request is wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed it was. Fixed
This provides an additional buffer on top of the max_chunk_age. Co-authored-by: Owen Diehl <[email protected]>
Co-authored-by: Owen Diehl <[email protected]>
I rolled back the changes to |
I think @slim-bean had some opinions around keeping the change to |
* improve default config values * change defaults for upstream query range package * update changelog * remove max_tranfer_retries fix as it is covered in another PR, 4792 * increase query ingesters within by 1h This provides an additional buffer on top of the max_chunk_age. Co-authored-by: Owen Diehl <[email protected]> * add comment reminder to remove query range config hack Co-authored-by: Owen Diehl <[email protected]> * rollback change to align_queries_with_step and parallelise_shardable_queries * add upgrading docs * re-enable parallelise_shardable_queries by default * add parallelise_shardable_queries back to upgrading doc Co-authored-by: Owen Diehl <[email protected]> (cherry picked from commit 3091ccd)
* improve default config values (#5077) * improve default config values * change defaults for upstream query range package * update changelog * remove max_tranfer_retries fix as it is covered in another PR, 4792 * increase query ingesters within by 1h This provides an additional buffer on top of the max_chunk_age. Co-authored-by: Owen Diehl <[email protected]> * add comment reminder to remove query range config hack Co-authored-by: Owen Diehl <[email protected]> * rollback change to align_queries_with_step and parallelise_shardable_queries * add upgrading docs * re-enable parallelise_shardable_queries by default * add parallelise_shardable_queries back to upgrading doc Co-authored-by: Owen Diehl <[email protected]> (cherry picked from commit 3091ccd) * Correct split_queries_by_interval lost in cherry-pick
This removes several config values that are now being set to their default. frontend_worker.parallelism is the only actual change, which *should* be set to the same as max_concurrent anyway. See grafana/loki#5077 for some more info
What this PR does / why we need it:
Default
parallelise_shardable_queries
totrue
(wasfalse
)Default
split_queries_by_interval
to30m
(was0s
)Default(kept the previous default due to the result of this change on queries with very large steps)align_queries_with_step
totrue
(wasfasle
)Default
query_ingesters_within
andmax_chunk_age
both toto2h
3h
and2h
respectively (the addtional hour provides a buffer on top ofmax_chunk_age
). Previously,max_chunk_age
was set to1h
andquery_ingesters_within
was set to0
, meaning always query ingesters. An ingester with no data for a query will return quickly, so the performance improvement by defaultingquery_ingesters_within
to2h
is minimal, and it does introduce complexity, since now if themax_chunk_age
is increased, you could end up with un-queryable data on the ingester. I'm open to reverting this one based on feedback?Default
max_concurrent
to the default value ofparallelism
, which is10
(was20
, but should be the same asparallelism
).** Update **
After discussion, this PR contains of the subset of the changes above. Updates were made inline above using
strikethroughChecklist
CHANGELOG.md
about the changes.