Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve default config values #5077

Merged
merged 10 commits into from
Jan 11, 2022
Merged

improve default config values #5077

merged 10 commits into from
Jan 11, 2022

Conversation

trevorwhitney
Copy link
Collaborator

@trevorwhitney trevorwhitney commented Jan 7, 2022

What this PR does / why we need it:
Default parallelise_shardable_queries to true (was false)
Default split_queries_by_interval to 30m (was 0s)
Default align_queries_with_step to true (was fasle) (kept the previous default due to the result of this change on queries with very large steps)

Default query_ingesters_within and max_chunk_age both to 2h to 3h and 2h respectively (the addtional hour provides a buffer on top of max_chunk_age). Previously, max_chunk_age was set to 1h and query_ingesters_within was set to 0, meaning always query ingesters. An ingester with no data for a query will return quickly, so the performance improvement by defaulting query_ingesters_within to 2h is minimal, and it does introduce complexity, since now if the max_chunk_age is increased, you could end up with un-queryable data on the ingester. I'm open to reverting this one based on feedback?

Default max_concurrent to the default value of parallelism, which is 10 (was 20, but should be the same as parallelism).

** Update **

After discussion, this PR contains of the subset of the changes above. Updates were made inline above using strikethrough

Checklist

  • Documentation added
  • Tests updated
  • Add an entry in the CHANGELOG.md about the changes.

@trevorwhitney
Copy link
Collaborator Author

looping in @cyriltovena who had an opinion about the change to query_ingesters_within.

@trevorwhitney trevorwhitney marked this pull request as draft January 7, 2022 18:32
@pull-request-size pull-request-size bot added size/M and removed size/S labels Jan 7, 2022
@trevorwhitney trevorwhitney marked this pull request as ready for review January 7, 2022 18:40
@@ -1009,7 +1009,7 @@ lifecycler:
# Number of times to try and transfer chunks when leaving before
# falling back to flushing to the store. Zero = no transfers are done.
# CLI flag: -ingester.max-transfer-retries
[max_transfer_retries: <int> | default = 10]
[max_transfer_retries: <int> | default = 0]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this one as it's covered in #4792

@KMiller-Grafana
Copy link
Contributor

Before I do a review, please consider that a change to any default value should probably be itemized in the Upgrading section of the docs. There is a 2.4.0 section titled "Change of some default limits to common values" that you might model these default value changes after.

I realize that these are changed for performance reasons, but it impacts Loki users who already have running clusters and are upgrading to the new version. They might wish to update their configuration due to the new defaults.

@trevorwhitney
Copy link
Collaborator Author

Good point, I'll update the upgrading guide on Monday.

Copy link
Contributor

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

better default for sure.

Copy link
Contributor

@sandeepsukhani sandeepsukhani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me, but I would like to warn about enabling align_queries_with_step. When the steps are too large, it changes the query interval too much. A 2h step could modify the start and end time by up to 2h because we round them down, see https://github.com/cortexproject/cortex/blob/master/pkg/querier/queryrange/step_align.go#L20-L21

I would suggest removing the step align feature and always aligning queries by the split interval. However, we should not change the default value for now or hold onto merging this PR until we can discuss it in a Loki call.

Copy link
Contributor

@DylanGuedes DylanGuedes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Btw, maybe it is worth it to add an entry in the upgrading guide with instructions on how to deal with the new default values?
edit: nvm Karen already mentioned this 😄

docs/sources/configuration/_index.md Outdated Show resolved Hide resolved
pkg/querier/querier.go Outdated Show resolved Hide resolved
@@ -396,7 +396,7 @@ results_cache:
# Perform query parallelisations based on storage sharding configuration and
# query ASTs. This feature is supported only by the chunks storage engine.
# CLI flag: -querier.parallelise-shardable-queries
[parallelise_shardable_queries: <boolean> | default = false]
[parallelise_shardable_queries: <boolean> | default = true]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a big thing to change as it increases parallelism by ~16x for current schema defaults. On larger Loki clusters, it'll definitely be an improvement, but it's likely to be slower for smaller ones, which is why we haven't changed this default before. I'm definitely worried for the effects of defaulting this to true alongside the split_queries_by_interval changes. Perhaps we just change split_queries_by_interval and leave parallelise_shardable_queries=false? WDYT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, yeah I did not consider that impact on smaller clusters. I think in general our defaults should be geared for smaller clusters, so I'm in favor with setting this back to false and just keeping the split_queries_by_interval

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trevorwhitney @owen-d
Some clusters have an issue with this default setting: #4613

Copy link
Collaborator Author

@trevorwhitney trevorwhitney Jan 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dfoxg commented in #4613, but for visibility, is this still an issue given the default change in #5204?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trevorwhitney I think your link to the pull request is wrong

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it was. Fixed

pkg/loki/loki.go Outdated Show resolved Hide resolved
trevorwhitney and others added 3 commits January 10, 2022 12:58
@trevorwhitney
Copy link
Collaborator Author

I rolled back the changes to align_queries_with_step and parallelise_shardable_queries given the concerns raised by @sandeepsukhani and @owen-d (thanks for the feedback!)

@trevorwhitney
Copy link
Collaborator Author

I think @slim-bean had some opinions around keeping the change to parallelise_shardable_queries, and has experience with smaller clusters and this not having too big a negative impact. I'm going to bring it back.

@owen-d owen-d merged commit 3091ccd into main Jan 11, 2022
@owen-d owen-d deleted the few-more-defaults-changes branch January 11, 2022 19:12
trevorwhitney added a commit that referenced this pull request Jan 11, 2022
* improve default config values

* change defaults for upstream query range package

* update changelog

* remove max_tranfer_retries fix as it is covered in another PR, 4792

* increase query ingesters within by 1h

This provides an additional buffer on top of the max_chunk_age.

Co-authored-by: Owen Diehl <[email protected]>

* add comment reminder to remove query range config hack

Co-authored-by: Owen Diehl <[email protected]>

* rollback change to align_queries_with_step and parallelise_shardable_queries

* add upgrading docs

* re-enable parallelise_shardable_queries by default

* add parallelise_shardable_queries back to upgrading doc

Co-authored-by: Owen Diehl <[email protected]>
(cherry picked from commit 3091ccd)
trevorwhitney added a commit that referenced this pull request Jan 12, 2022
* improve default config values (#5077)

* improve default config values

* change defaults for upstream query range package

* update changelog

* remove max_tranfer_retries fix as it is covered in another PR, 4792

* increase query ingesters within by 1h

This provides an additional buffer on top of the max_chunk_age.

Co-authored-by: Owen Diehl <[email protected]>

* add comment reminder to remove query range config hack

Co-authored-by: Owen Diehl <[email protected]>

* rollback change to align_queries_with_step and parallelise_shardable_queries

* add upgrading docs

* re-enable parallelise_shardable_queries by default

* add parallelise_shardable_queries back to upgrading doc

Co-authored-by: Owen Diehl <[email protected]>
(cherry picked from commit 3091ccd)

* Correct split_queries_by_interval lost in cherry-pick
james-callahan added a commit to james-callahan/kustomize-loki that referenced this pull request Jan 13, 2022
This removes several config values that are now being set to their default.
frontend_worker.parallelism is the only actual change, which *should* be set to the same as max_concurrent anyway.

See grafana/loki#5077 for some more info
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants