Add production tunable configuration #259
Conversation
Force-pushed 4437678 to 3f1d6ef (compare)
charts/redpanda/values.yaml (outdated)
  topic_partitions_per_shard: 1000
  compacted_log_segment_size: 67108864 # 64 mb
  max_compacted_log_segment_size: 536870912 # 512 mb
  kafka_connections_max: 15100
Setting a connection limit probably doesn't make sense if you don't know how large the machines you're installing onto are.
Does core document the mapping between CPU and memory and the machine type?
Nope. In cloud SaaS contexts, the connection count is part of the definition of the product; we do not have a matrix of how many connections a given self-hosted machine type can handle (with memory limits it's hard to define, because a system with lots of partitions can handle fewer clients, etc., so it depends on the workload).
The default for self-hosted clusters today is not to apply a connection count limit.
charts/redpanda/values.yaml (outdated)
  max_compacted_log_segment_size: 536870912 # 512 mb
  kafka_connections_max: 15100
  kafka_connection_rate_limit: 1000
  partition_autobalancing_mode: "continuous"
Continuous autobalancing requires a license, so it should not be on by default unless Helm can know whether it's installing with a license.
OK, thanks. I will see if there is any way to check that value and reject it if a license is not provided.
You can just check that one of these is set:
helm-charts/charts/redpanda/values.yaml, lines 45 to 46 in e228c87:
  license_key: ""
  license_secret_ref: {}
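For example, a render-time guard could enforce this. Here is a minimal sketch, assuming the config.tunable layout from this PR and the license_key / license_secret_ref values quoted above; the placement and error message are hypothetical, not the chart's actual implementation:

```yaml
{{- /* Sketch: refuse to render when continuous autobalancing is requested
       but no license value is provided (license_key / license_secret_ref
       and config.tunable are assumed to exist as shown in this PR). */}}
{{- $tunable := default dict .Values.config.tunable }}
{{- if eq (default "" $tunable.partition_autobalancing_mode) "continuous" }}
  {{- if and (empty .Values.license_key) (empty .Values.license_secret_ref) }}
    {{- fail "partition_autobalancing_mode=continuous requires an Enterprise license; set license_key or license_secret_ref" }}
  {{- end }}
{{- end }}
```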
charts/redpanda/values.yaml (outdated)
  kafka_connections_max: 15100
  kafka_connection_rate_limit: 1000
  partition_autobalancing_mode: "continuous"
  cloud_storage_segment_max_upload_interval_sec: 3600 # 3600 sec = 1 hour
The upload interval setting should be removed; it only makes sense in the context of some intended RPO.
Does Redpanda have documentation on how an end user can determine a sane default? How would they discover the right value for their workload? What are the benefits of setting it low? How will it influence the Redpanda cluster?
It's complicated (we do not provide official guidance for this at the moment; scale testing is ongoing). For things like this where there is no simple answer, I don't think the Helm chart is the place to try to address this for self-hosted systems: our lives will be simpler if Helm just inherits the Redpanda default (which is null) and thereby matches other self-hosted systems.
@@ -550,7 +550,22 @@ config:
   # tm_sync_timeout_ms: 2000ms # Time to wait state catch up before rejecting a request
   # tm_violation_recovery_policy: crash # Describes how to recover from an invariant violation happened on the transaction coordinator level
   # transactional_id_expiration_ms: 10080min # Producer ids are expired once this time has elapsed after the last write with the given producer ID
-  tunable: {}
+  tunable:
+    log_segment_size: 134217728 # 128 mb
There's a docs ticket here, https://github.com/redpanda-data/documentation/issues/1000, for describing how Cloud systems have different defaults than self-hosted systems.
If self-hosted (Helm) systems start using a different set of defaults too (which will probably diverge from Cloud eventually), then that should be flagged to the docs team as well.
I mentioned it to the documentation team in the ticket you attached: https://github.com/redpanda-data/documentation/issues/1000#issuecomment-1380099331.
  partition_autobalancing_mode: "continuous"
  cloud_storage_segment_max_upload_interval_sec: 3600 # 3600 sec = 1 hour
  group_topic_partitions: 16
  # cloud_storage_enable_remote_read: true # cluster wide configuration for read from remote cloud storage
Commented lines left in by mistake?
No, @vuldin provided a lot of examples in the values.yaml. This is normal in Helm charts.
No, commented lines are left in as documentation.
So now that we are adding these items, what are the side effects for existing clusters and for upgrades from previous versions?
This comment is pinned to line 566. The parameter has been available in Redpanda since version 22.11.X, as far as I remember. It applies if you have a license (not enforced yet) and have configured Tiered Storage: https://docs.redpanda.com/docs/platform/data-management/remote-read-replicas/#creating-a-topic-with-archival-storage-or-tiered-storage
Be aware that this will make the Helm chart stop working with pre-v22.3.x Redpandas. I don't mind, but I don't know if we document anywhere which Redpanda versions the Helm chart is meant to work with.
I assume you mean that these defaults will do that. The user can override any of these defaults.
What's the source for these values? I think you said these came from some performance-tested configuration that's documented somewhere, didn't you?
It is in our internal repo: https://github.com/redpanda-data/team-kubernetes-internal/issues/3
Right, it just means the defaults won't work with older clusters. It might not be super easy for the user to figure out why; they'd have to go look at the Redpanda logs, when their pods won't stay up, to see the resulting errors.
Force-pushed 3f1d6ef to 206ac16 (compare)
Just a question before I approve. What defaults applied before we added these fields? Are they aligned? What are the side effects of adding these now? If there is nothing to worry about, I'm OK with approving.
  partition_autobalancing_mode: "continuous"
  cloud_storage_segment_max_upload_interval_sec: 3600 # 3600 sec = 1 hour
  group_topic_partitions: 16
  # cloud_storage_enable_remote_read: true # cluster wide configuration for read from remote cloud storage
So now that we are adding these items, what are the side effects for existing clusters and for upgrades from previous versions?
I will update the cover letter to address this comment with the appropriate link to the source code.
You mean, what are the side effects of adding this if a cluster is already running? It depends on the cluster version. As the Helm chart doesn't have a validation webhook, I will probably use the (clunky) templating system to guard this.
Wow, thank you for that! I did not need you to do that, I just wanted to know whether these were known defaults, etc., but awesome!
If this is the case, we can pre-empt these changes in a release note for our users. Is there an easy way for them to determine what their current defaults are, so they can take action pre-emptively against these changes? I am assuming, of course, that they are not upgrading their Redpanda images.
Using the Admin API, they can see the values of the running cluster. There is even
@RafalKorepta why not have all of the segment sizes match at 128 MB?
Thank you for your responses!
Redpanda configuration that is cluster-wide and can be changed at runtime. The properties that are not available in 22.2.x or 22.1.x are stripped based on semver.
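For illustration, a minimal sketch of how such semver-based stripping could look in a helper template; the helper name, reading the version from the image tag, and the per-property minimum versions are assumptions made here, not necessarily how the chart actually implements it:

```yaml
{{- /* Sketch: drop tunable cluster properties that the targeted Redpanda
       version does not support. The helper name, the version source and the
       22.3+ assumption for kafka_connection_rate_limit are illustrative. */}}
{{- define "redpanda.versioned-tunable" -}}
{{- $version := trimPrefix "v" (default "22.3.1" .Values.image.tag | toString) -}}
{{- $tunable := deepCopy (default dict .Values.config.tunable) -}}
{{- if semverCompare "<22.3.0-0" $version -}}
  {{- $tunable = unset $tunable "kafka_connection_rate_limit" -}}
{{- end -}}
{{- toYaml $tunable -}}
{{- end -}}
```

The real chart may gate different properties or derive the version differently; the point is only that semverCompare lets the template omit keys that older brokers would reject.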
Force-pushed 206ac16 to 28baf5e (compare)
Added a template that will strip unknown properties, as described in the cover letter.
And I fixed a wrong invocation of
…/gh-priv-3/production-settings Add production tunable configuration
Redpanda configuration that is cluster-wide and can be changed at runtime.
REF:
#203