
Add production tunable configuration #259

Merged

Conversation

Contributor

@RafalKorepta RafalKorepta commented Jan 5, 2023

Redpanda configuration that is cluster-wide and can be changed at runtime.

| property | default | this PR | available from | source code link |
| --- | --- | --- | --- | --- |
| log_segment_size | 1 GiB | 128 MiB | at least 21.11.x | line 37 |
| log_segment_size_min | 1 MiB | 16 MiB | 22.3.x | line 47 |
| log_segment_size_max | null | 256 MiB | 22.3.x | line 56 |
| kafka_batch_max_bytes | 1 MiB | 1 MiB | 22.3.x | line 867 |
| topic_partitions_per_shard | 7000 | 1000 | 22.2.x | line 161 |
| compacted_log_segment_size | 256 MiB | 64 MiB | at least 21.11.x | line 74 |
| max_compacted_log_segment_size | 5 GiB | 512 MiB | at least 21.11.x | line 803 |
| kafka_connection_rate_limit | null | 1000 | 22.1.x | line 438 |
| group_topic_partitions | 16 | 16 | at least 21.11.x | line 507 |

REF:
#203
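For reference, the proposed sizes converted to the byte values the chart uses, laid out as a values.yaml fragment (an illustrative sketch of the `config.tunable` section, not the chart's final layout):

```yaml
# Illustrative sketch only: tunables from the table above, with sizes in bytes.
config:
  tunable:
    log_segment_size: 134217728                # 128 MiB
    compacted_log_segment_size: 67108864       # 64 MiB
    max_compacted_log_segment_size: 536870912  # 512 MiB
    kafka_batch_max_bytes: 1048576             # 1 MiB
    topic_partitions_per_shard: 1000
    kafka_connection_rate_limit: 1000
    group_topic_partitions: 16
```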

@RafalKorepta RafalKorepta force-pushed the rk/gh-priv-3/production-settings branch from 4437678 to 3f1d6ef Compare January 9, 2023 10:54
topic_partitions_per_shard: 1000
compacted_log_segment_size: 67108864 # 64 mb
max_compacted_log_segment_size: 536870912 # 512 mb
kafka_connections_max: 15100

Setting a connection limit probably doesn't make sense if you don't know how large the machines you're installing onto are.

Contributor Author


Does core document the mapping between CPU and memory and the machine type?


Nope. In cloud SaaS contexts, the connection count is part of the definition of the product; we do not have a matrix of how many connections a given self-hosted machine type can handle. (With memory limits it's hard to define, because a system with lots of partitions can handle fewer clients, etc., so it depends on the workload.)

The default for self-hosted clusters today is not to apply a connection count limit.

max_compacted_log_segment_size: 536870912 # 512 mb
kafka_connections_max: 15100
kafka_connection_rate_limit: 1000
partition_autobalancing_mode: "continuous"

Continuous autobalancing requires a license, so it should not be on by default unless Helm can know whether it's installing with a license.

Contributor Author


Ok, thanks. I will see if there is any way to check that value and reject it if a license is not provided.

Contributor


You can just check that one of these is set:

license_key: ""
license_secret_ref: {}
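A guard along those lines could look like this in the chart's templates (a hypothetical sketch using Helm's `fail` function; the actual value paths and helper structure in the chart may differ):

```yaml
{{- /* Hypothetical guard: reject continuous autobalancing when no license is configured. */}}
{{- if eq (.Values.config.tunable.partition_autobalancing_mode | default "") "continuous" }}
{{- if not (or .Values.license_key .Values.license_secret_ref) }}
{{- fail "partition_autobalancing_mode=continuous requires a license (set license_key or license_secret_ref)" }}
{{- end }}
{{- end }}
```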

kafka_connections_max: 15100
kafka_connection_rate_limit: 1000
partition_autobalancing_mode: "continuous"
cloud_storage_segment_max_upload_interval_sec: 3600 # 3600 sec = 1 hour

The upload interval setting should be removed; it only makes sense in the context of some intended RPO.

Contributor Author


Does Redpanda have documentation on how an end user can determine a sane default? How do they discover the right value for their workload? What are the benefits of setting it low? How will it influence the Redpanda cluster?


@jcsp jcsp Jan 12, 2023


It's complicated (we do not provide official guidance for this at the moment; scale testing is ongoing). For things like this where there is no simple answer, I don't think the Helm chart is the place to try to address this for self-hosted systems: our lives will be simpler if Helm just inherits the Redpanda default (which is null), and thereby matches other self-hosted systems.

@@ -550,7 +550,22 @@ config:
# tm_sync_timeout_ms: 2000ms # Time to wait state catch up before rejecting a request
# tm_violation_recovery_policy: crash # Describes how to recover from an invariant violation happened on the transaction coordinator level
# transactional_id_expiration_ms: 10080min # Producer ids are expired once this time has elapsed after the last write with the given producer ID
tunable: {}
tunable:
log_segment_size: 134217728 # 128 mb

There's a docs ticket here https://github.com/redpanda-data/documentation/issues/1000 for describing how Cloud systems have a different default than self hosted systems.

If self hosted (helm) systems start using a different set of defaults too (which will probably diverge from Cloud eventually), then that should be flagged to the docs team as well.

Contributor Author


I mentioned it to the documentation team in the ticket you attached: https://github.com/redpanda-data/documentation/issues/1000#issuecomment-1380099331.

partition_autobalancing_mode: "continuous"
cloud_storage_segment_max_upload_interval_sec: 3600 # 3600 sec = 1 hour
group_topic_partitions: 16
# cloud_storage_enable_remote_read: true # cluster wide configuration for read from remote cloud storage

Commented lines left in by mistake?

Contributor Author


No. @vuldin provided a lot of examples in the values.yaml. This is normal in Helm charts.

Contributor


No, commented lines are left in as documentation.

Contributor


Now that we are adding these items, what are the side effects for existing clusters and for upgrades from previous versions?

Contributor Author


This comment is pinned to line 566. It is a parameter available in Redpanda from version 22.11.x, as far as I remember, if you have a license (not enforced yet) and configure Tiered Storage: https://docs.redpanda.com/docs/platform/data-management/remote-read-replicas/#creating-a-topic-with-archival-storage-or-tiered-storage

@jcsp

jcsp commented Jan 9, 2023

Be aware that this will make helm stop working with pre-v22.3.x redpandas. I don't mind, but I don't know if we document anywhere which Redpanda versions the helm chart is meant to work with.

@joejulian
Contributor

Be aware that this will make helm stop working with pre-v22.3.x redpandas. I don't mind, but I don't know if we document anywhere which Redpanda versions the helm chart is meant to work with.

I assume you mean that these defaults will do that. The user can override any of these defaults.

@joejulian
Contributor

What's the source for these values? I think you said these came from some performance-tested configuration that's documented somewhere, didn't you?

@RafalKorepta
Contributor Author

What's the source for these values? I think you said these came from some performance-tested configuration that's documented somewhere, didn't you?

It is in our internal repo https://github.com/redpanda-data/team-kubernetes-internal/issues/3

@jcsp

jcsp commented Jan 12, 2023

I assume you mean that these defaults will do that. The user can override any of these defaults.

Right, it just means the defaults won't work with older clusters. It might not be super easy for the user to figure out why, they'd have to go look at redpanda logs to see errors like this, when their pods won't stay up:

3532208: ERROR 2023-01-12 10:19:36,031 [shard 0] main - application.cc:337 - Failure during startup: std::invalid_argument (Unknown property bloogle)

@RafalKorepta RafalKorepta force-pushed the rk/gh-priv-3/production-settings branch from 3f1d6ef to 206ac16 Compare January 12, 2023 12:48
@RafalKorepta RafalKorepta requested a review from jcsp January 12, 2023 12:48

@alejandroEsc alejandroEsc left a comment


Just a question before I approve: what defaults were used before we added these fields? Are they aligned? What are the side effects of adding these now? If there is nothing to worry about, I'm OK with approving.

partition_autobalancing_mode: "continuous"
cloud_storage_segment_max_upload_interval_sec: 3600 # 3600 sec = 1 hour
group_topic_partitions: 16
# cloud_storage_enable_remote_read: true # cluster wide configuration for read from remote cloud storage
Contributor


Now that we are adding these items, what are the side effects for existing clusters and for upgrades from previous versions?

@RafalKorepta
Contributor Author

What defaults were used before we added these fields?

I will update the cover letter to address this comment with the appropriate link to the source code.

What are the side effects of adding these now?

You mean, what are the side effects of adding this if a cluster is already running?

It depends on the cluster version. Since the Helm chart doesn't have a validation webhook, I will probably use the (clunky) templating system to guard this.
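Such a templating guard could be sketched with Sprig's `semverCompare`, keyed off the image tag (a hypothetical fragment; the chart's real helpers and value paths may differ):

```yaml
{{- /* Hypothetical sketch: only emit 22.3.x-and-later tunables for new enough images. */}}
{{- $version := .Values.image.tag | trimPrefix "v" }}
{{- if semverCompare ">=22.3.0" $version }}
kafka_batch_max_bytes: 1048576
log_segment_size_min: 16777216
log_segment_size_max: 268435456
{{- end }}
```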

@alejandroEsc
Contributor

Redpanda configuration that is cluster-wide and can be changed at runtime.

| property | default | this PR | available from | source code link |
| --- | --- | --- | --- | --- |
| log_segment_size | 1 GiB | 128 MiB | at least 21.11.x | line 37 |
| log_segment_size_min | 1 MiB | 16 MiB | 22.3.x | line 47 |
| log_segment_size_max | null | 256 MiB | 22.3.x | line 56 |
| kafka_batch_max_bytes | 1 MiB | 1 MiB | 22.3.x | line 867 |
| topic_partitions_per_shard | 7000 | 1000 | 22.2.x | line 161 |
| compacted_log_segment_size | 256 MiB | 64 MiB | at least 21.11.x | line 74 |
| max_compacted_log_segment_size | 5 GiB | 512 MiB | at least 21.11.x | line 803 |
| kafka_connection_rate_limit | null | 1000 | 22.1.x | line 438 |
| group_topic_partitions | 16 | 16 | at least 21.11.x | line 507 |

REF: #203

Wow, thank you for that! You didn't need to do that, I just wanted to know whether these were known defaults, etc. But awesome!

@alejandroEsc
Contributor

I assume you mean that these defaults will do that. The user can override any of these defaults.

Right, it just means the defaults won't work with older clusters. It might not be super easy for the user to figure out why, they'd have to go look at redpanda logs to see errors like this, when their pods won't stay up:

3532208: ERROR 2023-01-12 10:19:36,031 [shard 0] main - application.cc:337 - Failure during startup: std::invalid_argument (Unknown property bloogle)

If this is the case, we can pre-empt these changes in a release note for our users. Is there an easy way for them to determine what their current defaults are, so they can take action pre-emptively against these changes? I am assuming, of course, that they are not upgrading their Redpanda images.

@RafalKorepta
Contributor Author

Is there an easy way for them to determine what their current defaults are so they can take actions pre-emptively against these changes?

Using the Admin API, they can see the values of the running cluster. There is even an `rpk cluster config export` command to do this.

@emaxerrno
Contributor

@RafalKorepta why not have all of the segment sizes match at 128 MB (compacted_log_segment_size)? @jcsp thoughts?

@alejandroEsc alejandroEsc self-requested a review January 13, 2023 01:41

@alejandroEsc alejandroEsc left a comment


Thank you for your responses!

Rafal Korepta added 2 commits January 13, 2023 08:51
Redpanda configuration that is cluster-wide and can be changed at runtime.

The properties that are not available in 22.2.x or 22.1.x are stripped based
on semver.
@RafalKorepta RafalKorepta force-pushed the rk/gh-priv-3/production-settings branch from 206ac16 to 28baf5e Compare January 13, 2023 07:52
@RafalKorepta
Contributor Author

Added a template that strips unknown properties, per the cover letter:

$ helm template redpanda charts/redpanda --set image.tag=v22.2.1
---
# Source: redpanda/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: redpanda
  namespace: "redpanda"
  labels:
    helm.sh/chart: redpanda-2.4.5
    app.kubernetes.io/name: redpanda
    app.kubernetes.io/instance: "redpanda"
    app.kubernetes.io/managed-by: "Helm"
    app.kubernetes.io/component: redpanda
data:
  bootstrap.yaml: |
    enable_sasl: false
    compacted_log_segment_size: 67108864
    group_topic_partitions: 16
    kafka_connection_rate_limit: 1000
    log_segment_size: 134217728
    max_compacted_log_segment_size: 536870912
    topic_partitions_per_shard: 1000
    storage_min_free_bytes: 1073741824
...

$ helm template redpanda charts/redpanda --set image.tag=v22.3.1
---
# Source: redpanda/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: redpanda
  namespace: "redpanda"
  labels:
    helm.sh/chart: redpanda-2.4.5
    app.kubernetes.io/name: redpanda
    app.kubernetes.io/instance: "redpanda"
    app.kubernetes.io/managed-by: "Helm"
    app.kubernetes.io/component: redpanda
data:
  bootstrap.yaml: |
    enable_sasl: false
    compacted_log_segment_size: 67108864
    group_topic_partitions: 16
    kafka_batch_max_bytes: 1048576
    kafka_connection_rate_limit: 1000
    log_segment_size: 134217728
    log_segment_size_max: 268435456
    log_segment_size_min: 16777216
    max_compacted_log_segment_size: 536870912
    topic_partitions_per_shard: 1000
    storage_min_free_bytes: 1073741824
...

$ helm template redpanda charts/redpanda --set image.tag=v22.1.1
---
# Source: redpanda/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: redpanda
  namespace: "redpanda"
  labels:
    helm.sh/chart: redpanda-2.4.5
    app.kubernetes.io/name: redpanda
    app.kubernetes.io/instance: "redpanda"
    app.kubernetes.io/managed-by: "Helm"
    app.kubernetes.io/component: redpanda
data:
  bootstrap.yaml: |
    enable_sasl: false
    compacted_log_segment_size: 67108864
    group_topic_partitions: 16
    kafka_connection_rate_limit: 1000
    log_segment_size: 134217728
    max_compacted_log_segment_size: 536870912

And I fixed a wrong invocation of the hasKey function.
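For context, Sprig's `hasKey` takes the dict first and the key name second, and returns whether the key is present. A correct invocation looks roughly like this (an illustrative fragment, not the chart's actual template code):

```yaml
{{- /* hasKey takes the dict first, then the key name. */}}
{{- if hasKey .Values.config.tunable "kafka_connections_max" }}
kafka_connections_max: {{ .Values.config.tunable.kafka_connections_max }}
{{- end }}
```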

@RafalKorepta RafalKorepta merged commit acec179 into redpanda-data:main Jan 13, 2023
RafalKorepta pushed a commit to redpanda-data/redpanda-operator that referenced this pull request Dec 3, 2024
…/gh-priv-3/production-settings

Add production tunable configuration
5 participants