
Add production tunable configuration #259

Merged

Conversation

Contributor

@RafalKorepta RafalKorepta commented Jan 5, 2023

Redpanda configuration that is cluster-wide and can be changed at runtime.

| property | default | this PR | available from | source code link |
| --- | --- | --- | --- | --- |
| log_segment_size | 1 GiB | 128 MiB | at least 21.11.x | line 37 |
| log_segment_size_min | 1 MiB | 16 MiB | 22.3.x | line 47 |
| log_segment_size_max | null | 256 MiB | 22.3.x | line 56 |
| kafka_batch_max_bytes | 1 MiB | 1 MiB | 22.3.x | line 867 |
| topic_partitions_per_shard | 7000 | 1000 | 22.2.x | line 161 |
| compacted_log_segment_size | 256 MiB | 64 MiB | at least 21.11.x | line 74 |
| max_compacted_log_segment_size | 5 GiB | 512 MiB | at least 21.11.x | line 803 |
| kafka_connection_rate_limit | null | 1000 | 22.1.x | line 438 |
| group_topic_partitions | 16 | 16 | at least 21.11.x | line 507 |

REF:
#203
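For reference, the proposed sizes converted to the byte values the chart uses, laid out as a values.yaml fragment (an illustrative sketch of the `config.tunable` section, not the chart's final layout):

```yaml
# Illustrative sketch only: tunables from the table above, with sizes in bytes.
config:
  tunable:
    log_segment_size: 134217728                # 128 MiB
    compacted_log_segment_size: 67108864       # 64 MiB
    max_compacted_log_segment_size: 536870912  # 512 MiB
    kafka_batch_max_bytes: 1048576             # 1 MiB
    topic_partitions_per_shard: 1000
    kafka_connection_rate_limit: 1000
    group_topic_partitions: 16
```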

@RafalKorepta RafalKorepta force-pushed the rk/gh-priv-3/production-settings branch from 4437678 to 3f1d6ef Compare January 9, 2023 10:54
topic_partitions_per_shard: 1000
compacted_log_segment_size: 67108864 # 64 mb
max_compacted_log_segment_size: 536870912 # 512 mb
kafka_connections_max: 15100

Setting a connection limit probably doesn't make sense if you don't know how large the machines you're installing onto are.

Contributor Author


Does core document the mapping between CPU and memory and the machine type?


Nope. In cloud SaaS contexts, the connection count is part of the definition of the product; we do not have a matrix of how many connections a given self-hosted machine type can handle. (With memory limits it's hard to define, because a system with lots of partitions can handle fewer clients, etc., so it depends on the workload.)

The default for self-hosted clusters today is not to apply a connection count limit.

max_compacted_log_segment_size: 536870912 # 512 mb
kafka_connections_max: 15100
kafka_connection_rate_limit: 1000
partition_autobalancing_mode: "continuous"

Continuous autobalancing requires a license, so it should not be on by default unless Helm can know whether it's installing with a license.

Contributor Author


Ok, thanks. I will see if there is any way to check that value and reject it if a license is not provided.

Contributor


You can just check that one of these is set:

license_key: ""
license_secret_ref: {}
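A guard along those lines could look like this in the chart's templates (a hypothetical sketch using Helm's `fail` function; the actual value paths and helper structure in the chart may differ):

```yaml
{{- /* Hypothetical guard: reject continuous autobalancing when no license is configured. */}}
{{- if eq (.Values.config.tunable.partition_autobalancing_mode | default "") "continuous" }}
{{- if not (or .Values.license_key .Values.license_secret_ref) }}
{{- fail "partition_autobalancing_mode=continuous requires a license (set license_key or license_secret_ref)" }}
{{- end }}
{{- end }}
```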

kafka_connections_max: 15100
kafka_connection_rate_limit: 1000
partition_autobalancing_mode: "continuous"
cloud_storage_segment_max_upload_interval_sec: 3600 # 3600 sec = 1 hour

The upload interval setting should be removed; it only makes sense in the context of some intended RPO.

Contributor Author


Does Redpanda have documentation on how an end user can determine a sane default? How do they discover the right value for their workload? What are the benefits of setting it low? How will it influence the Redpanda cluster?


@jcsp jcsp Jan 12, 2023


It's complicated (we do not provide official guidance for this at the moment; scale testing is ongoing). For things like this where there is no simple answer, I don't think the Helm chart is the place to try to address this for self-hosted systems: our lives will be simpler if Helm just inherits the Redpanda default (which is null), and thereby matches other self-hosted systems.

@@ -550,7 +550,22 @@ config:
# tm_sync_timeout_ms: 2000ms # Time to wait state catch up before rejecting a request
# tm_violation_recovery_policy: crash # Describes how to recover from an invariant violation happened on the transaction coordinator level
# transactional_id_expiration_ms: 10080min # Producer ids are expired once this time has elapsed after the last write with the given producer ID
tunable: {}
tunable:
log_segment_size: 134217728 # 128 mb

There's a docs ticket here https://github.com/redpanda-data/documentation/issues/1000 for describing how Cloud systems have a different default than self hosted systems.

If self hosted (helm) systems start using a different set of defaults too (which will probably diverge from Cloud eventually), then that should be flagged to the docs team as well.

Contributor Author


I mentioned it to the documentation team in the ticket you attached: https://github.com/redpanda-data/documentation/issues/1000#issuecomment-1380099331.

partition_autobalancing_mode: "continuous"
cloud_storage_segment_max_upload_interval_sec: 3600 # 3600 sec = 1 hour
group_topic_partitions: 16
# cloud_storage_enable_remote_read: true # cluster wide configuration for read from remote cloud storage

Commented lines left in by mistake?

Contributor Author


No. @vuldin provided a lot of examples in the values.yaml. This is normal in Helm charts.

Contributor


No, commented lines are left in as documentation.

Contributor


Now that we are adding these items, what are the side effects for existing clusters and for upgrades from previous versions?

Contributor Author


This comment is pinned to line 566. It is a parameter available in Redpanda from version 22.11.x, as far as I remember, if you have a license (not enforced yet) and configure Tiered Storage: https://docs.redpanda.com/docs/platform/data-management/remote-read-replicas/#creating-a-topic-with-archival-storage-or-tiered-storage

@jcsp

jcsp commented Jan 9, 2023

Be aware that this will make helm stop working with pre-v22.3.x redpandas. I don't mind, but I don't know if we document anywhere which Redpanda versions the helm chart is meant to work with.

@joejulian
Contributor

Be aware that this will make helm stop working with pre-v22.3.x redpandas. I don't mind, but I don't know if we document anywhere which Redpanda versions the helm chart is meant to work with.

I assume you mean that these defaults will do that. The user can override any of these defaults.

@joejulian
Contributor

What's the source for these values? I think you said these came from some performance-tested configuration that's documented somewhere, didn't you?

@RafalKorepta
Contributor Author

What's the source for these values? I think you said these came from some performance-tested configuration that's documented somewhere, didn't you?

It is in our internal repo https://github.com/redpanda-data/team-kubernetes-internal/issues/3

@jcsp

jcsp commented Jan 12, 2023

I assume you mean that these defaults will do that. The user can override any of these defaults.

Right, it just means the defaults won't work with older clusters. It might not be super easy for the user to figure out why, they'd have to go look at redpanda logs to see errors like this, when their pods won't stay up:

3532208: ERROR 2023-01-12 10:19:36,031 [shard 0] main - application.cc:337 - Failure during startup: std::invalid_argument (Unknown property bloogle)

@RafalKorepta RafalKorepta force-pushed the rk/gh-priv-3/production-settings branch from 3f1d6ef to 206ac16 Compare January 12, 2023 12:48
@RafalKorepta RafalKorepta requested a review from jcsp January 12, 2023 12:48

@alejandroEsc alejandroEsc left a comment


Just a question before I approve: what defaults were used before we added these fields? Are they aligned? What are the side effects of adding these now? If there is nothing to worry about, I'm OK with approving.

partition_autobalancing_mode: "continuous"
cloud_storage_segment_max_upload_interval_sec: 3600 # 3600 sec = 1 hour
group_topic_partitions: 16
# cloud_storage_enable_remote_read: true # cluster wide configuration for read from remote cloud storage
Contributor


Now that we are adding these items, what are the side effects for existing clusters and for upgrades from previous versions?

@RafalKorepta
Contributor Author

What defaults were used before we added these fields?

I will update the cover letter to address this comment with the appropriate link to the source code.

What are the side effects of adding these now?

You mean, what are the side effects of adding this if a cluster is already running?

It depends on the cluster version. Since the Helm chart doesn't have a validation webhook, I will probably use the (clunky) templating system to guard this.
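Such a templating guard could be sketched with Sprig's `semverCompare`, keyed off the image tag (a hypothetical fragment; the chart's real helpers and value paths may differ):

```yaml
{{- /* Hypothetical sketch: only emit 22.3.x-and-later tunables for new enough images. */}}
{{- $version := .Values.image.tag | trimPrefix "v" }}
{{- if semverCompare ">=22.3.0" $version }}
kafka_batch_max_bytes: 1048576
log_segment_size_min: 16777216
log_segment_size_max: 268435456
{{- end }}
```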

@alejandroEsc
Contributor

Redpanda configuration that is cluster-wide and can be changed at runtime.

| property | default | this PR | available from | source code link |
| --- | --- | --- | --- | --- |
| log_segment_size | 1 GiB | 128 MiB | at least 21.11.x | line 37 |
| log_segment_size_min | 1 MiB | 16 MiB | 22.3.x | line 47 |
| log_segment_size_max | null | 256 MiB | 22.3.x | line 56 |
| kafka_batch_max_bytes | 1 MiB | 1 MiB | 22.3.x | line 867 |
| topic_partitions_per_shard | 7000 | 1000 | 22.2.x | line 161 |
| compacted_log_segment_size | 256 MiB | 64 MiB | at least 21.11.x | line 74 |
| max_compacted_log_segment_size | 5 GiB | 512 MiB | at least 21.11.x | line 803 |
| kafka_connection_rate_limit | null | 1000 | 22.1.x | line 438 |
| group_topic_partitions | 16 | 16 | at least 21.11.x | line 507 |

REF: #203

Wow, thank you for that! You didn't need to do that, I just wanted to know whether these were known defaults, etc. But awesome!

@alejandroEsc
Contributor

I assume you mean that these defaults will do that. The user can override any of these defaults.

Right, it just means the defaults won't work with older clusters. It might not be super easy for the user to figure out why, they'd have to go look at redpanda logs to see errors like this, when their pods won't stay up:

3532208: ERROR 2023-01-12 10:19:36,031 [shard 0] main - application.cc:337 - Failure during startup: std::invalid_argument (Unknown property bloogle)

If this is the case, we can pre-empt these changes in a release note for our users. Is there an easy way for them to determine what their current defaults are, so they can take action pre-emptively against these changes? I am assuming, of course, that they are not upgrading their Redpanda images.

@RafalKorepta
Contributor Author

Is there an easy way for them to determine what their current defaults are so they can take actions pre-emptively against these changes?

Using the Admin API, they can see the values of the running cluster. There is even an `rpk cluster config export` command to do this.

@emaxerrno
Contributor

@RafalKorepta why not have all of the segment sizes match at 128 MB (compacted_log_segment_size)? @jcsp thoughts?

@alejandroEsc alejandroEsc self-requested a review January 13, 2023 01:41

@alejandroEsc alejandroEsc left a comment


Thank you for your responses!

Rafal Korepta added 2 commits January 13, 2023 08:51
Redpanda configuration that is cluster-wide and can be changed at runtime.

The properties that are not available in 22.2.x or 22.1.x are stripped based
on semver.
@RafalKorepta RafalKorepta force-pushed the rk/gh-priv-3/production-settings branch from 206ac16 to 28baf5e Compare January 13, 2023 07:52
@RafalKorepta
Contributor Author

Added a template that strips unknown properties, per the cover letter:

$ helm template redpanda charts/redpanda --set image.tag=v22.2.1
---
# Source: redpanda/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: redpanda
  namespace: "redpanda"
  labels:
    helm.sh/chart: redpanda-2.4.5
    app.kubernetes.io/name: redpanda
    app.kubernetes.io/instance: "redpanda"
    app.kubernetes.io/managed-by: "Helm"
    app.kubernetes.io/component: redpanda
data:
  bootstrap.yaml: |
    enable_sasl: false
    compacted_log_segment_size: 67108864
    group_topic_partitions: 16
    kafka_connection_rate_limit: 1000
    log_segment_size: 134217728
    max_compacted_log_segment_size: 536870912
    topic_partitions_per_shard: 1000
    storage_min_free_bytes: 1073741824
...

$ helm template redpanda charts/redpanda --set image.tag=v22.3.1
---
# Source: redpanda/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: redpanda
  namespace: "redpanda"
  labels:
    helm.sh/chart: redpanda-2.4.5
    app.kubernetes.io/name: redpanda
    app.kubernetes.io/instance: "redpanda"
    app.kubernetes.io/managed-by: "Helm"
    app.kubernetes.io/component: redpanda
data:
  bootstrap.yaml: |
    enable_sasl: false
    compacted_log_segment_size: 67108864
    group_topic_partitions: 16
    kafka_batch_max_bytes: 1048576
    kafka_connection_rate_limit: 1000
    log_segment_size: 134217728
    log_segment_size_max: 268435456
    log_segment_size_min: 16777216
    max_compacted_log_segment_size: 536870912
    topic_partitions_per_shard: 1000
    storage_min_free_bytes: 1073741824
...

$ helm template redpanda charts/redpanda --set image.tag=v22.1.1
---
# Source: redpanda/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: redpanda
  namespace: "redpanda"
  labels:
    helm.sh/chart: redpanda-2.4.5
    app.kubernetes.io/name: redpanda
    app.kubernetes.io/instance: "redpanda"
    app.kubernetes.io/managed-by: "Helm"
    app.kubernetes.io/component: redpanda
data:
  bootstrap.yaml: |
    enable_sasl: false
    compacted_log_segment_size: 67108864
    group_topic_partitions: 16
    kafka_connection_rate_limit: 1000
    log_segment_size: 134217728
    max_compacted_log_segment_size: 536870912

And I fixed a wrong invocation of the hasKey function.
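For context, Sprig's `hasKey` takes the dict first and the key name second, and returns whether the key is present. A correct invocation looks roughly like this (an illustrative fragment, not the chart's actual template code):

```yaml
{{- /* hasKey takes the dict first, then the key name. */}}
{{- if hasKey .Values.config.tunable "kafka_connections_max" }}
kafka_connections_max: {{ .Values.config.tunable.kafka_connections_max }}
{{- end }}
```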

@RafalKorepta RafalKorepta merged commit acec179 into redpanda-data:main Jan 13, 2023
RafalKorepta pushed a commit to redpanda-data/redpanda-operator that referenced this pull request Dec 3, 2024
…/gh-priv-3/production-settings

Add production tunable configuration
5 participants