Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ETCD-514: Add etcd size tuning #1549

Merged
merged 1 commit into from
May 14, 2024

Conversation

dusk125
Copy link
Contributor

@dusk125 dusk125 commented Jan 24, 2024

This enhancement describes allowing customers to tweak etcd backend quota in a controlled manor.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 24, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Jan 24, 2024

@dusk125: This pull request references ETCD-514 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

This enhancement describes allowing customers to tweak etcd backend quota in a controlled manor.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@dusk125
Copy link
Contributor Author

dusk125 commented Jan 24, 2024

ptal
@hasbro17 @tjungblu @Elbehery @williamcaban @soltysh

enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
Copy link
Contributor

@hasbro17 hasbro17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gets the general idea across but need to mention some specifics on the API. This should streamline the actual API PR if we get those details outlined here first.

enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
Copy link
Contributor

@dofinn dofinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some notes from SRE perspective :)

enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
@dhellmann
Copy link
Contributor

#1555 is changing the enhancement template in a way that will cause the header check in the linter job to fail for existing PRs. If this PR is merged within the development period for 4.16 you may override the linter if the only failures are caused by issues with the headers (please make sure the markdown formatting is correct). If this PR is not merged before 4.16 development closes, please update the enhancement to conform to the new template.

@dusk125 dusk125 force-pushed the etcd-sizing branch 3 times, most recently from 470df4c to 8196cb6 Compare March 13, 2024 15:33
@dusk125 dusk125 force-pushed the etcd-sizing branch 2 times, most recently from 30c0759 to e49978e Compare April 9, 2024 15:52
@dusk125
Copy link
Contributor Author

dusk125 commented Apr 11, 2024

Update for those following along at home:

Perfscale ran some tests and found that 32GiB seems to be a maximum; between 34GiB and 40GiB began to cause problems. The PR has been updated to have 32GiB as the enforced maximum.

There is a known issue in the API Server that can cause it to fall into a boot loop on restart due to watchers all trying to request large amounts of data all at the same time. This issue could be more likely to occur by larger etcd databases. The API Server team is working on addressing this issue, and this issue should be a GA blocker for this feature: the PR has been updated to include this as a requirement prior to GA.

// +kubebuilder:default=8
// +kubebuilder:validation:Maximum=32
// +kubebuilder:validation:XValidation:rule="self>=oldSelf",message="can't decrease database size"
// +openshift:enable:FeatureGates=EtcdBackendQuota
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// +openshift:enable:FeatureGates=EtcdBackendQuota
// +openshift:enable:FeatureGate=EtcdBackendQuota

see example https://github.com/openshift/api/blob/3efee42522172bc5e3c04010968ca06092f15c5b/example/v1/types_stable.go#L40-L42

Copy link
Contributor

@Elbehery Elbehery Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tjungblu wdyt about the naming (i.e. FeatureGate=EtcdBackendQuota)

or shall we keep it BackendQuotaGiB as the field name ?

cc @soltysh @deads2k @hasbro17

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if there's a naming convention guideline but I think the FeatureGate is essentially just the feature name.
So I would vote to keep the name EtcdBackendQuota instead of BackendQuotaGiB (since the latter is more specific yet less indicative of what the feature is).

@dusk125 dusk125 force-pushed the etcd-sizing branch 2 times, most recently from 99c0afa to 4339685 Compare April 17, 2024 17:32
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
@Elbehery
Copy link
Contributor

@hasbro17 @tjungblu @soltysh @deads2k @JoelSpeed

Any more reviews, or shall we merge ?

@hasbro17
Copy link
Contributor

/approve

The enhancement looks good to me for a tech-preview implementation (sans any more API specifics which we can loop back in from openshift/api#1736).

Seems like we've explicitly mentioned the GA requirements for more performance testing, validating the API server's ability to handle large requests at the higher limit, investigating the OOM cycle scenario and documenting the recovery procedure for that.

I'll defer the LGTM to @soltysh or @deads2k in case there are more scenarios we need to callout here.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 25, 2024
Copy link

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments, overall lgtm

enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
* For a larger database, defragmentation of etcd will take longer which will increase the amount of time that etcd is unavaiable for writes.
Etcd will still be available for reads during defragmentation, so there should be little impact on the apiserver's availability.
* Downgrades may be impacted if:
1. The downgrade is to a version without this feature.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth considering a way to resolve this problem in the future, not a blocker atm.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One potential approach would be to add a check that this is not set in a version prior to when this feature goes GA. To prevent that kind of downgrade, or even block it.

Copy link

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nits that would be nice to get resolved and it's good to go.

enhancements/etcd/etcd-size-tuning.md Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
enhancements/etcd/etcd-size-tuning.md Outdated Show resolved Hide resolved
Copy link

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 14, 2024
Copy link
Contributor

openshift-ci bot commented May 14, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hasbro17, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

openshift-ci bot commented May 14, 2024

@dusk125: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 298bc48 into openshift:master May 14, 2024
2 checks passed
@dusk125 dusk125 deleted the etcd-sizing branch May 14, 2024 15:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants