ETCD-514: Add etcd size tuning #1549
@dusk125: This pull request references ETCD-514 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.
Gets the general idea across but need to mention some specifics on the API. This should streamline the actual API PR if we get those details outlined here first.
Left some notes from SRE perspective :)
#1555 is changing the enhancement template in a way that will cause the header check in the linter job to fail for existing PRs. If this PR is merged within the development period for 4.16, you may override the linter if the only failures are caused by issues with the headers (please make sure the markdown formatting is correct). If this PR is not merged before 4.16 development closes, please update the enhancement to conform to the new template.
470df4c to 8196cb6
30c0759 to e49978e
Update for those following along at home: Perfscale ran some tests and found that 32GiB seems to be the practical maximum; databases between 34GiB and 40GiB began to cause problems. The PR has been updated to enforce 32GiB as the maximum. There is a known issue in the API Server that can cause it to fall into a boot loop on restart because the watchers all request large amounts of data at the same time. Larger etcd databases could make this issue more likely. The API Server team is working on addressing this issue, and it should be a GA blocker for this feature; the PR has been updated to include this as a requirement prior to GA.
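For a sense of scale, here is a minimal Go sketch of what the 8GiB default and the 32GiB maximum look like in bytes. It assumes the GiB value ends up in etcd's `--quota-backend-bytes` flag; the helper names `quotaBackendBytes` and `quotaFlag` are hypothetical and not taken from this PR.

```go
package main

import "fmt"

// bytesPerGiB is the number of bytes in one gibibyte (2^30).
const bytesPerGiB int64 = 1 << 30

// quotaBackendBytes converts a quota expressed in GiB to bytes.
// Hypothetical helper for illustration only.
func quotaBackendBytes(quotaGiB int32) int64 {
	return int64(quotaGiB) * bytesPerGiB
}

// quotaFlag renders the etcd command-line flag for the given quota.
// Assumption: the operator passes the quota through --quota-backend-bytes.
func quotaFlag(quotaGiB int32) string {
	return fmt.Sprintf("--quota-backend-bytes=%d", quotaBackendBytes(quotaGiB))
}
```

Under these assumptions, `quotaFlag(32)` renders `--quota-backend-bytes=34359738368`, and the 8GiB default corresponds to 8589934592 bytes.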
// +kubebuilder:default=8
// +kubebuilder:validation:Maximum=32
// +kubebuilder:validation:XValidation:rule="self>=oldSelf",message="can't decrease database size"
// +openshift:enable:FeatureGates=EtcdBackendQuota
Suggested change:
- // +openshift:enable:FeatureGates=EtcdBackendQuota
+ // +openshift:enable:FeatureGate=EtcdBackendQuota
Not sure if there's a naming convention guideline, but I think the FeatureGate is essentially just the feature name. So I would vote to keep the name EtcdBackendQuota instead of BackendQuotaGiB (since the latter is more specific yet less indicative of what the feature is).
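As a reading aid for the markers quoted in this thread (`+kubebuilder:default=8`, `+kubebuilder:validation:Maximum=32`, and the `self>=oldSelf` rule), the sketch below restates them as plain Go checks. This is not how the API server evaluates them; the function names are hypothetical, and treating `0` as "unset" is an assumption made only for this example.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	defaultQuotaGiB int32 = 8  // mirrors +kubebuilder:default=8
	maxQuotaGiB     int32 = 32 // mirrors +kubebuilder:validation:Maximum=32
)

// effectiveQuotaGiB applies the default when the field is unset.
// Assumption: 0 stands in for "unset" in this sketch.
func effectiveQuotaGiB(v int32) int32 {
	if v == 0 {
		return defaultQuotaGiB
	}
	return v
}

// validateQuotaUpdate mirrors the two validation markers on an update:
// the new value may not exceed the maximum and may not be lower than
// the previous value (rule "self>=oldSelf").
func validateQuotaUpdate(oldGiB, newGiB int32) error {
	if newGiB > maxQuotaGiB {
		return fmt.Errorf("quota %dGiB exceeds the maximum of %dGiB", newGiB, maxQuotaGiB)
	}
	if newGiB < oldGiB {
		return errors.New("can't decrease database size")
	}
	return nil
}
```

With these checks, raising the quota from 8 to 16 passes, while lowering it from 16 to 8 or setting it to 33 is rejected.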
99c0afa to 4339685
@hasbro17 @tjungblu @soltysh @deads2k @JoelSpeed Any more reviews, or shall we merge?
/approve
The enhancement looks good to me for a tech-preview implementation (sans any more API specifics, which we can loop back in from openshift/api#1736). Seems like we've explicitly mentioned the GA requirements: more performance testing, validating the API server's ability to handle large requests at the higher limit, investigating the OOM cycle scenario, and documenting the recovery procedure for that. I'll defer the LGTM to @soltysh or @deads2k in case there are more scenarios we need to call out here.
minor comments, overall lgtm
* For a larger database, defragmentation of etcd will take longer, which will increase the amount of time that etcd is unavailable for writes.
  Etcd will still be available for reads during defragmentation, so there should be little impact on the apiserver's availability.
* Downgrades may be impacted if:
  1. The downgrade is to a version without this feature.
Probably worth considering a way to resolve this problem in the future, not a blocker atm.
One potential approach would be to add a check that this is not set in a version prior to when this feature goes GA, in order to prevent that kind of downgrade, or even block it.
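A rough Go sketch of the kind of check suggested here, purely for illustration. The function `checkDowngrade` and its `targetSupportsQuota` parameter are hypothetical; how the target version's support would actually be determined is left open, and treating anything at or below the 8GiB default as "not set" is an assumption.

```go
package main

import "fmt"

// defaultQuotaGiB mirrors the default discussed in this PR.
const defaultQuotaGiB int32 = 8

// checkDowngrade returns an error when a downgrade should be blocked:
// the quota has been raised above the default and the target version
// does not know about the feature. Hypothetical helper, not from the PR.
func checkDowngrade(quotaGiB int32, targetSupportsQuota bool) error {
	if targetSupportsQuota || quotaGiB <= defaultQuotaGiB {
		return nil
	}
	return fmt.Errorf("downgrade blocked: backend quota is %dGiB but the target version does not support quota tuning", quotaGiB)
}
```

Whether such a check should warn or hard-block is a design choice the thread leaves open; the sketch blocks only because that is the simpler case to show.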
Minor nits that would be nice to get resolved and it's good to go.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: hasbro17, soltysh
@dusk125: all tests passed!
This enhancement describes allowing customers to tweak the etcd backend quota in a controlled manner.