Update sidecar timeout values #1824
/lgtm
Left non-blocking comment.
/retest
When an operation does take a long time, how will the cx change as a result of this PR? Fewer annoying error messages in the log, sure. But what else? Will it reduce the number of AWS API calls? Increase it? Will it reduce the length of time that it takes for K8s to notice that the volume attachment state has changed? Increase it?
How will this behave if the user is already passing |
/lgtm
Missing snapshotter?
Signed-off-by: Eddie Torres <[email protected]>
The snapshotter sidecar already uses a timeout value of 60s by default.
/lgtm
The primary goal of adjusting the timeout values in this PR is to prevent premature context cancellations for CSI operations. With longer timeout values, Kubernetes (more specifically, the sidecars) will wait longer (60s) before retrying cancelled operations. This does not inherently change the speed at which, for example, the volume attachment state changes; it simply gives the respective EC2 API call more time to complete before the sidecar retries.
In terms of operational performance, the increased timeout values would delay operations that are genuinely stuck, which would likely indicate a bug or inefficiency in the driver code. In this regard, increasing the timeout values improves the resiliency of our code, because bugs can hide underneath the current set of values. In all other cases, operational performance improves: the state is updated sooner because there is no need to wait for a retry.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: torredil
Update the operator to use the same sidecar arguments (timeouts, QPS, worker threads) as upstream. See kubernetes-sigs/aws-ebs-csi-driver#1824 and kubernetes-sigs/aws-ebs-csi-driver#1824.
What is this PR about? / Why do we need it?
ControllerPublishVolume/ControllerUnpublishVolume. The default value of 15s used today is not a sensible default; as a result, the following error is observed frequently:
In the vast majority of cases the volume is successfully detached mere seconds after the attacher times out.
closes #1671
For context:
What testing is done?