Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update sidecar timeout values #1824

Merged
merged 1 commit into from
Nov 10, 2023

Conversation

torredil
Copy link
Member

@torredil torredil commented Nov 1, 2023

What is this PR about? / Why do we need it?

  • This PR changes the default timeout value (15s) for the attacher sidecar, responsible for exercising ControllerPublishVolume / ControllerUnpublishVolume. The default value of 15s used today is not a sensible default, as a result the following error is observed frequently:
E1101 14:21:54.345145       1 driver.go:124] "GRPC error" err="rpc error: code = Internal desc = Could not detach volume \"vol-07ec886465a316b9b\" from node \"i-0e137107f5012fb1f\": context deadline exceeded"

In the vast majority of cases the volume is successfully detached just mere seconds after the attacher times out.

closes #1671

For context:

--timeout <duration>: Timeout of all calls to CSI driver. It should be set to value that accommodates majority of ControllerPublish and ControllerUnpublish calls. See [CSI error and timeout handling](https://github.com/kubernetes-csi/external-attacher#csi-error-and-timeout-handling) for details. 15 seconds is used by default.

What testing is done?

  • Manual testing
  • CI

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 1, 2023
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Nov 1, 2023
@torredil torredil changed the title Update attacher/provisioner RPC call timeout Update csi-attacher timeout value Nov 1, 2023
@ConnorJC3 ConnorJC3 force-pushed the master branch 2 times, most recently from 24a8e7b to bddbe0b Compare November 1, 2023 18:08
Copy link
Contributor

@AndrewSirenko AndrewSirenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
Left non-blocking comment.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 1, 2023
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 1, 2023
@AndrewSirenko
Copy link
Contributor

/retest
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 1, 2023
Copy link
Contributor

@wmesard wmesard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When an operation does take a long time, how will the cx change as a result of this PR? Fewer annoying error messages in the log, sure. But what else? Will it reduce the number of AWS API calls? Increase it? Will it reduce the length of time that it takes for K8s to notice that the volume attachment state has changed? Increase it?

@ConnorJC3
Copy link
Contributor

How will this behave if the user is already passing --timeout via additionalArgs? Do we need a guard against that?

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 9, 2023
@torredil torredil changed the title Update csi-attacher timeout value Update sidecar timeout values Nov 9, 2023
@AndrewSirenko
Copy link
Contributor

/lgtm
as two way door

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 10, 2023
Copy link
Contributor

@ConnorJC3 ConnorJC3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing snapshotter?

Signed-off-by: Eddie Torres <[email protected]>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 10, 2023
@torredil
Copy link
Member Author

Missing snapshotter?

The snapshotter sidecar already uses a timeout value of 60s by default.

@ConnorJC3
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 10, 2023
@torredil
Copy link
Member Author

The primary goal of adjusting the timeout values in this PR is to prevent premature context cancellations for CSI operations. With longer timeout values, Kubernetes (more specifically the sidecars) will wait a longer duration (60s) before retrying cancelled operations - this does not inherently change the speed at which (as an example) the volume attachment state changes - it simply provides the respective EC2 API call more time to complete before the sidecar retries.

In terms of operational performance, the increased timeout values would result in delays for operations that are genuinely stuck (which would likely be an indicator of a bug or inefficiency in the driver code. In this regard, increasing the timeout values improves the resiliency of our code because bugs can hide underneath the current set of values). However, in all other cases the operational performance is improved as the state is updated sooner because there is no need to wait for a retry.

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: torredil

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 10, 2023
@k8s-ci-robot k8s-ci-robot merged commit 479f6e8 into kubernetes-sigs:master Nov 10, 2023
jsafrane added a commit to jsafrane/csi-operator that referenced this pull request Nov 21, 2023
Update the operator to use the same sidecar arguments (timeouts, QPS,
worker threads) as upstream.

See kubernetes-sigs/aws-ebs-csi-driver#1824 and
kubernetes-sigs/aws-ebs-csi-driver#1824.
jsafrane added a commit to jsafrane/csi-operator that referenced this pull request Nov 21, 2023
Update the operator to use the same sidecar arguments (timeouts, QPS,
worker threads) as upstream.

See kubernetes-sigs/aws-ebs-csi-driver#1824 and
kubernetes-sigs/aws-ebs-csi-driver#1824.
jsafrane added a commit to jsafrane/csi-operator that referenced this pull request Nov 21, 2023
Update the operator to use the same sidecar arguments (timeouts, QPS,
worker threads) as upstream.

See kubernetes-sigs/aws-ebs-csi-driver#1824 and
kubernetes-sigs/aws-ebs-csi-driver#1824.
jsafrane added a commit to jsafrane/csi-operator that referenced this pull request Nov 21, 2023
Update the operator to use the same sidecar arguments (timeouts, QPS,
worker threads) as upstream.

See kubernetes-sigs/aws-ebs-csi-driver#1824 and
kubernetes-sigs/aws-ebs-csi-driver#1824.
jsafrane added a commit to jsafrane/csi-operator that referenced this pull request Nov 21, 2023
Update the operator to use the same sidecar arguments (timeouts, QPS,
worker threads) as upstream.

See kubernetes-sigs/aws-ebs-csi-driver#1824 and
kubernetes-sigs/aws-ebs-csi-driver#1824.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CSI RPC call timeouts are non-optimal
6 participants