Scaling issue: When 100+ PVCs are created at the same time, the CSI external-provisioner experiences timeouts toward the plugin #211
I think this is expected behavior. It seems that your CSI driver does not cope with the number of requests, and the provisioner times out waiting for the CreateVolume response. The provisioner should retry in a while (but see #154, that one should be fixed). Will your volumes get provisioned in the end? How long does it take? How many CreateVolume requests can your CSI driver handle per second? What version of your external-provisioner image are you running? Sharing full logs would be useful. You can configure the 10s timeout with …
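For reference, here is a minimal sketch of what such a timed-out call looks like on the provisioner's gRPC side, assuming the CSI v1 Go bindings; the socket path is an assumption for the example, and the 10s deadline mirrors the timeout visible in the logs below:

```go
package main

import (
	"context"
	"log"
	"time"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func main() {
	// Assumption: the CSI driver listens on this UNIX socket.
	conn, err := grpc.Dial("unix:///csi/csi.sock", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	client := csi.NewControllerClient(conn)

	// Every CreateVolume call is bounded by a context deadline;
	// a driver that answers too slowly surfaces as DeadlineExceeded.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	_, err = client.CreateVolume(ctx, &csi.CreateVolumeRequest{
		Name:          "pvc-74c33958-132b-11e9-bdce-107d1a595b9b",
		CapacityRange: &csi.CapacityRange{RequiredBytes: 1073741824000},
	})
	if status.Code(err) == codes.DeadlineExceeded {
		// The driver did not respond in time; the call is abandoned
		// and the provisioner retries the volume later.
		log.Println("CreateVolume timed out, will be retried")
	}
}
```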
I remember trying to reconfigure the …
With the rebase to external-provisioner-lib 3.0 and recent changes in external-provisioner, the retry mechanism is different. Details will be documented in the README that's being updated in #249. We now have exponential backoff with fast retries in the beginning, slowing down quickly, and with no explicit limit on the number of attempts. The Kubernetes API server and/or the storage subsystem may be busy and return errors, but the controller should recover and provision everything as fast as the API server and the storage are capable. There are some new cmdline arguments to tune things for slower storage. These PRs brought most of the fixes:

/close
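As an illustration of that retry behavior, here is a minimal sketch using client-go's workqueue exponential rate limiter; the 1s base and 5m maximum delays are assumptions for the example, not the provisioner's actual defaults:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Exponential backoff: retries start fast and slow down quickly,
	// with no fixed cap on the number of attempts (only on the delay).
	limiter := workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 5*time.Minute)

	item := "pvc-74c33958-132b-11e9-bdce-107d1a595b9b"
	for attempt := 1; attempt <= 6; attempt++ {
		// When() doubles the delay for the same item on each failure:
		// 1s, 2s, 4s, 8s, 16s, 32s, ... capped at the 5m maximum.
		fmt.Printf("attempt %d: retry after %v\n", attempt, limiter.When(item))
	}
	// limiter.Forget(item) would reset the backoff once the item succeeds.
}
```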
@jsafrane: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
To sum it up, I tested with 200 volumes on AWS.
The new provisioner is still a bit faster than the old one; because provisioning is slower than deletion, its speedup is thus more significant. However, the major benefit is that its robustness has improved a lot, and it's also more configurable for various storage backends.
I noticed that when creating more than 100 PVCs at the same time, the CSI external-provisioner experiences timeouts toward the plugin and eventually times out the entire operation, which ends up with pending pods and no persistent volumes attached.
{"log":"I0108 09:55:25.485371 1 controller.go:544] CreateVolumeRequest {Name:pvc-74c33958-132b-11e9-bdce-107d1a595b9b CapacityRange:required_bytes:1073741824000 VolumeCapabilities:[mount:\u003cfs_type:"ext4" \u003e access_mode:\u003cmode:SINGLE_NODE_WRITER \u003e ] Parameters:map[] Secrets:map[] VolumeContentSource:\u003cnil\u003e AccessibilityRequirements:\u003cnil\u003e XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}\n","stream":"stderr","time":"2019-01-08T09:55:25.48549066Z"}
{"log":"I0108 09:55:25.485464 1 controller.go:188] GRPC call: /csi.v1.Controller/CreateVolume\n","stream":"stderr","time":"2019-01-08T09:55:25.485508387Z"}
{"log":"I0108 09:55:25.485481 1 controller.go:189] GRPC request: {"capacity_range":{"required_bytes":1073741824000},"name":"pvc-74c33958-132b-11e9-bdce-107d1a595b9b","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}\n","stream":"stderr","time":"2019-01-08T09:55:25.48800868Z"}
{"log":"I0108 09:55:32.166606 1 controller.go:191] GRPC response: {}\n","stream":"stderr","time":"2019-01-08T09:55:32.167352526Z"}
{"log":"I0108 09:55:32.167294 1 controller.go:192] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n","stream":"stderr","time":"2019-01-08T09:55:32.167376498Z"}
{"log":"W0108 09:55:32.167313 1 controller.go:592] CreateVolume timeout: 10s has expired, operation will be retried\n","stream":"stderr","time":"2019-01-08T09:55:32.167383366Z"}