
Scaling issue: When 100+ PVCs are created at the same time, the CSI external-provisioner experiences timeout toward the plugin #211

Closed
tomerna opened this issue Jan 9, 2019 · 5 comments

tomerna commented Jan 9, 2019

I noticed that when creating more than 100 PVCs at the same time, the CSI external-provisioner experiences timeouts toward the plugin, and eventually the whole operation times out. This ends up with pending pods that have no persistent volume attached.

{"log":"I0108 09:55:25.485371 1 controller.go:544] CreateVolumeRequest {Name:pvc-74c33958-132b-11e9-bdce-107d1a595b9b CapacityRange:required_bytes:1073741824000 VolumeCapabilities:[mount:\u003cfs_type:"ext4" \u003e access_mode:\u003cmode:SINGLE_NODE_WRITER \u003e ] Parameters:map[] Secrets:map[] VolumeContentSource:\u003cnil\u003e AccessibilityRequirements:\u003cnil\u003e XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}\n","stream":"stderr","time":"2019-01-08T09:55:25.48549066Z"}
{"log":"I0108 09:55:25.485464 1 controller.go:188] GRPC call: /csi.v1.Controller/CreateVolume\n","stream":"stderr","time":"2019-01-08T09:55:25.485508387Z"}
{"log":"I0108 09:55:25.485481 1 controller.go:189] GRPC request: {"capacity_range":{"required_bytes":1073741824000},"name":"pvc-74c33958-132b-11e9-bdce-107d1a595b9b","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}\n","stream":"stderr","time":"2019-01-08T09:55:25.48800868Z"}
{"log":"I0108 09:55:32.166606 1 controller.go:191] GRPC response: {}\n","stream":"stderr","time":"2019-01-08T09:55:32.167352526Z"}
{"log":"I0108 09:55:32.167294 1 controller.go:192] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n","stream":"stderr","time":"2019-01-08T09:55:32.167376498Z"}
{"log":"W0108 09:55:32.167313 1 controller.go:592] CreateVolume timeout: 10s has expired, operation will be retried\n","stream":"stderr","time":"2019-01-08T09:55:32.167383366Z"}

tomerna changed the title from "Scaling issue: When 100+ PVCs are created at the same time, the CSI external-provisioner hammers kube apiserver with requests and gets throttled" to "Scaling issue: When 100+ PVCs are created at the same time, the CSI external-provisioner experiences timeout toward the plugin" on Jan 14, 2019
jsafrane (Contributor) commented

I think this is expected behavior. It seems that your CSI driver cannot cope with the number of requests, and the provisioner times out waiting for a CreateVolume response. The provisioner should retry in a while (but see #154, which should be fixed).

Will your volumes get provisioned in the end? How long does it take? How many CreateVolume requests can your CSI driver handle per second? What version of your external-provisioner image are you using? Sharing the full logs would be useful.

You can configure the 10s timeout with the provisioner's --connection-timeout option.

msau42 (Collaborator) commented Jan 15, 2019

I remember trying to reconfigure the connection-timeout to more than a minute on the provisioner, but I ran into issues because that timeout parameter seems to also control the initial RPC socket retry timeout. I think we might need to split those two timeout values, or maybe do something like kubernetes-csi/driver-registrar#78.
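To illustrate the split described above (purely a sketch, not a proposal for concrete flag names and not the provisioner's code): one timeout would bound only the initial connection to the driver socket, while a separate, larger deadline would apply to each RPC such as CreateVolume. The constant names, values, and socket path below are assumptions.

```go
// Sketch only: two independent timeouts instead of a single
// connection-timeout that covers both concerns.
package main

import (
	"context"
	"log"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

const (
	connectTimeout = 1 * time.Minute // waiting for the driver socket to come up
	rpcTimeout     = 5 * time.Minute // per-call deadline for slow storage backends
)

// connect bounds only the initial connection attempt to the driver socket.
func connect(addr string) (*grpc.ClientConn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), connectTimeout)
	defer cancel()
	return grpc.DialContext(ctx, addr, grpc.WithInsecure(), grpc.WithBlock())
}

// createVolume bounds each RPC separately, so a long CreateVolume does not
// require stretching the connection timeout as well.
func createVolume(client csi.ControllerClient, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	ctx, cancel := context.WithTimeout(context.Background(), rpcTimeout)
	defer cancel()
	return client.CreateVolume(ctx, req)
}

func main() {
	conn, err := connect("unix:///var/lib/csi/csi.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	_ = csi.NewControllerClient(conn)
}
```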

jsafrane (Contributor) commented

With the rebase to external-provisioner-lib 3.0 and recent changes in external-provisioner, the retry mechanism is different. Details will be documented in the README, which is being updated in #249.

We now have exponential backoff with fast retries in the beginning, slowing down quickly, and with no explicit limit on the number of attempts. The Kubernetes API server and/or the storage subsystem may be busy and return errors, but the controller should recover and provision everything as fast as the API server and the storage are capable of. There are some new command-line arguments to tune things for slower storage.
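As a rough illustration of that retry shape (not the actual external-provisioner implementation; the intervals below are assumptions): each failed provisioning attempt is retried after an exponentially growing delay, starting fast and capped at a maximum, with no fixed limit on the number of attempts.

```go
// Illustrative only: exponential backoff with fast initial retries,
// a capped delay, and no fixed number of attempts.
package main

import (
	"errors"
	"log"
	"time"
)

func provisionWithBackoff(provision func() error) {
	delay := 250 * time.Millisecond // fast retries at the beginning
	const maxDelay = 5 * time.Minute

	for {
		if err := provision(); err == nil {
			return
		} else {
			log.Printf("provisioning failed, retrying in %v: %v", delay, err)
		}
		time.Sleep(delay)
		// Double the delay each time, but never exceed the cap.
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
}

func main() {
	attempts := 0
	provisionWithBackoff(func() error {
		// Stand-in for a CreateVolume call that times out while the
		// storage backend or the API server is overloaded.
		if attempts++; attempts < 4 {
			return errors.New("context deadline exceeded")
		}
		return nil
	})
}
```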

These PRs brought most of the fixes:

/close

k8s-ci-robot commented

@jsafrane: Closing this issue.


jsafrane (Contributor) commented

To sum it up, I tested with 200 volumes on AWS.

  • The old provisioner (1.0.1) was 2x slower at provisioning than the new one, mainly because we now have 100 workers instead of 4 (i.e. 100 AWS calls are done in parallel instead of 4; a generic worker-pool sketch follows below). The new one could be even faster if it were not throttled by the API server.

  • The old provisioner is 2x faster at deletion than the new one, because the API server does not like the rate of PV updates/deletes and throttles the new provisioner heavily (it is still recovering from the provisioning load).

Overall, the new provisioner is still a bit faster than the old one, because provisioning is slower than deletion and its speedup is thus more significant. The major benefit, however, is that its robustness has improved a lot and it is more configurable for various storage backends.
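For readers wondering what the jump from 4 to 100 workers means in practice, here is a generic worker-pool sketch (not the provisioner's actual implementation): N goroutines drain a shared queue of claims, so up to N backend calls are in flight at the same time.

```go
// Generic sketch of N provisioning workers pulling claims from a queue.
// With workers = 100, up to 100 backend calls run in parallel.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const workers = 100 // the new worker count discussed above; the old code used 4

	claims := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for claim := range claims {
				// Stand-in for a CreateVolume / cloud API call.
				time.Sleep(100 * time.Millisecond)
				fmt.Println("provisioned", claim)
			}
		}()
	}

	for i := 0; i < 200; i++ { // the 200-volume test described above
		claims <- fmt.Sprintf("pvc-%03d", i)
	}
	close(claims)
	wg.Wait()
}
```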
