Remove createSnapshotContentRetryCount and createSnapshotContentInterval #211
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: xing-yang. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/assign @yuxiangqian
@@ -42,6 +42,12 @@ import (
	"k8s.io/kubernetes/pkg/util/goroutinemap"
)

// Number of retries when we create a snapshot content object for a snapshot.
const createSnapshotContentRetryCount = 5
My higher-level question is: why do we need to retry within the workqueue func? Can we just re-add the snapshot to the queue whenever there's a failure, and let the workqueue properly throttle the retries?
I thought this was the recommended practice for API updates? The PV controller also retries when a PV object fails to be created. I remember when I first started working on snapshots in the external-storage repo, I was asked to add this type of retry as a bug fix.
I believe the PV controller does it because of the possibility of leaking the volume. I don't think we have the same issue here, since we create the content object first before calling CSI. Cc @jsafrane
I don't think we have the same issue here since we are creating the content object first before calling CSI
+1, in PV controller, a volume is already created on storage backend when the controller saves a PV. And it really needs to save it, otherwise the volume may be leaked.
In snapshots, nothing is leaked if SnapshotContent is not created, right?
Yes, in the beta version we changed the behavior to create the VolumeSnapshotContent first, before creating the physical snapshot resource.
We also retry when updating the snapshot fails. In general, is it recommended practice to retry when updating an API object fails?
The general pattern is to requeue it on any failure and let workqueue schedule the retry.
Removed the RetryCount.
if err != nil {
	klog.V(4).Infof("failed to update snapshot %s status: %v", utils.SnapshotKey(snapshot), err)
	// update snapshot status failed
	ctrl.updateSnapshotErrorStatusWithEvent(snapshot, v1.EventTypeWarning, "SnapshotStatusUpdateFailed", fmt.Sprintf("Snapshot status update failed, %v", err))
	return err
Looking at the caller, I see that we don't requeue the object and instead rely on the resync period to retry. I think we should change the behavior to requeue, but we should also make sure the workqueue has proper throttling/backoff as well.
Use queue.AddRateLimited(key)?
Can you take a look at this? 10017c2
Yes, I like this direction. Also cc @yuxiangqian and @jsafrane for more eyes. This will cause snapshotter to potentially retry more frequently than it has in the past, although it should be rate limited.
I wonder if we should delay this change. It changes snapshot controller behavior a lot. In the past we didn't retry when creating a snapshot fails, because snapshots are time-sensitive (other than a few retries when creating/updating API objects).
Can we revisit this after the 2.0 release? We also have a pending fix for the create-snapshot timeout issue that we should come back to after 2.0.
A little concerned, from the application's perspective, about removing the retry logic for status updates and snapshot content creation. Do we have a rough idea how often a retry would resolve API server failures? An application could end up relying on the existence of a snapshot's status field to decide whether or not to unquiesce. With a rate-limited requeue, couldn't the waiting time be significantly longer?
Removed snapshotQueue.AddRateLimited().
What is the conclusion here? Should I just add back createSnapshotContentRetryCount as an internal const, or is the current PR okay? @msau42
I don't like having retries within retries. It makes timing hard to reason about (metrics look strange), and we can end up overloading the API server since there are two different throttling behaviors that need to be controlled. We previously had issues in external-provisioner due to it having multiple retries within retries, and I'd like to avoid that here too. We can tune the rate limiting as necessary.
But I agree this is a large, fundamental change to how the snapshot controller has operated, so I'm ok deferring it to the next release.
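As a back-of-the-envelope illustration of the load concern (the numbers here are ours, not from this PR): nesting an inner loop of N attempts inside workqueue-driven retries multiplies the worst-case API calls by N compared to a single level of retry.

```go
package main

import "fmt"

// apiCalls returns the worst-case number of API requests when each
// workqueue invocation of the sync function performs innerRetries
// attempts internally before giving up and being requeued.
func apiCalls(workqueueInvocations, innerRetries int) int {
	return workqueueInvocations * innerRetries
}

func main() {
	fmt.Println(apiCalls(10, 1)) // single-level retry: 10 calls
	fmt.Println(apiCalls(10, 5)) // inner retry loop of 5: 50 calls
}
```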
Yes, let's add the retry count back for now so the only change is not making the retry count configurable.
Sure.
@@ -365,8 +362,10 @@ func (ctrl *csiSnapshotCommonController) updateSnapshot(snapshot *crdv1.VolumeSn
	klog.V(3).Infof("could not sync claim %q: %+v", utils.SnapshotKey(snapshot), err)
We don't want to return err in this case either?
/lgtm
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
Removes createSnapshotContentRetryCount and createSnapshotContentInterval
from command line options.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
It was brought up during review of the README PR (#206) that these parameters should just be internal variables, not command line options.
Does this PR introduce a user-facing change?: