Remove createSnapshotContentRetryCount and createSnapshotContentInterval #211
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: xing-yang. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/assign @yuxiangqian
@@ -42,6 +42,12 @@ import (
	"k8s.io/kubernetes/pkg/util/goroutinemap"
)

// Number of retries when we create a snapshot content object for a snapshot.
const createSnapshotContentRetryCount = 5
My higher-level question is: why do we need to retry within the workqueue func? Can we just re-add the snapshot to the queue whenever there's a failure, and let the workqueue properly throttle the retries?
I thought this was the recommended practice for API updates? The PV controller also retries when a PV object fails to be created. I remember when I first started working on snapshots in the external-storage repo, I was asked to add this type of retry as a bug fix.
I believe the PV controller does it because of the possibility of leaking the volume. I don't think we have the same issue here, since we create the content object first before calling CSI. Cc @jsafrane
I don't think we have the same issue here since we are creating the content object first before calling CSI
+1, in PV controller, a volume is already created on storage backend when the controller saves a PV. And it really needs to save it, otherwise the volume may be leaked.
In snapshots, nothing is leaked if SnapshotContent is not created, right?
Yes, in the beta version we changed the behavior to create the VolumeSnapshotContent first, before creating the physical snapshot resource.
We also retry when updating the snapshot fails. In general, is it recommended practice to retry when updating an API object fails?
The general pattern is to requeue it on any failure and let workqueue schedule the retry.
Removed the RetryCount.
if err != nil {
	klog.V(4).Infof("failed to update snapshot %s status: %v", utils.SnapshotKey(snapshot), err)
	// update snapshot status failed
	ctrl.updateSnapshotErrorStatusWithEvent(snapshot, v1.EventTypeWarning, "SnapshotStatusUpdateFailed", fmt.Sprintf("Snapshot status update failed, %v", err))
	return err
Looking at the caller, I see that we don't requeue the object and instead rely on the resync period to retry. I think we should change the behavior to requeue, but we should also make sure the workqueue has proper throttling/backoff as well.
Use queue.AddRateLimited(key)?
Can you take a look at this? 10017c2
Yes, I like this direction. Also cc @yuxiangqian and @jsafrane for more eyes. This will cause snapshotter to potentially retry more frequently than it has in the past, although it should be rate limited.
I wonder if we should delay this change. It changes snapshot controller behavior a lot. In the past we didn't retry when creating a snapshot fails, because snapshots are time-sensitive (other than a few retries when creating/updating API objects).
Can we revisit this after the 2.0 release? We also have a pending fix for the create-snapshot timeout issue that we should come back to after 2.0.
A little concerned, from the application's perspective, about removing the retry logic for status updates and snapshot content creation. Do we have a rough idea how often a retry would resolve API server failures? An application could end up relying on the existence of a snapshot's status field to decide whether or not to unquiesce. With a rate-limited requeue, couldn't the waiting time be significantly longer?
Removed snapshotQueue.AddRateLimited().
What is the conclusion here? Should I just add back createSnapshotContentRetryCount as an internal const, or is the current PR okay? @msau42
I don't like having retries within retries. It makes timing hard to reason about (metrics look strange), and we can end up overloading the API server since there are two different throttling behaviors that need to be controlled. We previously had issues in external-provisioner due to it having multiple retries within retries, and I'd like to avoid that here too. We can tune the rate limiting as necessary.
But I agree this is a large, fundamental change to how the snapshot controller has operated, so I'm ok deferring it to the next release.
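As a back-of-the-envelope illustration of the load concern (the numbers here are ours, not from this PR): nesting an inner loop of N attempts inside workqueue-driven retries multiplies the worst-case API calls by N compared to a single level of retry.

```go
package main

import "fmt"

// apiCalls returns the worst-case number of API requests when each
// workqueue invocation of the sync function performs innerRetries
// attempts internally before giving up and being requeued.
func apiCalls(workqueueInvocations, innerRetries int) int {
	return workqueueInvocations * innerRetries
}

func main() {
	fmt.Println(apiCalls(10, 1)) // single-level retry: 10 calls
	fmt.Println(apiCalls(10, 5)) // inner retry loop of 5: 50 calls
}
```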
Yes, let's add the retry count back for now so the only change is not making the retry count configurable.
Sure.
@@ -365,8 +362,10 @@ func (ctrl *csiSnapshotCommonController) updateSnapshot(snapshot *crdv1.VolumeSn
	klog.V(3).Infof("could not sync claim %q: %+v", utils.SnapshotKey(snapshot), err)
We don't want to return err in this case either?
/lgtm
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
Removes createSnapshotContentRetryCount and createSnapshotContentInterval
from command line options.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
It was brought up during review of the README PR (#206) that these parameters should just be internal variables, not command line options.
Does this PR introduce a user-facing change?: