-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Item action progress monitoring design #5108
Conversation
Codecov Report
@@ Coverage Diff @@
## main #5108 +/- ##
==========================================
+ Coverage 41.30% 41.34% +0.03%
==========================================
Files 210 211 +1
Lines 18439 18442 +3
==========================================
+ Hits 7616 7624 +8
+ Misses 10253 10246 -7
- Partials 570 572 +2
Continue to review full report at Codecov.
|
Another thing I'm not clear about the design is how BIA/RIA interact with data mover. Is that part will be covered in this PR? |
@blackpiglet The proposed plugins in this design will act as the building blocks for Datamover. Basically, this will enable datamover implementation, datamover is just one use case which will be enabled by this design proposal. |
d5b89d5
to
0153274
Compare
5cf0054
to
bb4054c
Compare
@reasonerjt @Lyndon-Li I've responded to the last round of comments and updated the PR with some changes in response. |
bb4054c
to
cbca4ac
Compare
snapshot (or other item operation) ID return. The primary reasons for this change are as follows: | ||
|
||
1. The intended scope has moved beyond snapshot processing, so it makes sense to support | ||
asynchronous operations in in other backup or restore item actions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
asynchronous operations in in other backup or restore item actions. | |
asynchronous operations in other backup or restore item actions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixing
Two new methods will be added to the VolumeSnapshotter interface: | ||
|
||
Progress(snapshotID string) (OperationProgress, error) | ||
Cancel(snapshotID string) (OperationProgress, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I missed this last time I review this PR. What should the return value OperationProgress
of func Cancel
be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, good catch. This should not return an OperationProgress. Cancel tells the plugin "cancel this if you can, since velero won't ask again for progess". It should only return an error. I'll update the PR to reflect this (if you look at the plugin API design PRs, you'll see that Cancel does not return an OperationProgress)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other plugin types already had Cancel correctly not returning this struct.
progress, Progress or Cancel will return an InvalidOperationIDError error rather than a populated | ||
OperationProgress struct. If the item action does not start an asynchronous operation, then | ||
operationID will be empty. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also clarify the expected return values if Progress
or Cancel
is called against a canceled operation? Will it be an error or somehow reflected in the OperationProgress
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cancel just returns an error if something unexpected happens. Calling cancel again shouldn't force an error -- it's up to the plugin, but I'd expect a second Cancel call would be a no-op (and Velero shouldn't actually call it twice unless there's a bug).
For Progress, again, this shouldn't happen, and the actual behavior is up to the plugin, but I'd expect the plugin to do one of two things based on the overall design already identified:
- if the canceled operation is no longer available (i.e. the CR associated with it has been deleted by its controller), then it should return a standard "not found" error just as it would with an operationID that's never been valid
- If the canceled operation is still there, then I'd expect progress to do what it normally does -- if it's actually in progress (cancel didn't work/isn't supported), then report that. If there's an error state, then report that. If the plugin has information on completion stats, report that. I don't know that it's really a separate case that Velero should care about, since Velero will never call Progress or Cancel after calling Cancel on an operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since Velero will never call Progress or Cancel after calling Cancel on an operation.
So it means the Cancel is a sync operation right? i.e. if everything is fine it won't return until the operation is really canceled. If this is the case, could you explicitly define that in this design?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If everything is fine, Velero will never call cancel. The only time Velero will call Cancel is if we've hit the timeout and the operation is still ongoing. However, once we call cancel, since it's up to the plugin what to do with it (could be a no-op if the plugin can't cancel), once Cancel returns, velero moves on. Basically the Cancel API call is "best effort" -- Velero doesn't guarantee that the cancel was successful, since some plugins may not be able to cancel at all. Velero calls cancel and moves on. I think that was stated explicitly, but if it's not clear enough, let me know and I can try to reword that somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sseago
Thanks for the reply.
Since you are putting details like error types in this section, IMO, the behavior of Cancel
func of a plugin should be be clarified.
I suggest we add more details in the paraphrase Velero can attempt to cancel an operation by calling the Cancel API call on the BIA/RIA....
. We should clarify the contract between the velero and the plugin in terms of Cancel
func. For example, when this function returns nil
, does it indicates (a) the operation is canceled successfully or (b) the cancel request is handled but the result is unknown?
I presonally prefer option (a), if it's not possible for the plugin to know if the operation is really canceled, it can return a status unknown error
and velero may choose to ignore it.
If you think (b) is a better choice, I'd suggest we clarify in what case the Cancel
func returns a non-nil value, or maybe it'd be better that the Cancel
func does not have any return values. So the caller will just call the func to attempt to cancel the operation, and it may or may not be successful.
Either way, I hope this can be clarified in this design, otherwise, the plugin developer will be confused when he/she is implementing the Cancel
func.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no expectation for cancel to return an error under normal operations. Since any plugin-implemented cancel request is likely to be async as well, there's no way to provide velero with reliable "this was successfully cancelled" information in the general case, since velero won't be checking back. The whole point of cancel from velero's point of view is "we hit the timeout, we're declaring this operation failed, and we're giving the plugin an opportunity to cancel on its end if it can" -- but velero is logging this as an error either way. This means the plugin doesn't need to tell velero anything. A plugin which doesn't support cancel will implement a no-op func -- returning nil for error. A cancel-supporting plugin will do whatever it needs to cancel and return nil for error (and it probably won't know at this point whether cancel was successful). An error will be returned if the plugin itself gets an unexpected error in cancellation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A plugin which doesn't support cancel will implement a no-op func -- returning nil for error. A cancel-supporting plugin will do whatever it needs to cancel and return nil for error (and it probably won't know at this point whether cancel was successful). An error will be returned if the plugin itself gets an unexpected error in cancellation.
This sounds OK to me. Let's make sure this is documented, in the .md file and in the inline comment of the go interface.
(_Completed_, _PartiallyFailed_, or _Failed_). Reconciliation should ignore backups that do not | ||
have a `velero-backup.json` object. | ||
|
||
The Backup/RestoreItemAction plugin identifier as well as the ItemID and OperationID will be stored |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory, it's possible the backup controller cancel some async operation when the backup is in InProgress
state, in that case, will the canceled operations exist in this JSON file? I think the answer should be Yes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So Velero should only call Cancel if the operation timeout has been reached -- and that will only happen during the last reconcile, just before Velero moves the backup into PartiallyFailed since some operations didn't complete in time. I'm thinking that we would apply one timeout to the full list of operations -- so we won't really be cancelling anything unless we're going through the list for the last time post-timeout-expiration. But yes, I don't think cancel removes items from the file. One thing that's not clear to me yet -- I think working through this might be part of the implementation-level task -- should this file also track completion status or not? Doing so would mean subsequent calls would only need to call the plugin again for previously incomplete operations, so that's a plus. The downside of this is we'd need to re-upload this file each time Velero reconciles again. I can see advantages to each approach here.
cbca4ac
to
9cebd5e
Compare
@reasonerjt @kaovilai @shubham-pampattiwar Updated the PR again with some minor fixes based on recent comments. I think the main unresolved issue is what to do about the OperationProgress struct fields itemsCompleted and itemsToComplete. I think we have 3 options based on the discussion above:
|
I like this
it means plugin authors can surface anything here.. and could include things like in-progress (retryable) errors encountered |
This design combines the requirements for the previously-merged Upload Progress Monitoring design with the requirements for the (not submitted but discussed in meetings and slack) proposed asynchronous item action plugins into one integrated proposal. Signed-off-by: Scott Seago <[email protected]>
9cebd5e
to
23c69f4
Compare
I've updated this to reflect recent discussions around the OperationProgress struct. We now have two int64 fields, |
@reasonerjt One reason I favor the explicit fields for completed/total (and added a units field to handle cases where the thing being measured isn't bytes) is that it is explicit in the plugin API, so velero knows that any plugins being run are going to be reporting correctly. By embedding it in json and documenting an approved format, it makes it more of a hidden API, where the plugin must follow it for stats to be displayed, but it's not enforced by the plugin API. |
@sseago |
@reasonerjt That's correct. If the fields are left blank, they should default to 0. So, if for example, a particular action is only able to be reported as "In Progress" or "Phase 2", then that would be set as the string value for |
@dsu-igeek @ywk253100 @blackpiglet @qiuming-best @Lyndon-Li |
its `velero-backup.json` object, the other objects in the object store for the backup will be | ||
effectively orphaned. This can currently happen but the current window is much smaller. | ||
|
||
### `<backup-name>-itemoperations.json.gz` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is trivial, but if the file is to be updated every time the operation's status is updated, it may be better no to gzip it.
@sseago Please ping me on slack and I'll approve again |
@reasonerjt @shubham-pampattiwar Thanks for the acks. I think I'll go ahead and merge this design as-is (since it's blocking other work), but I will make a small follow-on PR to clarify the Cancel API. |
This design combines the requirements for the previously-merged
Upload Progress Monitoring design with the requirements for the
(not submitted but discussed in meetings and slack) proposed asynchronous
item action plugins into one integrated proposal.
Signed-off-by: Scott Seago [email protected]
Thank you for contributing to Velero!
This is a design PR that is intended to combine the needs of the previous Upload Progress Monitoring (https://github.com/vmware-tanzu/velero/blob/main/design/upload-progress.md ) and the recently-discussed Asynchronous backup/restore actions into a single unified design. This design document should supercede both of the above.
Currently a Draft PR, since I still need to read through the existing upload progress implementation PR (#4467 ) to see whether there's anything I may have missed here. Also, the backup phase diagram needs to be updated, as it currently uses the earlier snapshot-specific terminology rather than the more general terms used here.
Please indicate you've done the following:
/kind changelog-not-required
as a comment on this pull request.site/content/docs/main
.