Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Item action progress monitoring design #5108

Merged
merged 1 commit into from
Dec 6, 2022

Conversation

sseago
Copy link
Collaborator

@sseago sseago commented Jul 9, 2022

This design combines the requirements for the previously-merged
Upload Progress Monitoring design with the requirements for the
(not submitted but discussed in meetings and slack) proposed asynchronous
item action plugins into one integrated proposal.

Signed-off-by: Scott Seago [email protected]

Thank you for contributing to Velero!

This is a design PR that is intended to combine the needs of the previous Upload Progress Monitoring (https://github.com/vmware-tanzu/velero/blob/main/design/upload-progress.md ) and the recently-discussed Asynchronous backup/restore actions into a single unified design. This design document should supercede both of the above.

Currently a Draft PR, since I still need to read through the existing upload progress implementation PR (#4467 ) to see whether there's anything I may have missed here. Also, the backup phase diagram needs to be updated, as it currently uses the earlier snapshot-specific terminology rather than the more general terms used here.

Please indicate you've done the following:

  • [x ] Accepted the DCO. Commits without the DCO will delay acceptance.
  • [x ] Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

@sseago sseago added the kind/changelog-not-required PR does not require a user changelog. Often for docs, website, or build changes label Jul 9, 2022
@github-actions github-actions bot requested review from blackpiglet and dsu-igeek July 9, 2022 00:32
@github-actions github-actions bot added the Area/Design Design Documents label Jul 9, 2022
@sseago sseago marked this pull request as draft July 9, 2022 00:33
@codecov-commenter
Copy link

Codecov Report

Merging #5108 (21e7928) into main (267db7a) will increase coverage by 0.03%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #5108      +/-   ##
==========================================
+ Coverage   41.30%   41.34%   +0.03%     
==========================================
  Files         210      211       +1     
  Lines       18439    18442       +3     
==========================================
+ Hits         7616     7624       +8     
+ Misses      10253    10246       -7     
- Partials      570      572       +2     
Impacted Files Coverage Δ
pkg/util/kube/predicate.go 57.14% <0.00%> (ø)
...g/controller/backup_storage_location_controller.go 64.40% <0.00%> (+5.49%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 267db7a...21e7928. Read the comment docs.

@blackpiglet
Copy link
Contributor

Another thing I'm not clear about the design is how BIA/RIA interact with data mover. Is that part will be covered in this PR?

@reasonerjt
Copy link
Contributor

@sseago To follow up the discussion in community mtg, I've found the following PRs merged for the itemsnapshotter plugin proposed by Dave:

#4429
#4417
#4416
#4077
#3805

@shubham-pampattiwar
Copy link
Collaborator

Another thing I'm not clear about the design is how BIA/RIA interact with data mover. Is that part will be covered in this PR?

@blackpiglet The proposed plugins in this design will act as the building blocks for Datamover. Basically, this will enable datamover implementation, datamover is just one use case which will be enabled by this design proposal.

@sseago sseago force-pushed the progress-design branch 2 times, most recently from d5b89d5 to 0153274 Compare July 18, 2022 20:07
@sseago
Copy link
Collaborator Author

sseago commented Oct 25, 2022

@reasonerjt @Lyndon-Li I've responded to the last round of comments and updated the PR with some changes in response.

snapshot (or other item operation) ID return. The primary reasons for this change are as follows:

1. The intended scope has moved beyond snapshot processing, so it makes sense to support
asynchronous operations in in other backup or restore item actions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
asynchronous operations in in other backup or restore item actions.
asynchronous operations in other backup or restore item actions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixing

Two new methods will be added to the VolumeSnapshotter interface:

Progress(snapshotID string) (OperationProgress, error)
Cancel(snapshotID string) (OperationProgress, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed this last time I review this PR. What should the return value OperationProgress of func Cancel be?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good catch. This should not return an OperationProgress. Cancel tells the plugin "cancel this if you can, since velero won't ask again for progess". It should only return an error. I'll update the PR to reflect this (if you look at the plugin API design PRs, you'll see that Cancel does not return an OperationProgress)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other plugin types already had Cancel correctly not returning this struct.

progress, Progress or Cancel will return an InvalidOperationIDError error rather than a populated
OperationProgress struct. If the item action does not start an asynchronous operation, then
operationID will be empty.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also clarify the expected return values if Progress or Cancel is called against a canceled operation? Will it be an error or somehow reflected in the OperationProgress?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cancel just returns an error if something unexpected happens. Calling cancel again shouldn't force an error -- it's up to the plugin, but I'd expect a second Cancel call would be a no-op (and Velero shouldn't actually call it twice unless there's a bug).

For Progress, again, this shouldn't happen, and the actual behavior is up to the plugin, but I'd expect the plugin to do one of two things based on the overall design already identified:

  1. if the canceled operation is no longer available (i.e. the CR associated with it has been deleted by its controller), then it should return a standard "not found" error just as it would with an operationID that's never been valid
  2. If the canceled operation is still there, then I'd expect progress to do what it normally does -- if it's actually in progress (cancel didn't work/isn't supported), then report that. If there's an error state, then report that. If the plugin has information on completion stats, report that. I don't know that it's really a separate case that Velero should care about, since Velero will never call Progress or Cancel after calling Cancel on an operation.

Copy link
Contributor

@reasonerjt reasonerjt Nov 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since Velero will never call Progress or Cancel after calling Cancel on an operation.

So it means the Cancel is a sync operation right? i.e. if everything is fine it won't return until the operation is really canceled. If this is the case, could you explicitly define that in this design?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If everything is fine, Velero will never call cancel. The only time Velero will call Cancel is if we've hit the timeout and the operation is still ongoing. However, once we call cancel, since it's up to the plugin what to do with it (could be a no-op if the plugin can't cancel), once Cancel returns, velero moves on. Basically the Cancel API call is "best effort" -- Velero doesn't guarantee that the cancel was successful, since some plugins may not be able to cancel at all. Velero calls cancel and moves on. I think that was stated explicitly, but if it's not clear enough, let me know and I can try to reword that somewhere.

Copy link
Contributor

@reasonerjt reasonerjt Dec 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sseago
Thanks for the reply.

Since you are putting details like error types in this section, IMO, the behavior of Cancel func of a plugin should be be clarified.
I suggest we add more details in the paraphrase Velero can attempt to cancel an operation by calling the Cancel API call on the BIA/RIA..... We should clarify the contract between the velero and the plugin in terms of Cancel func. For example, when this function returns nil, does it indicates (a) the operation is canceled successfully or (b) the cancel request is handled but the result is unknown?

I presonally prefer option (a), if it's not possible for the plugin to know if the operation is really canceled, it can return a status unknown error and velero may choose to ignore it.
If you think (b) is a better choice, I'd suggest we clarify in what case the Cancel func returns a non-nil value, or maybe it'd be better that the Cancel func does not have any return values. So the caller will just call the func to attempt to cancel the operation, and it may or may not be successful.

Either way, I hope this can be clarified in this design, otherwise, the plugin developer will be confused when he/she is implementing the Cancel func.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no expectation for cancel to return an error under normal operations. Since any plugin-implemented cancel request is likely to be async as well, there's no way to provide velero with reliable "this was successfully cancelled" information in the general case, since velero won't be checking back. The whole point of cancel from velero's point of view is "we hit the timeout, we're declaring this operation failed, and we're giving the plugin an opportunity to cancel on its end if it can" -- but velero is logging this as an error either way. This means the plugin doesn't need to tell velero anything. A plugin which doesn't support cancel will implement a no-op func -- returning nil for error. A cancel-supporting plugin will do whatever it needs to cancel and return nil for error (and it probably won't know at this point whether cancel was successful). An error will be returned if the plugin itself gets an unexpected error in cancellation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A plugin which doesn't support cancel will implement a no-op func -- returning nil for error. A cancel-supporting plugin will do whatever it needs to cancel and return nil for error (and it probably won't know at this point whether cancel was successful). An error will be returned if the plugin itself gets an unexpected error in cancellation.

This sounds OK to me. Let's make sure this is documented, in the .md file and in the inline comment of the go interface.

(_Completed_, _PartiallyFailed_, or _Failed_). Reconciliation should ignore backups that do not
have a `velero-backup.json` object.

The Backup/RestoreItemAction plugin identifier as well as the ItemID and OperationID will be stored
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, it's possible the backup controller cancel some async operation when the backup is in InProgress state, in that case, will the canceled operations exist in this JSON file? I think the answer should be Yes?

Copy link
Collaborator Author

@sseago sseago Nov 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So Velero should only call Cancel if the operation timeout has been reached -- and that will only happen during the last reconcile, just before Velero moves the backup into PartiallyFailed since some operations didn't complete in time. I'm thinking that we would apply one timeout to the full list of operations -- so we won't really be cancelling anything unless we're going through the list for the last time post-timeout-expiration. But yes, I don't think cancel removes items from the file. One thing that's not clear to me yet -- I think working through this might be part of the implementation-level task -- should this file also track completion status or not? Doing so would mean subsequent calls would only need to call the plugin again for previously incomplete operations, so that's a plus. The downside of this is we'd need to re-upload this file each time Velero reconciles again. I can see advantages to each approach here.

@sseago
Copy link
Collaborator Author

sseago commented Nov 1, 2022

@reasonerjt @kaovilai @shubham-pampattiwar Updated the PR again with some minor fixes based on recent comments. I think the main unresolved issue is what to do about the OperationProgress struct fields itemsCompleted and itemsToComplete. I think we have 3 options based on the discussion above:

  1. Leave it as-is
  2. Rename them to something that doesn't involve "items" to get away from possible user assumptions about the meaning of "item" (this was my suggestion)
  3. Replace them with a single string "progressSummary" -- the plugin would be responsible for saying "1000 out of 10000 bytes completed" and Velero would not attempt to parse the string, it would just take it as an opaque string.

@kaovilai
Copy link
Member

kaovilai commented Nov 9, 2022

I like this

Replace them with a single string "progressSummary"

it means plugin authors can surface anything here.. and could include things like in-progress (retryable) errors encountered

This design combines the requirements for the previously-merged
Upload Progress Monitoring design with the requirements for the
(not submitted but discussed in meetings and slack) proposed asynchronous
item action plugins into one integrated proposal.

Signed-off-by: Scott Seago <[email protected]>
@sseago
Copy link
Collaborator Author

sseago commented Nov 9, 2022

I've updated this to reflect recent discussions around the OperationProgress struct. We now have two int64 fields, Completed and Total representing the current progress of the operation (to facilitate calcuated percentage completion), with a new field OperationUnits ("bytes" for datamover and volume snapshotter plugins) to facilitate aggregation of completion totals across plugins. For plugins which are unable to furnish percentage completon, there is also a new Decription field in OperationProgress to allow for freeform progress text (perhaps phase in a workflow, etc.).

@sseago
Copy link
Collaborator Author

sseago commented Nov 10, 2022

@reasonerjt One reason I favor the explicit fields for completed/total (and added a units field to handle cases where the thing being measured isn't bytes) is that it is explicit in the plugin API, so velero knows that any plugins being run are going to be reporting correctly. By embedding it in json and documenting an approved format, it makes it more of a hidden API, where the plugin must follow it for stats to be displayed, but it's not enforced by the plugin API.

@reasonerjt
Copy link
Contributor

@sseago
Sorry for the late reply.
I agree with your concern regarding "hidden API".
So the contract is that if the plugin can't report the numeric progress the complete/total will be left as 0 right?

@sseago
Copy link
Collaborator Author

sseago commented Nov 22, 2022

@reasonerjt That's correct. If the fields are left blank, they should default to 0. So, if for example, a particular action is only able to be reported as "In Progress" or "Phase 2", then that would be set as the string value for Description, and the other fields will be left blank.

@weshayutin
Copy link
Contributor

@dsu-igeek @ywk253100 @blackpiglet @qiuming-best @Lyndon-Li
We very much need maintainers to review this PR design asap. Please help!

its `velero-backup.json` object, the other objects in the object store for the backup will be
effectively orphaned. This can currently happen but the current window is much smaller.

### `<backup-name>-itemoperations.json.gz`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is trivial, but if the file is to be updated every time the operation's status is updated, it may be better no to gzip it.

@reasonerjt
Copy link
Contributor

@sseago
I'm approving this design. If you wanna make updates to the markdown file to clarify the Cancel func following this comment

Please ping me on slack and I'll approve again

@sseago
Copy link
Collaborator Author

sseago commented Dec 6, 2022

@reasonerjt @shubham-pampattiwar Thanks for the acks. I think I'll go ahead and merge this design as-is (since it's blocking other work), but I will make a small follow-on PR to clarify the Cancel API.

@sseago sseago merged commit 2ce46bd into vmware-tanzu:main Dec 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area/Design Design Documents kind/changelog-not-required PR does not require a user changelog. Often for docs, website, or build changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants