Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BIAv2 async operations controller work #5849

Merged

Conversation

sseago
Copy link
Collaborator

@sseago sseago commented Feb 8, 2023

Signed-off-by: Scott Seago [email protected]

Thank you for contributing to Velero!

Please add a summary of your change

Does your change fix a particular issue?

Fixes #5490, #5108

Please indicate you've done the following:

  • [x ] Accepted the DCO. Commits without the DCO will delay acceptance.
  • [x ] Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

@sseago sseago marked this pull request as draft February 8, 2023 20:09
@sseago
Copy link
Collaborator Author

sseago commented Feb 8, 2023

Currently marked as draft, awaiting two things:

  1. I need to see what doc updates are needed, at a minimum documenting the new CRD fields and new server config params
  2. I need to test this with a real sample plugin to make sure that the new controller works properly in practice, although unit tests are currently passing

EIther way, feel free to start reviewing the PR, as the code itself should be basically done except for any bugs I find during testing.

@codecov-commenter
Copy link

codecov-commenter commented Feb 8, 2023

Codecov Report

Merging #5849 (76a2eed) into main (c6fba55) will decrease coverage by 0.22%.
The diff coverage is 38.09%.

@@            Coverage Diff             @@
##             main    #5849      +/-   ##
==========================================
- Coverage   39.94%   39.73%   -0.22%     
==========================================
  Files         253      255       +2     
  Lines       22259    23128     +869     
==========================================
+ Hits         8892     9189     +297     
- Misses      12716    13246     +530     
- Partials      651      693      +42     
Impacted Files Coverage Δ
internal/hook/item_hook_handler.go 87.11% <0.00%> (-0.81%) ⬇️
pkg/backup/backup.go 52.58% <0.00%> (-26.30%) ⬇️
pkg/builder/backup_builder.go 0.00% <0.00%> (ø)
pkg/cmd/cli/backup/logs.go 0.00% <0.00%> (ø)
pkg/cmd/server/server.go 6.14% <0.00%> (-0.49%) ⬇️
pkg/cmd/util/output/backup_describer.go 0.00% <0.00%> (ø)
pkg/persistence/object_store.go 52.65% <0.00%> (-0.57%) ⬇️
...k/backupitemaction/v2/backup_item_action_client.go 0.00% <0.00%> (ø)
pkg/backup/item_collector.go 54.01% <13.63%> (-5.87%) ⬇️
...k/backupitemaction/v2/backup_item_action_server.go 31.29% <16.66%> (-1.25%) ⬇️
... and 15 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@sseago sseago force-pushed the async-backup-controller branch from 0784837 to 72e3a06 Compare February 8, 2023 20:35
@Lyndon-Li Lyndon-Li self-assigned this Feb 9, 2023
@sseago sseago force-pushed the async-backup-controller branch 4 times, most recently from e23d5a6 to ae5bb59 Compare February 14, 2023 20:17
@sseago sseago marked this pull request as ready for review February 14, 2023 20:19
@sseago sseago force-pushed the async-backup-controller branch from ae5bb59 to ecbafce Compare February 16, 2023 00:37
@sseago
Copy link
Collaborator Author

sseago commented Feb 16, 2023

I've done a first pass, making the changes as suggested above:

  1. introduced new backup phases: FinalizingAfterPluginOperations and FinalizingAfterPluginOperationsPartiallyFailed
  2. Updated the biav2 Execute func to return a list of ResourceIdentifiers if the plugin's async operation will be updating an item that needs to be updated in the tarball at the end. This is optional, so plugins which don't need this status update can return nil (skipping the new phases)
  3. Updated the itemoperation json format to track the itemsToUpdate from the plugin Execute
  4. Updated the async backup operations controller to check for itemsToUpdate before changing the Backup phase, and if there are any items, then update the phase to FinalizingAfterPluginOperations instead of Completed.

What's not yet done is the controller work to process the finalize phase. That's next.

@sseago
Copy link
Collaborator Author

sseago commented Feb 16, 2023

upon initial investigation into the right way to handle the finalizing phase:

  1. I actually think the cleanest solution may be to do this work in the async update controller and not in the backup controller. 95% of the stuff we're doing in backup controller is unnecessary here (generating object list, dealing with restic, snapshots, etc. The only thing we need is the item backupper, but that's not even in the controller package, so that can be pulled in elsewhere anyway. We still may want a separate controller for this phase, but that may be overkill. Unless we decide that this is an opportunity for to remove a bottleneck with a new controller. If we do it in the backup controller, then a backup that's basically complete will have to wait for any existing running (and queued) backups to complete before this final (short) update cycle, which could be a long time. I think I'll start with adding it to the async controller, and if we decide it's necessary, we can move it to a new controller, but the backup controller is probably not a good place for it.
  2. I don't think we can update a file in a tar archive directly. It looks like the only thing we can really do is to first run the item backupper our itemsToUpdate and get a list of updated items, then download the tarball from the object store and open a tar reader on this, and a new tar writer, iterate through the tar archive (tar reader), and for each item we come across, compare the filepath to the pathnames of the updated items, and if there's a match, write the new item to the tar writer, and if there's not a match, copy the old item from reader to writer. Then upload the new tarball to the object store.

@Lyndon-Li
Copy link
Contributor

Lyndon-Li commented Feb 16, 2023

@sseago some comments for your this statement:

I actually think the cleanest solution may be to do this work in the async update controller and not in the backup controller.

I don't have too much inclination as long as we can make a general backup finalizing phase and make the logic reasonable

I don't think we can update a file in a tar archive directly

I see the difficulty to update a tar in the backup storage if it already exists. The difficulty comes down to we've uploaded the same item twice, once in the backup controller when the async operations are launched, the other is when the backup comes to FinalizingAfterPluginOperations status.

Merging with the general finalizing phase/handler concept, I suggest that we persist the overall backup data only once in the general finalizing handler. That is, we never persist the backup until the backup reaches to FinalizingAfterPluginOperations status. In the single persisting operation, we combine all the data and upload them once

@sseago
Copy link
Collaborator Author

sseago commented Feb 16, 2023

Merging with the general finalizing phase/handler concept, I suggest that we persist the overall backup data only once in the general finalizing handler. That is, we never persist the backup until the backup reaches to FinalizingAfterPluginOperations status. In the single persisting operation, we combine all the data and upload them once

I'd rather not take this approach. This was actually what Dave had proposed in his original design. The problem is that it is less resilient to velero pod crash/reboot. This is already a limitation of the Velero controllers -- if velero restarts with an InProgress backup or restore, that backup/restore will fail and must be restarted. If we defer upload until all async actions are done and finalizing happens, then we extend this problem further. I know in some of our discussions around how to fix Velero's crash/restart problem in the future, one possibility discussed has been uploading backup content more frequently during the backup process so Velero can pick up with at least some done (that's a future feature, not here).

But back to here -- with the current approach of uploading the backup and metadata upon completion of initial backup processing, if Velero crashes while the next backup is running and this backup is WaitingForPluginOperations, the only thing lost for this backup are the itemoperations updates that have not yet been written to object storage, so when velero comes back up, it can just query all plugins again for progress. I think the one final download before upload upon finalizing the backup is a small performance hit which gains resilience for us -- and it will only be required if there are plugin operations with updates during the async phase.

@sseago sseago force-pushed the async-backup-controller branch from ecbafce to 77e445a Compare February 20, 2023 02:33
@sseago
Copy link
Collaborator Author

sseago commented Feb 20, 2023

I've updated the PR to add the new finalizing phases and finalizing controller. The main change from previous PR discussion is rather than downloading and modifying the tarball during the finalize reconcile, I'm just creating a second tarball with only those resources indicated as needing update by the async backup item actions, so it really functions as an incremental backup. The restore workflow has been modified to download both tarballs (if present) and extract them in order, that way the updated resources overwrite the original copy prior to restore.

Two additional notes:

  1. There are a couple conflicts to resolve. I'll rebase and re-push on Monday.
  2. I have added and run unit tests with these changes, but I have not tested the latest changes with a real velero server and real plugins/backups/restores. I'll also do that on Monday.

@sseago sseago force-pushed the async-backup-controller branch from 77e445a to f36b407 Compare February 20, 2023 20:42
@sseago
Copy link
Collaborator Author

sseago commented Feb 20, 2023

Updated with the suggestions I'd made this morning. To make sure we don't get into a situation where we're updating resources in finalize that weren't in the original backup.tar.gz tarball, instead of returning an explicit list of items to update, it's been replaced with a boolean updateAdditionalItemsAfterOperation. If false, then no items will be added to the finalize list. If true (and the plugin returns an async operationID), then the items returned as additionalItems are also added to the finalize update list. Since the purpose of additional items is to add those items to the backup, this means the items we're adding to the finalize tarball were in the original tarball -- this is just capturing the status and other changes that happen as a result of the asynchronous operation. For datamover use cases, this means that the PVC (or VolumeSnapshot) BIA that creates a SnapshotBackup CR and returns its obj reference as an async operationID will also return the SnapshotBackup ResourceIdentifier as an additional item (to ensure that the SnapshotBackup gets included in the backup), and will return a true value for updateAdditionalItemsAfterOperation to ensure that the backup finalizer will update this resource in the backup with its final status. Next step for me is to test all of this with real backups in a real cluster with real plugins.

@shubham-pampattiwar
Copy link
Collaborator

+1 @sseago the latest comment seems apt. @Lyndon-Li all the concerns should have been addressed with the latest changes in this PR, WDYT ?

@Lyndon-Li
Copy link
Contributor

Lyndon-Li commented Feb 21, 2023

@sseago @shubham-pampattiwar Actually, the question is not about restricting updating items to the existing ones in the original backup.tar.gz at the finalizing phase, but about that we should not treat the SnapshotBackup CR (or similar objects created by plugin) as part of backup data.

See my comments from here and downwards.

@sseago
Copy link
Collaborator Author

sseago commented Feb 27, 2023

Updated again with design doc and api doc updates.

@sseago sseago force-pushed the async-backup-controller branch from 76a2eed to 89ef7da Compare February 28, 2023 14:53
@sseago
Copy link
Collaborator Author

sseago commented Feb 28, 2023

Oh, looks like I need to replace apimachinery/pkg/util/clock in this PR now. Already done for the rest of velero, but new files in my PR add this and that's causing failures. Will update.

@sseago sseago force-pushed the async-backup-controller branch from 89ef7da to a4069fc Compare February 28, 2023 15:14
@sseago
Copy link
Collaborator Author

sseago commented Feb 28, 2023

Clock replaced.

@sseago sseago force-pushed the async-backup-controller branch from a4069fc to 86ce1c3 Compare February 28, 2023 22:44
@@ -158,6 +163,12 @@ status:
volumeSnapshotsAttempted: 2
# Number of volume snapshots that Velero successfully created for this backup.
volumeSnapshotsCompleted: 1
# Number of attempted async BackupItemAction operations for this backup.
asyncBackupIemOperationsAttempted: 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -34,6 +34,7 @@ message ExecuteResponse {
bytes item = 1;
repeated generated.ResourceIdentifier additionalItems = 2;
string operationID = 3;
repeated generated.ResourceIdentifier itemsToUpdate = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DoesitemsToUpdate reflect the meaning of the return value as described in the comment of .go file:

a second slice of ResourceIdentifiers specifying related items which should be backed up after all asynchronous operations have completed.

But I don't have a good idea about naming here atm...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original thought was itemsToUpdateAfterOperations but that felt too unwieldy. Another option would be itemsToFinalize or postOperationItems I'm fine with any of these, though -- lets discuss next week for a possible follow-on change. If we're going to change it, it needs to be before 1.11 is out. @shubham-pampattiwar any thoughts here?

Copy link
Contributor

@reasonerjt reasonerjt Mar 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sseago I would vote for postOperationItems if we see them as a second list of items to backup.
There are some inconsistencies I mentioned in
#5849 (comment)

Maybe we should refine the flow here and treat the items in itemsToUpdate or postOperationItems differently, for example, we may define the flow like the backupper will only put these items into tarball as-is without any additional work, there is still inconsistency, but this inconsistency is easier to understand...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reasonerjt postOperationItems works for me. I'll update that in the follow-on PR. As for the inconsistencies between this and additional items, see comments below. I think we can converge things a bit more and minimize the inconsistency -- at this point, I think the only thing that must be different is that postOperationItems are not expected to create asynchronous operations, since that would potentially lead to never-ending backup.

"Name": "my-volume-vsc"
},
"operationID": "<DataMoverBackup objectReference>",
"itemsToUpdate": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a list?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixing now.

_WaitingForPluginOperationsPartiallyFailed_). In the event of any of the progress checks return an
error, the phase will move to _WaitingForPluginOperationsPartiallyFailed_. The backup will then be
requeued and will be rechecked again after some time has passed.
_WaitingForPluginOperationsPartiallyFailed_), and the async backup operations controller will
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After re-reading the PR I find the terminology a little redundant, that sometimes it's called itemOperation sometimes it's called pluginOperation sometimes it's asyncPluginOperation, if there are no particular reasons for such difference would it make the communication easier if we unify the terms?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at it now, I think the actual words used in the API leave off async (i.e. command flags, status fields) -- those tend to be very long already, so it makes them more unwieldy. We tend to add "async" in the text descriptions, documentation, etc. to make it clearer though. I could remove "async" from the other part of the messaging, but I actually think they help to make it clearer the way it is now. I don't have a strong opinion here, so we can deal with any changes in a follow-on PR next week. I think for the controller naming, we can leave that as-is either way, since there's nothing user-visible there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's trasnparent to the user, I'm just thinking making the naming more consistent and simplified would help the communication between developers, and make the code easier to understand.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reasonerjt if we need everything to either have "async" or not have it in the name, then I think we'd better remove it, since some of the names will be too long and unwieldy with them. I'm ok with removing Async where it appears, if that's ok with everyone. This would, for example, mean that asyncBackupOperationsReconciler would become backupOperationsReconciler.

pkg/archive/filesystem.go Outdated Show resolved Hide resolved
@sseago sseago force-pushed the async-backup-controller branch 2 times, most recently from 206df46 to 087c43c Compare March 1, 2023 16:48
@sseago sseago force-pushed the async-backup-controller branch from 087c43c to c3d1d83 Compare March 1, 2023 16:49
@sseago
Copy link
Collaborator Author

sseago commented Mar 1, 2023

@reasonerjt I fixed the three easy suggestions and updated the PR. The two comments related to naming things can be handled in a follow-on PR if we decide a change is necessary.

}
// cancel operation if past timeout period
if operation.Status.Created.Time.Add(backup.Spec.ItemOperationTimeout.Duration).Before(time.Now()) {
_ = bia.Cancel(operation.Spec.OperationID, backup)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the Cancel method is called, what is the expected return value when we call Progress again?
We have the other place calling getBackupItemOperationProgress in backup controller, so it is possible that Progress is called again after Cancel, is there any problem?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would probably depend on the plugin. But calling Progress after Cancel shouldn't crash anything. A plugin might put "Cancelled" in the progress description field, or it might just mark it as failed/errored.

design/general-progress-monitoring.md Show resolved Hide resolved
@@ -51,6 +52,16 @@ type Request struct {
PodVolumeBackups []*velerov1api.PodVolumeBackup
BackedUpItems map[itemKey]struct{}
CSISnapshots []snapshotv1api.VolumeSnapshot
itemOperationsList *[]*itemoperation.BackupOperation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a strong reason to make it a pointer to a slice?

Why not make it

type Request struct {
ItemOperationList []*itemoperation.BackupOperation
}

So it's more consistent with other fields?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reasonerjt The reason it's a pointer is that the item backupper grabs this and appends to it. We need the pointer so we're appending to the list that belongs to the request and not the just a copy in backupItem

return r.getItems(nil)
}

// getAllItems gets all relevant items from all API groups.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrong comment

@@ -164,7 +181,7 @@ func getOrderedResourcesForType(orderedResources map[string]string, resourceType
}

// getResourceItems collects all relevant items for a given group-version-resource.
func (r *itemCollector) getResourceItems(log logrus.FieldLogger, gv schema.GroupVersion, resource metav1.APIResource) ([]*kubernetesResource, error) {
func (r *itemCollector) getResourceItems(log logrus.FieldLogger, gv schema.GroupVersion, resource metav1.APIResource, resourceIDsMap map[schema.GroupResource][]velero.ResourceIdentifier) ([]*kubernetesResource, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this func becomes a little confusing b/c when resourceIDsMap is provided the include/exclude rules in the backupRequest will be ignored.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the issue here is that for finalize, we don't need to deal with include/exclude since we have a limited list of items that we know we need to include, so instead of include/exclude rules plus listing items from cluster, we pull items from the map. We still need the rest of this logic since it does all of the GVR and preferred version negotiation which is necessary for getting items in their proper path for the tarball. But I can add clearer comments to the func to make this distinction clear. There are other places in velero code where we skip include/exclude rules under certain contexts.

@@ -148,6 +148,26 @@ func (b *backupSyncReconciler) Reconcile(ctx context.Context, req ctrl.Request)
continue
}

if backup.Status.Phase == velerov1api.BackupPhaseWaitingForPluginOperations ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not merge the if statements and you can print the backup.Status.Phase in the log

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do this in the follow-on PR.

delete(u.GetAnnotations(), mustIncludeAdditionalItemAnnotation)
}
obj = u
if finalize {
Copy link
Contributor

@reasonerjt reasonerjt Mar 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that means when calling BIAv2 plugin in the Finalization phase, the async operation may still be triggered b/c we can't control that, and we choose not to track the progress of these operations, IMO this is a little problematic.
This probably will not happen currently, but we should at least document it in the design clearly.

Additionally, it may not be right to ignore the additional Items returned by the BIA.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely want to document for users that they should not be starting async actions if backup phase is finalize -- and I'll update my example plugin to check phase as an example. There's a real risk of an infinite loop scenario if we track async operations started by plugins during finalize, as the plugin would have to go back to waiting for plugin operations, then finalize again, where new async operations could start, etc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional items is less complicated. We can definitely process them here if we need to. In the original design, it was unnecessary since we were actually backing up these items on the first pass, then finalize was just backing them up again for updates, so the additional items were already included. But with the current implementation, these items are deferred until finalize, so they don't have the opportunity to add additional items anymore. We may want to add that back, although it will add some complications to processing. I'll consider this problem in the follow-on PR.

return false, itemFiles, err
}

if !finalize && groupResource == kuberesource.Pods {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there will be problem if we backup data in the Finalization phase?

So the Finalization phase was introduced to handle the itemsToUpdate list, and make sure they are backed up, I was thinking we should handle these items as how we did in the regular backup process.

We should avoid the case that the itemsToUpdate are handled differently in many ways in terms of backup.
If there's a very strong reason for that, let's make sure we document the difference in the flow and the reason behind such difference.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my general comment at the end. I think I'm closer to agreement with you here now, at least for pods/pvcs. I think I will probably be able to remove this finalize special case.

c.metrics.RegisterBackupItemsErrorsGauge(backupScheduleName, backup.Status.Errors)
// FIXME: download/upload results once https://github.com/vmware-tanzu/velero/pull/5576 is merged
}
removeIfComplete := true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure why this is needed, is there any use case we need the data in the map after the backup is in the terminal state?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, we don't remove this if there are errors hit in encoding/uploading, or the backup patching fails. The thinking was that if we had errors like that, they could be transient so we should try again on the next iteration. That being said, we could simplify this by always removing on completion. That would mean that we may not completely recover from transient errors (i.e. object store may not get updated properly if there was a transient object store update error). I don't have a strong opinion here, though.

backup.Status.Phase == velerov1api.BackupPhaseFinalizingAfterPluginOperations ||
backup.Status.Phase == velerov1api.BackupPhaseFinalizingAfterPluginOperationsPartiallyFailed) {

c.deleteOperationsForBackup(backup.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From an OO perspective the delete and put should be funcs for BackupItemOperationsMap

opsLock sync.Mutex
}

// If backup has changes not yet uploaded, upload them now
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment should be like

// UpdateForBackup ...

And it should explain what this func is for.
BTW, the current comment // If backup has changes not yet uploaded, upload them now seems should be at the place where this func is called...

logger logrus.FieldLogger
clock clocks.WithTickerAndDelayedExecution
frequency time.Duration
itemOperationsMap *BackupItemOperationsMap
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the BackupItemOperationsMap serves as a global map to store the operation list for each backup, and it's shared between asyncBackupOperationsReconciler and downloadRequestRonconciler

Therefore, ideally the lifecycle of the map should be separated from this controller. It should be created in server.go as a utility?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reasonerjt It's more limited than that. First, it's only a map for backups currently waiting for item operations. So 99% of what's done here is only by the asyncBackupOperationsReconciler. The downloadRequestReconciler only uses this to force a push to object store when getting a download request for a backup's item operations if they're ongoing. However, with our prior discussions here, we talked about how we want to change this for 1.12 to not use a download request here and possibly use a new CR for this. So it is primarily managed by the async reconciler -- but we still could split it out as a utility and pass it in to both controllers. I'll look into this with the follow-on PR. If splitting it out makes it easier to follow the code, I'll consider doing that. Functionally, the two would be equivalent, so the main question will be ease of understanding/maintenance of the code, so if spliting it out helps, then we should probably do that.

@reasonerjt
Copy link
Contributor

I have an overall comment regarding how to handle the itemsToUpdate:

I was thinking the second list of items returned by BIAv2 ("itemsToUpdate") should be treated as much equally as possible as the first list of additional Items.
Seems now it's treated quite differently with finalize==True, we skipped hook execution, backing up the data in PV, and we ignored the operation ID and additionalItems when executing BIA against these items.
IMO, since we are seeing the list of itemsToUpdate as a list of items to be "backed up", we should treat them as regular items in terms of the backup flow. If there're strong reasons that they must be treated differently, then maybe we should use a different approach, i.e. we only put them into the backup tarball, and nothing else. Otherwise, I see there will be growing confusion when we make changes in the backup flow in the future, we need to think again and again regarding "whether this should apply to the itemsToUpdate or not"

@reasonerjt
Copy link
Contributor

reasonerjt commented Mar 6, 2023

@sseago as we've discussed, I'm adding the approval to unblock the work.

But there are a few comments I think we'll need to address before v1.11 release, feel free to ping me on slack.

@sseago
Copy link
Collaborator Author

sseago commented Mar 6, 2023


I have an overall comment regarding how to handle the itemsToUpdate:
I was thinking the second list of items returned by BIAv2 ("itemsToUpdate") should be treated as much equally as possible as the first list of additional Items.
Seems now it's treated quite differently with finalize==True, we skipped hook execution, backing up the data in PV, and we ignored the operation ID and additionalItems when executing BIA against these items.
IMO, since we are seeing the list of itemsToUpdate as a list of items to be "backed up", we should treat them as regular items in terms of the backup flow. If there're strong reasons that they must be treated differently, then maybe we should use a different approach, i.e. we only put them into the backup tarball, and nothing else. Otherwise, I see there will be growing confusion when we make changes in the backup flow in the future, we need to think again and again regarding "whether this should apply to the itemsToUpdate or not"

The thinking is that we make the backup as normal as possible. Thinking through the cases, we may be able to eliminate part of this, but here's the challenge -- there are good reasons for running normal BIA plugins, and there are good reasons for skipping additonal items/async processing. Lets consider the things being done:

  1. calling BIA plugins. This is pretty much essential if we are going to avoid data inconsistencies. For a given resource type, the actual kubernetes data pushed to the backup tarball needs to be the same, whether it happened during initial backup or during finalize. Not calling BIA plugins would result in potential data inconsistencies. That's why this is done in all cases, whether or not finalize is set.
  2. additional item processing. We're currently ignoring additional items, since anything pulled in through the first run of everything was already included. That being said, it may make sense to add this back in here to reduce the number of finalize special casing.
  3. Hook processing. Again, I was thinking that we didn't want to run hooks on finalize, but now that I look back, maybe we should just do this anyway. Are there any downsides to running hooks at this time? Realistically, we're only running hooks on pods, which we don't expect to have here anyway, since I don't see a use case where an async plugin would create a new pod to add only post-async-processing.
  4. PV backup. I don't think we'll actually encounter this in practice. We'd only run into this if an async plugin added a PV/PVC to be included post-async processing, but again this doesn't really make much sense. Again, my question is would it be OK to just remove the finalize-specific check here?
  5. async operation processing. This is really the only one that we probably really shouldn't do during finalize. We've already gone through the waiting for plugin operations phase, we're finalizing. Accepting an operation ID here would require us to return to the prior phase, which would then require re-finalizing again, and we run into the possibility that a badly-written plugin could cause an infinite loop here. Also, again, I'm not sure there's a use case here. It would mean an async plugin creates something that then gets processed for a while, and then once processing is done, the new item is backed up and that new item also creates something that gets processed by yet another external controller. If we have a genuine use case, we could implement this -- but it wouldn't eliminate special-case processing. It would just change the "ignore operation id" special case to "if we get new operation IDs we need to send the backup back to waiting for plugin operations" instead.

To summarize the above, we really need the backup item workflow to allow basic BIA plugins to run here to ensure data consistency. I'm thinking that we may be able to just eliminate the "finalize==true" special cases around Hook processing and PV backup. As for additional items, maybe that too. For async operation processing, I'm thinking we leave the special case here but revisit post-1.11 if we identify a use case that needs it -- doing this will add a good bit of extra logic at this late stage, while the other changes are really just simplifications.

In any case, lets consider the above and I'll add a follow-on PR for pv/hook/additional item changes if we agree to do that, and lets hold off on the other. I'll get to the other review comments later today.
@shubham-pampattiwar any thoughts on the above?

@reasonerjt
Copy link
Contributor

reasonerjt commented Mar 6, 2023

To summarize the above, we really need the backup item workflow to allow basic BIA plugins to run here to ensure data consistency. I'm thinking that we may be able to just eliminate the "finalize==true" special cases around Hook processing and PV backup. As for additional items, maybe that too. For async operation processing, I'm thinking we leave the special case here but revisit post-1.11 if we identify a use case that needs it -- doing this will add a good bit of extra logic at this late stage, while the other changes are really just simplifications.

I prefer we make the operation processing only special-case, but in theory, it's hard to enforce that when the BIAv2 plugin is called against an item in finalize phase it does not trigger any async Ops, any good idea.

@sseago
Copy link
Collaborator Author

sseago commented Mar 6, 2023

@reasonerjt So I think we're at the point now where we can probably do basically all the same things for items in finalize as normal backup (removing most of the inconsistencies in the current PR). The only thing we won't do in finalize is create async operations, as described above. This leaves the question of how to handle it if a plugin goes against documented recommendations and starts an operation. Current code ignores operation ID, but I'm thinking instead we should log an error if this happens. That's the closest we can come to enforcing that it doesn't trigger them. If we create errors for this, then that will raise the visibility vs. just ignoring the created operation.

@shubham-pampattiwar
Copy link
Collaborator

shubham-pampattiwar commented Mar 7, 2023

@reasonerjt @sseago I am on board with most of the agreed upon changes regarding pv/hook/additional items, these changes can be done in a follow-up PR.

@shubham-pampattiwar shubham-pampattiwar merged commit 8bed159 into vmware-tanzu:main Mar 7, 2023
@sseago
Copy link
Collaborator Author

sseago commented Mar 7, 2023

@reasonerjt @shubham-pampattiwar I will be working on a follow-on PR to address several points from this review over the next couple days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement backup monitoring for backupItemAction
5 participants