Force push option #7268

Open
jpaasen opened this issue Jan 14, 2022 · 10 comments
Labels
A: data-sync (Related to dvc get/fetch/import/pull/push); feature request (Requesting a new feature); p2-medium (Medium priority, should be done, but less important)

Comments


jpaasen commented Jan 14, 2022

I recently experienced corrupted data when transferring large files to Google Cloud Storage (GCS).

See discussion on Discord here.

In short, the actual md5 of the files at the remote differed from the md5-based filename DVC had given them. Because the filenames at the remote were the expected ones, DVC treated the files as already uploaded, so pushing the data one more time did not fix them.

Right now there are no commands in DVC that can resolve an issue like this without losing data history.

You could do:

dvc remove data.dvc
dvc gc -w -c
dvc add data
dvc push

But this will delete all history for all files.

To solve this issue, our team had to trace back the md5 values of the corrupted files and delete those objects manually from GCS.
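
As a rough illustration of that manual back-tracing, the sketch below walks a local copy of the remote and reports objects whose content MD5 differs from the MD5 encoded in their path. It assumes a layout where an object with hash `abcdef...` is stored as `ab/cdef...` (directory objects carrying a `.dir` suffix); the function names are hypothetical and not part of DVC. As far as I know, DVC 2.x normalized line endings in text files before hashing, so a mismatch on a small text file may be a false positive.

```python
import hashlib
from pathlib import Path


def content_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(root: Path, obj: Path) -> str:
    """Rebuild the hash from an assumed 'ab/cdef...' object path."""
    prefix, rest = obj.relative_to(root).parts[-2:]
    return prefix + rest.removesuffix(".dir")


def find_corrupted(root: Path) -> list[Path]:
    """List objects whose content hash differs from the hash in their path."""
    corrupted = []
    for obj in sorted(root.glob("??/*")):
        if obj.is_file() and content_md5(obj) != expected_md5(root, obj):
            corrupted.append(obj)
    return corrupted
```

Calling `find_corrupted(Path("remote-copy"))` on a downloaded copy of the bucket would give the list of objects to delete by hand before pushing again.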

A "simple" solution would be to have a force option on dvc push (-f) that copies the data even if the md5 hashes are equal.
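
To make the request concrete, here is a minimal sketch of the behaviour being asked for. It is not DVC's code: the remote is modelled as a plain dict keyed by MD5, and `push` and its `force` parameter are hypothetical names chosen for illustration.

```python
import hashlib


def push(remote: dict[str, bytes], data: bytes, force: bool = False) -> bool:
    """Upload data under its MD5 key; return True if bytes were written.

    Without force, an existing key is trusted and the upload is skipped,
    which is how a corrupted object can survive repeated pushes.
    """
    key = hashlib.md5(data).hexdigest()
    if key in remote and not force:
        return False
    remote[key] = data
    return True
```

With a corrupted object already stored under the right key, `push(remote, data)` leaves it untouched, while `push(remote, data, force=True)` overwrites it with the correct bytes.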

Contributor

efiop commented Jan 14, 2022

@jpaasen Do you know how the corruption happened? We verify local files before uploading them, so it is quite unusual for files to get corrupted on remote.

An alternative/additional approach here would be to use the existing verify remote config option (and/or introduce a corresponding CLI flag) to not only check files when downloading them but to also check remote files when trying to upload (and probably during dvc status -c too).

efiop added the "awaiting response" and "feature request" labels on Jan 14, 2022
Contributor

daavoo commented Jan 14, 2022

@jpaasen Do you know how the corruption happened? We verify local files before uploading them, so it is quite unusual for files to get corrupted on remote.

Answered in discord:

We are using DVC for quite large files, typically > 1GB per file. The files got corrupted when the person doing the push turned on a VPN in the middle of the transfer. I guess gsutil (or whatever you are using under the hood) should have handled this in a graceful way, but it did not.

Author

jpaasen commented Jan 17, 2022

Thank you @daavoo

daavoo added the "A: data-sync" label on Feb 22, 2022
pmrowla added the "p2-medium" label and removed the "awaiting response" label on Apr 1, 2022

Luux commented Aug 1, 2023

I recently ran into a similar problem. Is there any update on this?

@dberenbaum
Collaborator

No updates yet. @iterative/dvc Any estimate on the level of effort?


serious-gist commented Aug 16, 2023

I also ran into this issue. Any update on this?

dberenbaum added this to the DVC project on Aug 16, 2023
github-project-automation bot moved this to Backlog in DVC on Aug 16, 2023


y-ksenia commented Aug 25, 2023

Similar problem (on google cloud remote). Somehow not all files are sent to remote, and pulling to another machine doesn't work properly.

@anumita0203

I'd like to contribute to this issue. I've spent some time searching for potential solutions, and I'd love some feedback on whether I'm heading in the right direction.

My understanding of the issue is that network problems can occasionally lead to file corruption when performing dvc push. However, since the filename has the same MD5 value as the local file, rather than the actual MD5 value of the uploaded file, DVC is unable to detect the corruption. As a result, DVC performs operations on the corrupted file as it would on an uncorrupted file. For instance, it doesn't replace the file during subsequent dvc push commands.

My high-level solution is to

  1. Pass the MD5 value of files in the Content-MD5 header for AWS, Azure, GCP and other cloud platforms which support it. These platforms calculate the MD5 hash of the uploaded file and compare it with the Content-MD5 header's value. In case the two values do not match, the upload is rejected.

  2. Set "verify" to true by default for all other remote types, similar to how it's currently implemented for Google Drive. My reasoning is that if we are unable to verify file integrity during the push phase, we can still validate it during actions like "dvc fetch" or "dvc pull".
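
For the first point, a minimal sketch of how the header value could be derived. Content-MD5 carries the base64 encoding of the raw 16-byte digest rather than the hex string used in DVC object names, so a conversion step is needed. The helper names are hypothetical, and how each storage client accepts the value is not shown here.

```python
import base64
import hashlib


def content_md5_header(data: bytes) -> str:
    """Return the Content-MD5 value: base64 of the raw 16-byte MD5 digest."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def hex_to_content_md5(hex_md5: str) -> str:
    """Convert a hex MD5 (as used in object names) to Content-MD5 form."""
    return base64.b64encode(bytes.fromhex(hex_md5)).decode("ascii")
```

Since the hex MD5 is already known before upload, `hex_to_content_md5` avoids hashing a large file a second time.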

@dberenbaum
Collaborator

Discussed this issue today. Takeaways:

  1. We would like to understand why files are being corrupted and prioritize that over force pushing.
  2. Before implementing force push, we need to understand if the issues are with the remote storage or the local index and possibly overwrite both.
  3. Overwriting files may be restricted on some remote types, so we may need to delete before writing, which would add complexity.

dberenbaum moved this from Backlog to Todo in DVC on Sep 6, 2023
dberenbaum moved this from Todo to Backlog in DVC on Oct 3, 2023

9 participants