dvc push doesn't recognise that files are missing in remote storage #4164

Closed
dldx opened this issue Jul 3, 2020 · 19 comments
@dldx

dldx commented Jul 3, 2020

For some reason, I was missing a number of files in my remote storage on GCS. When I run dvc pull, it fails with the following error:

ERROR: failed to download 'gs://redacted/dvc/redacted/00/642ad56326eb0b6caf3784810a49a0' to '../.dvc/cache/00/642ad56326eb0b6caf3784810a49a0' - 'NoneType' object has no attribute 'size'

To fix this, I decided to run dvc add and dvc push on a machine which still has these files. The commands ran fine, and dvc push reported that everything was up to date. In reality, however, it did not upload the missing files to the remote cache. In the end, I had to solve this by manually uploading these files to my remote storage in GCS.

Bug Report

Please provide information about your setup

Output of dvc version:

$ dvc version 
DVC version: 1.1.6
Python version: 3.7.6
Platform: Linux-4.9.0-12-amd64-x86_64-with-debian-9.12
Binary: False
Package: pip
Supported remotes: gs, hdfs, http, https, ssh
Cache: reflink - not supported, hardlink - supported, symlink - supported
Filesystem type (cache directory): ('ext4', '/dev/sda1')
Repo: dvc, git
Filesystem type (workspace): ('ext4', '/dev/sda1')

dvc_push_log.txt

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Jul 3, 2020
@efiop
Contributor

efiop commented Jul 3, 2020

@dldx Could you try removing .dvc/tmp/index and try again?

Did someone run garbage collection on your remote by any chance?

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Jul 3, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Jul 3, 2020
@dldx
Author

dldx commented Jul 3, 2020

@efiop Hmm, nope, that didn't change anything... It is easy for me to replicate this. I just need to delete a file in the remote storage after I have pushed it.

I didn't run gc but someone else may have (I have warned everyone about running gc on the remote). I'm not really sure yet what caused it but I'm wondering if there are other missing files as well.

@efiop
Contributor

efiop commented Jul 3, 2020

I just need to delete a file in the remote storage after I have pushed it.

@dldx That messes up our index, which is why I asked you to remove .dvc/tmp/index to force it to check things manually. Could you remove it and check dvc status -c ?

@dldx
Author

dldx commented Jul 3, 2020

Sorry for the confusion. I did remove the index, but it didn't trigger a check.

$ rm -rf ../.dvc/tmp/index
$ dvc status -c PSScene4Band.dvc
Data and pipelines are up to date.
$ dvc push -R PSScene4Band
Everything is up to date.

No push triggered even though there are remote files missing.

@efiop
Contributor

efiop commented Jul 3, 2020

@dldx Are you sure you've deleted a file that is used in PSScene4Band.dvc ?

@efiop
Contributor

efiop commented Jul 3, 2020

@dldx Ah, sorry, I missed that we actually trust the remotes with regard to the files in directories in order to make dvc operations faster. If your colleagues are deleting stuff randomly from the cloud, you might consider making your remote untrusted with:

dvc remote modify myremote verify true

that will make it paranoid again.
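
To make the "trusted remote" behaviour concrete, here is a toy model of the push decision for one tracked directory, written only from the description in this thread. It is not DVC's actual code: the function name `objects_to_push`, the `trust_remote` parameter, and the short hashes are all hypothetical, chosen for illustration.

```python
def objects_to_push(dir_hash, file_hashes, remote_objects, trust_remote=True):
    """Toy model of deciding what to upload for one tracked directory.

    dir_hash: hash of the directory listing (the ".dir" object)
    file_hashes: hashes of the files listed inside that directory
    remote_objects: set of hashes currently present on the remote
    trust_remote: when True, a present ".dir" object is taken as proof
        that every file it lists is present too (hypothetical flag)
    """
    if trust_remote and dir_hash in remote_objects:
        # Shortcut described in the thread: the individual files are
        # never checked, so a file deleted by hand goes unnoticed.
        return []
    wanted = [dir_hash, *file_hashes]
    return [h for h in wanted if h not in remote_objects]


# "f2" was deleted from the remote by hand; the ".dir" object is still there.
remote = {"aaa.dir", "f1"}

trusted = objects_to_push("aaa.dir", ["f1", "f2"], remote)
paranoid = objects_to_push("aaa.dir", ["f1", "f2"], remote, trust_remote=False)

# Removing the ".dir" object also defeats the shortcut.
remote.discard("aaa.dir")
after_removal = objects_to_push("aaa.dir", ["f1", "f2"], remote)

print(trusted)        # []
print(paranoid)       # ['f2']
print(after_removal)  # ['aaa.dir', 'f2']
```

In this model the missing file is only noticed when the shortcut is disabled or when the ".dir" object itself is absent from the remote.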

@efiop
Contributor

efiop commented Jul 3, 2020

@dldx But usually even during gc, we delete the .dir cache file first and then remove the actual files. Could you please double check that everyone is using the latest dvc version?

@efiop
Contributor

efiop commented Jul 3, 2020

Thinking about it, we could consider throwing a warning if we assume that a file is there but are not able to pull it.

@efiop
Contributor

efiop commented Jul 14, 2020

@dldx Any updates? 🙂

@efiop efiop closed this as completed Jul 21, 2020
@raharth

raharth commented Jun 11, 2021

Sorry to reopen this, but I have a nearly identical problem. Though, in my case it is not caused by deleting files in the remote, but by adding a new remote to which I want to push.

I have tried both suggestions, removing tmp/index and setting verify to true, but neither solves the problem. I only receive "Everything is up to date", which is incorrect since the files do not appear in the remote storage.

@efiop
Contributor

efiop commented Jun 11, 2021

@raharth Could you elaborate on how you are detecting that they do not appear? Also, please show dvc doctor output.

@raharth

raharth commented Jun 11, 2021

@efiop Thanks for your fast reply!

The remote is an Azure container to which I have access, hence I can see that there are no files appearing.

Platform: Python 3.8.9 on Windows-10-10.0.18362-SP0
Supports: All remotes
Cache types: hardlink
Cache directory: NTFS on D:\
Caches: local
Remotes: azure, azure
Workspace directory: NTFS on D:\
Repo: dvc, git

@efiop
Contributor

efiop commented Jun 11, 2021

@raharth Are any dvc files gitignored? Could you show dvc list . . --dvc-only, please?

@efiop
Contributor

efiop commented Sep 23, 2021

@vladimircape Could you show the contents of dvc_targets/checkpoints.dvc, please? There will be a 12345.dir md5 in there, which corresponds to the data directory that is tracked by it. It is possible that it was pushed by mistake before all files got transferred (with some older dvc version). Could you try manually removing 12/3456.dir from your remote and then dvc push-ing again?

From our side, it would be handy to add an option to ignore the index, of course.

@efiop
Contributor

efiop commented Sep 23, 2021

@vladimircape Could you try deleting s3://mybucket/myprefix/bf/2d7e8a1acdc2d4eda3e55f66fa3bda.dir on s3 by hand (using awscli or through the web UI), where s3://mybucket/myprefix is the URL of your dvc remote, and then try dvc push again, please?
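
The object paths quoted in this thread suggest that a hash is stored on the remote under a directory named after its first two characters, followed by the rest of the hash. A minimal sketch of that mapping, assuming this layout holds for your DVC version and remote (the helper name `remote_object_url` is hypothetical, and other DVC versions may lay out remotes differently):

```python
def remote_object_url(remote_url: str, md5: str) -> str:
    """Build the object URL for a hash, assuming a '<2 chars>/<rest>' layout."""
    return f"{remote_url.rstrip('/')}/{md5[:2]}/{md5[2:]}"


# The md5 value is the one recorded for the directory in the .dvc file.
url = remote_object_url(
    "s3://mybucket/myprefix", "bf2d7e8a1acdc2d4eda3e55f66fa3bda.dir"
)
print(url)  # s3://mybucket/myprefix/bf/2d7e8a1acdc2d4eda3e55f66fa3bda.dir
```

This only computes the path. Deleting the object is still done by hand with whatever tool you normally use for the remote.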

@themaikelman

themaikelman commented Mar 31, 2022

Hello! First of all I wanted to thank you for how well you always answer, it's really nice and trustworthy.

We have had the same problem, a dvc push of a folder with 41 files, which ended correctly (apparently, because there were no error messages) and yet only 40 files were uploaded.

Doing dvc pull on another server got us the error.

Since then there has been no way to force the push from the machine that has the files, because it only checks that the folder.dvc (f9c8def4b2a1a6b783209d933e26a6.dir) exists on the remote and not the files that are inside the folder.

Is there a way, such as dvc push --recursive or maybe dvc push --force, to make it try to upload the files again?

By the way, we deleted the file f9c8def4b2a1a6b783209d933e26a6.dir from the remote and this time the missing files were uploaded, but now we have the doubt if it has happened to us in other projects.
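
One way to check whether other directories are affected is to compare a local ".dir" cache file against a listing of the remote. The sketch below is an illustration under stated assumptions, not a DVC API: it assumes the ".dir" file is a JSON list of objects with "md5" and "relpath" keys, that remote keys follow the "<2 chars>/<rest>" layout seen in this thread, and that you have collected the remote keys yourself. The function name `missing_entries` and the sample data are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def missing_entries(dir_file, remote_keys):
    """Return (relpath, key) pairs listed in a ".dir" file but absent remotely.

    Assumes the ".dir" file is a JSON list of {"md5": ..., "relpath": ...}
    objects and that remote keys look like "<2 chars>/<remaining chars>".
    """
    entries = json.loads(Path(dir_file).read_text())
    missing = []
    for entry in entries:
        key = f"{entry['md5'][:2]}/{entry['md5'][2:]}"
        if key not in remote_keys:
            missing.append((entry["relpath"], key))
    return missing


# Demo with made-up data standing in for a real cache file and bucket listing.
with tempfile.TemporaryDirectory() as tmp:
    dir_file = Path(tmp) / "example.dir"
    dir_file.write_text(json.dumps([
        {"md5": "00aa11", "relpath": "a.tif"},
        {"md5": "ffbb22", "relpath": "b.tif"},
    ]))
    remote_keys = {"00/aa11"}  # keys you listed from the remote yourself
    result = missing_entries(dir_file, remote_keys)

print(result)  # [('b.tif', 'ff/bb22')]
```

Any entry it reports would then have to be uploaded by hand, or pushed again after removing the ".dir" object from the remote as described above.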

@themaikelman

More info:

atekoa@ubuntu:~/projects/test_dvc_issues$ dvc push -vvv 
2022-03-31 16:50:30,356 TRACE: Namespace(all_branches=False, all_commits=False, all_tags=False, cd='.', cmd='push', cprofile=False, cprofile_dump=None, func=<class 'dvc.commands.data_sync.CmdDataPush'>, glob=False, instrument=False, instrument_open=False, jobs=None, pdb=False, quiet=0, recursive=False, remote=None, run_cache=False, targets=[], verbose=3, version=None, viztracer=False, viztracer_depth=None, with_deps=False, yappi=False)
2022-03-31 16:50:30,531 TRACE: Assuming '/home/atekoa/projects/test_dvc_issues/.dvc/cache/30/f9c8def4b2a1a6b783209d933e26a6.dir' is unchanged since it is read-only
2022-03-31 16:50:30,532 TRACE: Assuming '/home/atekoa/projects/test_dvc_issues/.dvc/cache/30/f9c8def4b2a1a6b783209d933e26a6.dir' is unchanged since it is read-only
2022-03-31 16:50:30,598 DEBUG: Preparing to transfer data from '/home/atekoa/projects/test_dvc_issues/.dvc/cache' to 'https://myremote.com:443/remote?remote=6444'
2022-03-31 16:50:30,598 DEBUG: Preparing to collect status from 'https://myremote.com:443/remote?remote=6444'
2022-03-31 16:50:30,598 DEBUG: Collecting status from 'https://myremote.com:443/remote?remote=6444'
2022-03-31 16:50:30,599 DEBUG: Querying 1 hashes via object_exists
2022-03-31 16:50:31,546 DEBUG: Querying 0 hashes via object_exists
2022-03-31 16:50:31,547 DEBUG: Querying 1 hashes via object_exists
Everything is up to date.
2022-03-31 16:50:32,422 DEBUG: Analytics is disabled.

and, of course, dvc doctor:

atekoa@ubuntu:~/projects/test_dvc_issues$ dvc doctor
DVC version: 2.9.6.dev26+g12de7e7f 
---------------------------------
Platform: Python 3.8.10 on Linux-5.13.0-39-generic-x86_64-with-glibc2.29
Supports:
        azure (adlfs = 2021.10.0, knack = 0.8.2, azure-identity = 1.6.1),
        webhdfs (fsspec = 2022.2.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.2.0, boto3 = 1.21.9),
        ssh (sshfs = 2021.11.2)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p5
Caches: local
Remotes: https
Workspace directory: ext4 on /dev/nvme0n1p5
Repo: dvc, git

@pmrowla pmrowla removed the awaiting response we are waiting for your reply, please respond! :) label Apr 1, 2022
@pmrowla
Contributor

pmrowla commented Apr 1, 2022

@atekoa This is not currently planned, but as @efiop noted previously, it would be good for us to have a flag that allows force pushing directories. I've created a separate issue that you can follow to keep track of further updates on this.

@nielstenboom

Ran into the same issue atekoa describes well.

By the way, we deleted the file f9c8def4b2a1a6b783209d933e26a6.dir from the remote and this time the missing files were uploaded, but now we have the doubt if it has happened to us in other projects.

Resorted to doing this as well for our s3 remote and then pushing the files again worked. Something like a force push with a -f or --force flag would've been fantastic!
