DVC fails to push data from external cache to default remote #4686

Closed
EmmaBYPeng opened this issue Oct 9, 2020 · 10 comments
Labels
p2-medium (Medium priority, should be done, but less important), question (I have a question?)

Comments

@EmmaBYPeng

EmmaBYPeng commented Oct 9, 2020

Bug Report

Please provide information about your setup

Output of dvc version:

$ dvc version
DVC version: 1.8.1 (pip)
---------------------------------
Platform: Python 3.7.4 on Darwin-19.5.0-x86_64-i386-64bit
Supports: gs, http, https
Cache types: <https://error.dvc.org/no-dvc-cache>
Repo: dvc, git

Use case

We want to track data on GCS using DVC (without downloading it to local machines), with an external cache and remote storage.

Steps

dvc remote add cache gs://my-project/datasets_dvc/cache  # external cache bucket
dvc config cache.gs cache
dvc remote add -d storage gs://my-project/datasets_dvc/storage  # storage bucket
dvc add --external gs://my-project/datasets/labeled_sentences.csv  # data added to the cache bucket
dvc push  # nothing showed up in the storage bucket

Issues

After the push, we got "Everything is up to date.", but nothing showed up in the default storage bucket (our cache bucket did have the cached data, though).

I'm new to DVC so I might have misunderstood the external data workflow. Please let me know if I missed anything in the above steps!

Reference: https://dvc.org/doc/user-guide/managing-external-data

@karajan1001
Contributor

$ dvc push # push only the local caches to the default remote.

And in the documentation:

  1. Tracking existing data on an external location with dvc add (this doesn't download it).

Maybe we have to pull it down first?

@pared
Contributor

pared commented Oct 9, 2020

@EmmaBYPeng
While this seems like a bug (since the push should move the data from cache to storage), maybe you actually do not need the storage? From the user's perspective, cache and remote are the same thing, and their structure is identical (assuming they contain the same data). So, if you want to use your cache later, you can specify it as a remote for other projects.
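
To illustrate why a cache and a remote can stand in for each other: both are content-addressed stores keyed by the file's MD5. The sketch below is not DVC code. It assumes the DVC 1.x object layout (a two-character prefix directory followed by the rest of the hash), and `object_relpath` and `object_url` are hypothetical helper names used only for illustration.

```python
import hashlib


def object_relpath(data: bytes) -> str:
    """Relative path of an object in a content-addressed store.

    Assumption: the DVC 1.x layout, <first 2 hex chars>/<remaining 30 chars>.
    """
    digest = hashlib.md5(data).hexdigest()
    return f"{digest[:2]}/{digest[2:]}"


def object_url(base_url: str, data: bytes) -> str:
    """Where the object would live under a given cache or remote URL."""
    return f"{base_url.rstrip('/')}/{object_relpath(data)}"


if __name__ == "__main__":
    data = b"sentence,label\nhello,1\n"
    # Same relative path under both buckets; only the base URL differs.
    print(object_url("gs://my-project/datasets_dvc/cache", data))
    print(object_url("gs://my-project/datasets_dvc/storage", data))
```

Since the relative path depends only on the content, a bucket filled as a cache has the same shape as one filled by a push, under the layout assumption above.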

What is your use case? Maybe we can work around that for now.

@karajan1001
I am afraid pulling won't help, as it's an external dependency - pull should actually check out from the gs cache to gs://my-project/datasets/labeled_sentences.csv

PS
I reproduced the behavior for the local machine and external dependency, and push successfully sends cache content to storage.

pared added the bug (Did we break something?) and p2-medium (Medium priority, should be done, but less important) labels on Oct 9, 2020
@EmmaBYPeng
Author

EmmaBYPeng commented Oct 9, 2020

Thanks for the response!

re same location for cache and remote: #3703 seems to suggest that it's bad to have the external cache and remote storage be the same thing

@pared I guess what we are looking for is to use DVC as a data registry to track data stored on GCS, which multiple developers (including CI) can read/write. We don't want to store the data locally since 1) the data is big, and 2) our ML pipelines need to directly read data from GCS.

Our workflow should look something like:

  • Person A commits a file (e.g. gs://shared-project/datasets/labeled_sentences.csv) to be tracked by DVC
  • Person A makes edits to the file, which are tracked and saved in the external cache
  • Person B checks out the file at a specific commit, also makes edits, and saves his edits in the external cache, while person A can still use the file at a different commit

My questions are:

  • Does this workflow make sense for the use case? or is there a better alternative?
  • Will A and B run into conflicts? e.g. after B checks out the file at a specific commit, does the file change on A's end?
  • Does it require one external cache per developer? (not sure if it's related to external workspaces #3920)

@pared
Contributor

pared commented Oct 9, 2020

@EmmaBYPeng

Does this workflow make sense? or is there a better alternative?

Your workflow makes perfect sense. If you want to store your data on gs there probably is not too much one can do.

Will A and B run into conflicts? e.g. after B checks out the file at a specific commit, does the file change on A's end?

Yes. Remember that DVC in this case is tracking an external file, so any dvc checkout operation affects this particular file, which has a constant URL - multiple users can start interfering with each other's work.
If 2 users edit this external file and commit at roughly the same time, we might get some unexpected results - in this case I would recommend using dvc get or dvc import to create a "temporary" place to work, and saving it once you are sure you won't interfere with each other's work.
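
The interference described above can be modeled with a toy example. This is only an illustration under simplifying assumptions, not how DVC is implemented: `SharedExternalFile` is a hypothetical class in which one attribute stands for the content at the single GCS URL and a dict stands for the shared external cache.

```python
import hashlib
from typing import Dict


class SharedExternalFile:
    """Toy model: one URL holds one content at a time; the cache keeps all versions."""

    def __init__(self) -> None:
        self.cache: Dict[str, bytes] = {}  # md5 -> content (the external cache)
        self.content = b""                 # what anyone reading the URL sees

    def commit(self, data: bytes) -> str:
        """Write new content to the URL and save it in the cache."""
        digest = hashlib.md5(data).hexdigest()
        self.cache[digest] = data
        self.content = data
        return digest

    def checkout(self, digest: str) -> None:
        """Restore a cached version to the URL, replacing the current content."""
        self.content = self.cache[digest]


if __name__ == "__main__":
    shared = SharedExternalFile()
    v1 = shared.commit(b"version 1")  # person A tracks the file
    v2 = shared.commit(b"version 2")  # person A edits it
    shared.checkout(v1)               # person B checks out the older version
    print(shared.content)             # person A now reads version 1 too
```

Both versions stay in the cache, but the URL can only expose one of them, which is why working on a separate copy avoids the conflict.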

Does it require one external cache per developer?

No, it's actually better to have a single cache.

@efiop
Contributor

efiop commented Oct 12, 2020

For the record: push/pull/fetch/status -c don't work for external outputs, only for local ones. This is by design, because our assumption is that you don't want to move external outputs out of the already configured external cache. I also have to point out that --external is an advanced experimental feature, and as @pared noticed, you probably don't want to use it in this case because of the described implications.
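
As a rough mental model of that behavior (a sketch, not DVC's actual implementation): push only considers outputs that live in the local workspace, so a project whose only output is a gs:// path has nothing to upload and reports that everything is up to date. The function names below are hypothetical, and treating any path with a URL scheme as external is a simplifying assumption.

```python
from typing import List
from urllib.parse import urlparse


def is_external(out_path: str) -> bool:
    """Assumption: a path with a URL scheme (gs://, s3://, ...) is external."""
    return urlparse(out_path).scheme != ""


def outputs_considered_by_push(out_paths: List[str]) -> List[str]:
    """Keep only local outputs; external ones stay in their external cache."""
    return [path for path in out_paths if not is_external(path)]


if __name__ == "__main__":
    outs = [
        "gs://my-project/datasets/labeled_sentences.csv",  # added with --external
        "data/train.csv",                                  # hypothetical local output
    ]
    print(outputs_considered_by_push(outs))
```

With only the gs:// output from the original steps, the filtered list is empty, which matches the observed message and the empty storage bucket.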

@karajan1001
Contributor

For the record: push/pull/fetch/status -c don't work for external outputs, only for local ones. This is by design, because our assumption is that you don't want to move external outputs out of the already configured external cache. I also have to point out that --external is an advanced experimental feature, and as @pared noticed, you probably don't want to use it in this case because of the described implications.

So this is an issue of dvc.org?

@pared
Contributor

pared commented Oct 12, 2020

@karajan1001

So this is an issue of dvc.org?

I guess so. Since we mention the requirement for an external cache in the case of external outputs, we could mention there that it is the cache that will store the dependencies, and that pushing them to other remotes will have no effect.

@karajan1001
Contributor

@karajan1001

So this is an issue of dvc.org?

I guess so. Since we mention the requirement for an external cache in the case of external outputs, we could mention there that it is the cache that will store the dependencies, and that pushing them to other remotes will have no effect.

Yes, and in the dvc push page we only push the local cache, not the external one.

@pared
Contributor

pared commented Oct 14, 2020

@karajan1001 @EmmaBYPeng @efiop I created an issue on docs to clarify this use case.

@efiop
Contributor

efiop commented Oct 19, 2020

Closing in favor of iterative/dvc.org#1865

efiop closed this as completed on Oct 19, 2020
4 participants