shared cache and dvc import #4476

Closed
wdixon opened this issue Aug 26, 2020 · 9 comments
Labels
awaiting response we are waiting for your reply, please respond! :)

Comments

@wdixon

wdixon commented Aug 26, 2020

This is more of a question about setting up data registries and the implications of a shared cache with dvc import.

Presently I have a few datasets - each created as a separate git/dvc project (each in the 1000GB range, say).
Each dataset contains a group of specific images, along with several different annotation types.
Each dataset has been configured to use a separate (independent) shared cache on network attached storage - visible to several shared development server(s).

/network/storage/shared_dvc/cache/project_A
/network/storage/shared_dvc/cache/project_B
/network/storage/shared_dvc/cache/project_C

This part is working.

Now the question arises from consuming these registries with a 4th project (project_D). This project contains the code defining a DL network and a training script. The network consumes a composite of information contained in registries project_B and project_C (accomplished with dvc import).

It would seem unnecessary to duplicate the cache storage.

  1. Is there a way to share the existing caches for project_B and project_C?
  2. Should all these independent DVC/git projects be configured to use the same cache dir?
  3. Do we set up a shared cache for project_D, which will have its own independent shared cache/copy, duplicating a subset of project_B and project_C + whatever we are tracking in D?

The datasets eat up storage fairly quickly - I am looking for guidance on minimizing the impact of duplicate copies.

@wdixon
Author

wdixon commented Aug 26, 2020

Just to add: I did attempt to configure this so that all of the DVC projects (project_A, project_B, project_C, project_D) point to a single shared cache, and it appears to work. Any concerns?

@efiop
Contributor

efiop commented Aug 26, 2020

@wdixon Using the same shared cache dir is a good approach. The only thing that you need to be aware of is that dvc gc will need to know about all of those projects, or it might delete files that the other projects still use. See dvc gc --projects.
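
To make the dvc gc caveat concrete, here is a toy model in Python. It is not DVC's code: the cache is reduced to a set of object hashes, and garbage collection keeps only what the projects it was told about reference. The project names come from the question above; the hash strings and the `collect_garbage` helper are made up for illustration.

```python
def collect_garbage(cache, projects, known):
    """Return (kept, removed) cache entries when only `known` projects are consulted.

    Toy model only: a hypothetical helper, not DVC's implementation.
    """
    referenced = set()
    for name in known:
        referenced |= projects[name]
    kept = cache & referenced
    removed = cache - referenced
    return kept, removed


# Hypothetical object hashes referenced by each project.
projects = {
    "project_B": {"b1", "b2"},
    "project_C": {"c1"},
    "project_D": {"b1", "c1", "d1"},  # imports parts of B and C
}
shared_cache = set().union(*projects.values())

# gc that only knows about project_D drops an object project_B still needs.
_, removed_naive = collect_garbage(shared_cache, projects, known=["project_D"])

# gc that knows every project sharing the cache removes nothing in use.
_, removed_all = collect_garbage(shared_cache, projects, known=list(projects))

print(sorted(removed_naive))  # ['b2']
print(sorted(removed_all))    # []
```

With the real CLI, the equivalent of `known` is the list of project paths passed to dvc gc --projects; `dvc gc --help` shows which scope flags your version expects alongside it.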

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Aug 26, 2020
@wdixon
Author

wdixon commented Aug 26, 2020

Thank you. It might be nice to show an example of 2 projects using the same cache in the documentation. It wasn't clear on first reading that the shared cache was meant to be shared across projects. Going back and reading the page again, I do see that it says "everybody's projects".

@wdixon wdixon closed this as completed Aug 26, 2020
@efiop
Contributor

efiop commented Aug 26, 2020

@wdixon You mean in https://dvc.org/doc/use-cases/shared-development-server ? So your case is multiple projects and multiple users, right?

@wdixon
Author

wdixon commented Aug 26, 2020

Yes, that is the correct URL, and yes, multiple projects and multiple users. Initially I had set up separate caches for each project (which would be fine when the projects are completely independent). However, pulling in a portion of a data registry (through a dvc import) is where I encountered the cache duplication. I ended up trying a single cache directory for all the projects, which seems to be what you recommend, and now I don't face any duplicate cache storage.

Thanks for the help.
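
For anyone landing here later, the duplication described above can be sketched with a toy content-addressed cache in Python. This is an illustration under stated assumptions, not how DVC is implemented: `add_to_cache` and `cache_size` are hypothetical helpers, the directory names are invented, and SHA-256 is an arbitrary choice of hash for the sketch.

```python
import hashlib
import tempfile
from pathlib import Path


def add_to_cache(cache_dir: Path, data: bytes) -> Path:
    """Store `data` under a name derived from its content hash (toy model)."""
    digest = hashlib.sha256(data).hexdigest()
    target = cache_dir / digest[:2] / digest[2:]
    if not target.exists():  # identical content is stored only once per cache
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return target


def cache_size(cache_dir: Path) -> int:
    """Total bytes of all files under `cache_dir`."""
    return sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())


image = b"\x00" * 1024  # stand-in for one dataset file

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Separate caches: the registry and the importing project each keep a copy.
    separate = [root / "cache_project_B", root / "cache_project_D"]
    for cache in separate:
        add_to_cache(cache, image)
    separate_total = sum(cache_size(c) for c in separate)

    # One shared cache: the second add finds the object already present.
    shared = root / "cache_shared"
    add_to_cache(shared, image)
    add_to_cache(shared, image)
    shared_total = cache_size(shared)

print(separate_total, shared_total)  # 2048 1024
```

The same file costs twice the storage with two caches and only once with a shared one, which matches what happened when project_D imported from the registries.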

@shcheklein
Member

@wdixon btw, just in case you missed this - you might want to enable cache.type symlink. If your cache is located on a different volume or on a NAS, that would help you avoid copy operations from/to the cache on dvc add, dvc checkout, etc.
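
As a rough sketch of the setup, the small Python helper below builds the DVC commands to run inside each project. It only prints them and does not execute anything. The `shared_cache_commands` function is hypothetical, and the cache path is an assumed common parent of the per-project directories listed in the question; `cache.shared group` is included on the assumption that several users write to the same cache.

```python
import shlex
from typing import List


def shared_cache_commands(cache_dir: str, link_type: str = "symlink") -> List[List[str]]:
    """Build the DVC CLI calls for one project (the commands are not executed)."""
    return [
        ["dvc", "cache", "dir", cache_dir],          # point the project at the shared cache
        ["dvc", "config", "cache.shared", "group"],  # keep cache files group-writable
        ["dvc", "config", "cache.type", link_type],  # link workspace files instead of copying
    ]


# Assumed shared location; the thread only lists per-project subdirectories.
commands = shared_cache_commands("/network/storage/shared_dvc/cache")
for cmd in commands:
    print(shlex.join(cmd))
```

Running the printed commands in each of project_A through project_D would give all four the same cache directory and link type.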

@jorgeorpinel should we update the doc a bit to make it explicit that shared cache is also about sharing cache across different projects?

@jorgeorpinel
Contributor

Sure. I should be getting to that use case soon. Will keep this in mind 👍

@wdixon
Author

wdixon commented Aug 26, 2020

> @jorgeorpinel should we update the doc a bit to make it explicit that shared cache is also about sharing cache across different projects?

I do think a bit more in the docs about sharing a cache across projects would be helpful.
