
how-to: setup a shared cache (extracted from use cases) #2482

Merged 27 commits · Jun 29, 2021

Commits:
9bca7fe
cases&how: generalizing shared dev server case and
jorgeorpinel May 16, 2021
c776a77
cases: rewrite intro to Sharing Res and
jorgeorpinel May 17, 2021
4fd980f
Merge branch 'master' into cases/shareing-res
jorgeorpinel May 18, 2021
204e994
cases: quick updates per #2482 feedback
jorgeorpinel May 18, 2021
45165da
Merge branch 'master' into cases/shareing-res
jorgeorpinel May 22, 2021
3524113
cases: connect cache/remote solutions to Sharing Res problem stmt
jorgeorpinel May 22, 2021
9090ddf
cases: move img down, reorg some Ps, remove WIP comment
jorgeorpinel May 22, 2021
785c230
Merge branch 'master' into cases/shareing-res
jorgeorpinel May 23, 2021
62b4fed
ref: update destroy external cache example
jorgeorpinel May 23, 2021
7af8a34
Merge branch 'cases/shared-dev/external-cache' into cases/shareing-res
jorgeorpinel May 23, 2021
25cfd87
Merge branch 'cases/shared-dev/external-cache' into cases/shareing-res
jorgeorpinel May 23, 2021
6410f85
cases: add figure to Shared Resources intro
jorgeorpinel May 23, 2021
3a5f057
cases: wrap up Shared Resources story (1st version)
jorgeorpinel May 23, 2021
02e0f7e
cases: fix links to old case and
jorgeorpinel May 23, 2021
ec97b8f
Merge branch 'master' into cases/shareing-res
jorgeorpinel May 24, 2021
56d2a67
Merge branch 'master' into cases/shareing-res
jorgeorpinel May 31, 2021
6daa0c2
cases: address minor feedback
jorgeorpinel May 31, 2021
d2f2f93
cases: generalize info about basic caching data optimization
jorgeorpinel May 31, 2021
cc6149d
Merge branch 'master' into cases/shareing-res
jorgeorpinel Jun 1, 2021
e5b8dbb
cases: higher level solution intro
jorgeorpinel Jun 1, 2021
87ea9d2
cases: remove md comment... oops
jorgeorpinel Jun 1, 2021
35c69d1
cases: rewrite part about efficient processing
jorgeorpinel Jun 1, 2021
36b06a0
cases: roll back new use case
jorgeorpinel Jun 28, 2021
13e6fea
ref: remove an md ref link
jorgeorpinel Jun 28, 2021
a350e4d
guide: fix links that shouold go to now how-to
jorgeorpinel Jun 29, 2021
23bca98
guide: better intro for shared cache how-to
jorgeorpinel Jun 29, 2021
94c6cbd
cases: remove unnecessary images
jorgeorpinel Jun 29, 2021
5 changes: 3 additions & 2 deletions content/docs/sidebar.json
@@ -87,7 +87,7 @@
"slug": "sharing-data-and-model-files"
},
"data-registries",
-      "shared-development-server"
+      "sharing-resources-efficiently"
]
},
{
Expand Down Expand Up @@ -131,7 +131,8 @@
"stop-tracking-data",
"update-tracked-data",
"add-deps-or-outs-to-a-stage",
-      "merge-conflicts"
+      "merge-conflicts",
+      "share-a-dvc-cache"
]
},
{
79 changes: 79 additions & 0 deletions content/docs/use-cases/sharing-resources-efficiently.md
@@ -0,0 +1,79 @@
# Sharing Resources Efficiently

Data science teams need to handle large files, rotate the use of special
processors, and minimize data transfers. This involves provisioning and managing
resources such as massive on-prem data stores and powerful servers, which can be
expensive and time-consuming.

![](/img/shared-server.png) _Data store shared by DVC projects_

DVC projects support different ways to optimize resource utilization in order to
minimize cost and complexity. This can make a difference, for example, when:

- Multiple users work on the same shared server, or there's a single computing
environment to run experiments.
- GPU time gets distributed among people or processes for training machine
learning models.
- There's a centralized data storage unit or cluster.

## Shared Storage

Individual DVC projects already use a local data <abbr>cache</abbr> to achieve
near-instantaneous <abbr>workspace</abbr> restoration when switching among
[versions of data](/doc/use-cases/versioning-data-and-model-files), results,
etc. (think **Git for data**).

The cache directory is fully customizable (see `dvc config cache`), including
its location, so nothing prevents you from having it in a
[location shared](/doc/user-guide/how-to/share-a-dvc-cache) by multiple local
copies of a <abbr>project</abbr>, or even by different projects altogether. This
enables DVC's automatic de-duplication of data files across all projects.
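
As a minimal sketch of redirecting a project's cache (the shared path here is
hypothetical):

```dvc
$ dvc cache dir /mnt/shared/dvc-cache
$ cat .dvc/config
[cache]
    dir = /mnt/shared/dvc-cache
```

Every project copy configured this way reads and writes the same cache, so
identical data files are stored only once.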

Additionally, optional [remote storage](/doc/command-reference/remote) e.g.
Amazon S3 or Azure Blob Storage (managed separately) can be used by multiple
projects to synchronize their caches (see `dvc push` and `dvc pull`).
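
For example (the remote name and bucket below are hypothetical), two copies of
a project could synchronize their caches through a remote like so:

```dvc
$ dvc remote add -d storage s3://mybucket/dvcstore
$ dvc push    # upload this copy's cached data
# ...and from another copy of the project:
$ dvc pull    # download it into the local cache
```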

<!--

## Example: Shared Development Server

You and your colleagues can work in separate directories as usual, and DVC will
handle all your data in the most effective way possible. Let's say you are
cleaning up raw data for later stages:

```dvc
$ dvc add raw
$ dvc run -n clean_data -d cleanup.py -d raw -o clean \
./cleanup.py raw clean
# The data is cached in the shared location.
$ git add raw.dvc dvc.yaml dvc.lock .gitignore
$ git commit -m "cleanup raw data"
$ git push
```

Your colleagues can [checkout](/doc/command-reference/checkout) the data (from
the shared <abbr>cache</abbr>), and have both `raw` and `clean` data files
appear in their workspace without moving anything manually. After this, they
could decide to continue building this [pipeline](/doc/command-reference/dag)
and process the clean data:

```dvc
$ git pull
$ dvc checkout
A raw # Data is linked from cache to workspace.
$ dvc run -n process_clean_data -d process.py -d clean -o processed \
./process.py clean processed
$ git add dvc.yaml dvc.lock
$ git commit -m "process clean data"
$ git push
```

And now you can just as easily make their work appear in your workspace with:

```dvc
$ git pull
$ dvc checkout
A processed
```

-->
@@ -1,13 +1,21 @@
# Shared Development Server
---
title: 'How to Share a Cache Among Projects'
description: >-
Set up a single cache for different projects or distributed copies of the
same project.
---

Some teams may prefer using a single shared machine to run their experiments.
This allows better resource utilization, such as GPU access, centralized data
storage, etc. With DVC, you can easily setup shared data store on a server with
multiple users or processes. This enables near-instantaneous
<abbr>workspace</abbr> restoration and switching speeds for everyone – a
**checkout for data**.
# How to Share a DVC Cache

![](/img/shared-server.png) _Data store shared by DVC projects_
There are 2 main reasons to set up a shared <abbr>DVC cache</abbr>:
> **shcheklein (Member):** both items are pretty hard to read to be honest
>
> so, we have a few cases people would use this:
>
> - one large machine (e.g. with multiple GPUs) and one storage on it and
>   people do the same or multiple projects and we want to avoid duplication
>   (no copies) and save time (no copy)
> - one large NAS () and people attach it to their machines - again no copies -
>   fast, doesn't take extra space, fits even if there are no space (ability to
>   work with really large data)
>
> what else am I missing? can we make description more explicit/simpler?

> **jorgeorpinel (Contributor, Author):** Agreed. Updated, PTAL.


1. You have distributed copies of a DVC repository in a single shared server
with multiple users. A shared cache is necessary to avoid duplicating the
project's data on the single local storage available to all.
2. Your team works with multiple projects in environments with limited storage,
which share a large storage unit. Everyone needs to use the shared drive
anyway, and combining the cache locations will also prevent data duplication
(across projects).
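
In either scenario, the setup boils down to a few config commands. A hedged
sketch (the NAS path is hypothetical, and the right `cache.type` for your
filesystem may differ):

```dvc
$ dvc cache dir --local /nas/dvc-cache
$ dvc config --local cache.shared group   # make cache files group-writable
$ dvc config --local cache.type symlink   # link instead of copying, saving space
```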

## Preparation

@@ -77,44 +85,3 @@ If you're using Git, commit the changes to your project's config file (usually
$ git add .dvc/config
$ git commit -m "config external/shared DVC cache"
```

## Examples

You and your colleagues can work in your own separate <abbr>workspaces</abbr> as
usual, and DVC will handle all your data in the most effective way possible.
Let's say you are cleaning up raw data for later stages:

```dvc
$ dvc add raw
$ dvc run -n clean_data -d cleanup.py -d raw -o clean \
./cleanup.py raw clean
# The data is cached in the shared location.
$ git add raw.dvc dvc.yaml dvc.lock .gitignore
$ git commit -m "cleanup raw data"
$ git push
```

Your colleagues can [checkout](/doc/command-reference/checkout) the
<abbr>project</abbr> data (from the shared <abbr>cache</abbr>), and have both
`raw` and `clean` data files appear in their workspace without moving anything
manually. After this, they could decide to continue building this
[pipeline](/doc/command-reference/dag) and process the clean data:

```dvc
$ git pull
$ dvc checkout
A raw # Data is linked from cache to workspace.
$ dvc run -n process_clean_data -d process.py -d clean -o processed \
./process.py clean processed
$ git add dvc.yaml dvc.lock
$ git commit -m "process clean data"
$ git push
```

And now you can just as easily make their work appear in your workspace with:

```dvc
$ git pull
$ dvc checkout
A processed
```
1 change: 1 addition & 0 deletions redirects-list.json
@@ -31,6 +31,7 @@
"^/doc/tutorials(/.*)? /doc/start",

"^/doc/use-cases/data-and-model-files-versioning/?$ /doc/use-cases/versioning-data-and-model-files",
"^/doc/doc/use-cases/shared-development-server$ /doc/use-cases/sharing-resources-efficiently",
"^/doc/user-guide/updating-tracked-files$ /doc/user-guide/how-to/update-tracked-data",
"^/doc/user-guide/how-to/update-tracked-files$ /doc/user-guide/how-to/update-tracked-data",
"^/doc/user-guide/merge-conflicts$ /doc/user-guide/how-to/merge-conflicts",