Commit

use-cases: update shared dev server case, better explain "external" and "shared" cache
jorgeorpinel committed Apr 7, 2020
1 parent 774edc7 commit 030dfc5
Showing 2 changed files with 64 additions and 56 deletions.
84 changes: 46 additions & 38 deletions content/docs/use-cases/shared-development-server.md
@@ -3,102 +3,110 @@
Some teams may prefer using one single shared machine to run their experiments.
This allows better resource utilization, such as the ability to use multiple
GPUs, centralized data storage, etc. With DVC, you can easily set up shared data
storage on a server accessed by several users, in a way that enables almost
instantaneous <abbr>workspace</abbr> restoration/switching speed for everyone –
similar to `git checkout` for your code.
storage on a server accessed by several users or for any other reason, in a way
that enables almost instantaneous <abbr>workspace</abbr> restoration/switching
speed for everyone – similar to `git checkout` for your code.

![](/img/shared-server.png)

## Preparation

Create a shared directory to be used as the <abbr>cache</abbr> location for
everyone's <abbr>DVC projects</abbr>, so that all your colleagues can use the
same project cache:
Create a directory external to your <abbr>DVC projects</abbr> to be used as a
shared <abbr>cache</abbr> location for everyone's projects:

```dvc
$ mkdir -p /path/to/dvc-cache
$ mkdir -p /home/shared/dvc-cache
```

You will have to make sure that the directory has proper permissions setup, so
that all your colleagues can read and write to it, and can access cache files
written by others. The most straightforward way to do this is to make sure that
everyone's users are members of the same group, and that your shared cache
directory is owned by this group, with the aforementioned permissions.
> The `/home/shared` directory used as example above is typical in Linux
> distributions.

## Transfer existing cache (Optional)
Make sure that the directory has proper permissions, so that all your colleagues
can write to it, and can read cached files written by others. The most
straightforward way to do this is to make all users members of the same group,
and have the shared cache directory owned by that group.
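
The group setup described above could be sketched roughly as follows. This is only an illustration under assumptions: a temporary directory stands in for `/home/shared/dvc-cache` so that it runs without root, your own primary group plays the role of the team's group, and the setgid bit (the leading `2` in the mode) is an optional extra that the text above does not require.

```shell
set -eu

# Hypothetical stand-in for /home/shared/dvc-cache (no root needed).
CACHE_DIR="$(mktemp -d)/dvc-cache"
# Assumption: your primary group (numeric ID) acts as the shared team group.
SHARED_GROUP="$(id -g)"

mkdir -p "$CACHE_DIR"
chgrp "$SHARED_GROUP" "$CACHE_DIR"
# rwx for owner and group, r-x for others. The leading 2 sets the setgid bit,
# so entries created inside inherit the directory's group (on Linux).
chmod 2775 "$CACHE_DIR"

ls -ld "$CACHE_DIR"
```

On a real server you would replace the temporary path with the shared location and the group ID with the group that all your colleagues belong to.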

This step is optional. You can skip it if you are setting up a new DVC project
whose cache directory is not stored in the default location, `.dvc/cache`. If
you did work on your project with DVC previously and you wish to transfer your
cache to the shared cache directory (external to your workspace), you will need
to simply move it from an old cache location to the new one:
## Transfer existing cache (optional)

You can skip this part if you are setting up a new DVC project where the local
<abbr>cache directory</abbr> (`.dvc/cache` by default) hasn't been used.

If you did work on the <abbr>DVC project</abbr> previously and wish to transfer
its existing cache to the shared cache directory, you will simply need to move
its contents from the old location to the new one:

```dvc
$ mv .dvc/cache/* /path/to/dvc-cache
$ mv .dvc/cache/* /home/shared/dvc-cache
```

Now you need to ensure that cache files/directories have appropriate
permissions, so that they could be accessed by your colleagues that are members
of the same group:
Now, ensure that the cached directories and files have appropriate permissions,
so that they can be accessed by your colleagues (assuming their users are
members of the same group):

```dvc
$ sudo find /path/to/dvc-cache -type f -exec chmod 0664 {} \;
$ sudo find /path/to/dvc-cache -type d -exec chmod 0775 {} \;
$ sudo chown -R myuser:ourgroup /path/to/dvc-cache/
$ sudo find /home/shared/dvc-cache -type d -exec chmod 0775 {} \;
$ sudo find /home/shared/dvc-cache -type f -exec chmod 0664 {} \;
$ sudo chown -R myuser:ourgroup /home/shared/dvc-cache/
```
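
To check that the permission fixes above took effect, one option is to search for entries that are still not group-writable. The sketch below is an assumption-laden illustration: it builds a throwaway tree (with made-up names `ab/cdef`) instead of touching a real cache, and drops `sudo` because the current user owns everything in it.

```shell
set -eu

# Hypothetical throwaway tree standing in for /home/shared/dvc-cache.
CACHE_DIR="$(mktemp -d)/dvc-cache"
mkdir -p "$CACHE_DIR/ab"
echo "example" > "$CACHE_DIR/ab/cdef"   # made-up cache entry
chmod 0700 "$CACHE_DIR/ab"              # start from owner-only permissions
chmod 0600 "$CACHE_DIR/ab/cdef"

# Same fixes as in the commands above, without sudo since we own the tree.
find "$CACHE_DIR" -type d -exec chmod 0775 {} \;
find "$CACHE_DIR" -type f -exec chmod 0664 {} \;

# Audit: collect anything lacking the group-write bit (octal 020).
NOT_GROUP_WRITABLE="$(find "$CACHE_DIR" ! -perm -020)"
echo "entries without group write: ${NOT_GROUP_WRITABLE:-none}"
```

An empty result from the final `find` suggests every directory and file in the tree is writable by the group.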

## Configure shared cache
## Configure the external shared cache

Tell DVC to use the directory we've set up above as a shared cache location by
running:
Tell DVC to use the directory we've set up above as the <abbr>cache</abbr> for
your <abbr>project</abbr>:

```dvc
$ dvc config cache.dir /path/to/dvc-cache
$ dvc config cache.dir /home/shared/dvc-cache
```

And tell DVC to set group permissions on the newly created/downloaded cache
And tell DVC to set group permissions on newly created or downloaded cache
files:

```dvc
$ dvc config cache.shared group
```

Commit changes to `.dvc/config` and push them to your git remote:
> See `dvc config cache` for more information on these config options.

If you're using Git, commit changes to your project's config file (`.dvc/config`
by default):

```dvc
$ git add .dvc/config
$ git commit -m "dvc: shared external cache dir"
$ git commit -m "config external/shared DVC cache"
```

## Examples

You and your colleagues can work in your own separate <abbr>workspaces</abbr> as
usual, and DVC will handle all your data in the most effective way possible.
Let's say you are cleaning up the data:
Let's say you are cleaning up raw data for later stages:

```dvc
$ dvc add raw
$ dvc run -d raw -o clean ./cleanup.py raw clean
# The data is cached in the shared location.
$ git add raw.dvc clean.dvc
$ git commit -m "cleanup raw data"
$ git push
```

Your colleagues can [checkout](/doc/command-reference/checkout) the project
data, and have both `raw` and `clean` data files appear in their workspace
without moving anything manually. After this, they could decide to continue
building this pipeline and process the cleaned up data:
Your colleagues can [checkout](/doc/command-reference/checkout) the
<abbr>project</abbr> data (from the shared <abbr>cache</abbr>), and have both
`raw` and `clean` data files appear in their workspace without moving anything
manually. After this, they could decide to continue building this
[pipeline](/doc/command-reference/pipeline) and process the clean data:

```dvc
$ git pull
$ dvc checkout
# Data is linked from cache to workspace.
$ dvc run -d clean -o processed ./process.py clean processed
$ git add processed.dvc
$ git commit -m "process clean data"
$ git push
```
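
The "linked from cache to workspace" idea can be illustrated without DVC. The sketch below uses a plain hard link in a temporary directory; the file names are hypothetical, and this is not DVC's implementation. Which link type DVC actually uses (reflink, hardlink, symlink, or a copy) depends on the `cache.type` setting and on the file system.

```shell
set -eu

# Hypothetical layout: a "cache" and a "workspace" on the same file system.
WORK="$(mktemp -d)"
mkdir -p "$WORK/cache" "$WORK/workspace"
echo "cleaned rows" > "$WORK/cache/clean-data"

# Link instead of copy: both names refer to the same data on disk,
# so nothing is duplicated no matter how large the file is.
ln "$WORK/cache/clean-data" "$WORK/workspace/clean"

# -ef is true when both paths resolve to the same underlying file.
if [ "$WORK/cache/clean-data" -ef "$WORK/workspace/clean" ]; then
  LINKED=yes
else
  LINKED=no
fi
echo "workspace file shares storage with cache: $LINKED"
```

Because no data is copied, this kind of operation takes roughly the same time regardless of file size, which is why restoring a workspace from a shared cache can be almost instantaneous.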

And now you can just as easily make their work appear in your workspace by:
And now you can just as easily make their work appear in your workspace with:

```dvc
$ git pull
36 changes: 18 additions & 18 deletions content/docs/user-guide/managing-external-data.md
@@ -4,20 +4,20 @@ There are cases when data is so large, or its processing is organized in a way
that you would like to avoid moving it out of its external/remote location. For
example from a network attached storage (NAS) drive, processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or having a script that streams data
from S3 to process it. A mechanism for external outputs and
[external dependencies](/doc/user-guide/external-dependencies) provides a way
for DVC to control data externally.
from S3 to process it. External outputs and
[external dependencies](/doc/user-guide/external-dependencies) provide a way for
DVC to control data outside of the <abbr>project</abbr> directory.

## Description

DVC can track files on an external storage with `dvc add` or specify external
files as outputs for [DVC-files](/doc/user-guide/dvc-file-format) created by
`dvc run` (stage files). External outputs are considered part of the <abbr>DVC
project</abbr>. DVC will track changes in them and reflect this in the output of
files as <abbr>outputs</abbr> for [DVC-files](/doc/user-guide/dvc-file-format)
created by `dvc run` (stage files). External outputs are considered part of the
DVC project. DVC will track changes in them and reflect this in the output of
`dvc status`.

Currently, the following types (protocols) of external outputs (and cache) are
supported:
Currently, the following types (protocols) of external outputs (and
<abbr>cache</abbr>) are supported:

- Local files and directories outside of your <abbr>workspace</abbr>
- SSH
@@ -29,22 +29,22 @@ supported:
> `dvc remote`.

In order to specify an external output for a stage file, use the usual `-o` or
`-O` options of the `dvc run` command, but with the external path or URL to the
file in question. For <abbr>cached</abbr> external outputs (`-o`) you will need
to [setup an external cache](/doc/command-reference/config#cache) in the same
remote location. Non-cached external outputs (`-O`) do not require an external
cache to be setup.
`-O` options of `dvc run`, but with the external path or URL to the file in
question. For <abbr>cached</abbr> external outputs (`-o`) you will need to
[set up an external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
in the same external/remote file system first.

> Avoid using the same remote location that you are using for `dvc push`,
> `dvc pull`, `dvc fetch` as external cache for your external outputs, because
> it may cause possible file hash overlaps: The hash value of a data file in
> external storage could collide with that generated locally for another file.
> Avoid using the same location of the
> [remote storage](/doc/command-reference/remote) that you have for `dvc push`
> and `dvc pull` for external outputs or as external cache, because it may cause
> file hash overlaps: The hash value of a data file in external storage could
> collide with the one generated locally for another file.

## Examples

For the examples, let's take a look at a [stage](/doc/command-reference/run)
that simply moves a local file to an external location, producing a `data.txt.dvc`
stage file (DVC-file).
DVC-file.

> Note that some of these commands use the `/home/shared` directory, typical in
> Linux distributions.
