Regular updates (Apr 6) #1110

Merged
merged 8 commits into from
Apr 7, 2020
84 changes: 46 additions & 38 deletions content/docs/use-cases/shared-development-server.md
@@ -3,102 +3,110 @@
Some teams may prefer using one single shared machine to run their experiments.
This allows better resource utilization, such as the ability to use multiple
GPUs, centralized data storage, etc. With DVC, you can easily set up shared data
storage on a server accessed by several users, in a way that enables almost
instantaneous <abbr>workspace</abbr> restoration/switching speed for everyone –
similar to `git checkout` for your code.
storage on a server accessed by several users or for any other reason, in a way
that enables almost instantaneous <abbr>workspace</abbr> restoration/switching
speed for everyone – similar to `git checkout` for your code.

![](/img/shared-server.png)

## Preparation

Create a shared directory to be used as the <abbr>cache</abbr> location for
everyone's <abbr>DVC projects</abbr>, so that all your colleagues can use the
same project cache:
Create a directory external to your <abbr>DVC projects</abbr> to be used as a
shared <abbr>cache</abbr> location for everyone's projects:

```dvc
$ mkdir -p /path/to/dvc-cache
$ mkdir -p /home/shared/dvc-cache
```

You will have to make sure that the directory has proper permissions setup, so
that all your colleagues can read and write to it, and can access cache files
written by others. The most straightforward way to do this is to make sure that
everyone's users are members of the same group, and that your shared cache
directory is owned by this group, with the aforementioned permissions.
> The `/home/shared` directory used as an example above is typical in Linux
> distributions.

## Transfer existing cache (Optional)
Make sure that the directory has proper permissions, so that all your colleagues
can write to it, and can read cached files written by others. The most
straightforward way to do this is to make all users members of the same group,
and have the shared cache directory owned by that group.
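
One way to keep group ownership consistent, sketched here under assumptions this guide does not state, is to set the setgid bit on the shared directory so that files and subdirectories created inside it inherit its group. A temporary directory stands in for `/home/shared/dvc-cache`, and the `chgrp` step is left as a comment because `ourgroup` is only a placeholder group name:

```shell
# Sketch only: a throwaway directory stands in for /home/shared/dvc-cache.
cache_dir="$(mktemp -d)/dvc-cache"
mkdir -p "$cache_dir"

# On a real server you would first hand the directory to the shared group,
# for example (placeholder group name): sudo chgrp ourgroup "$cache_dir"

# 2775 = setgid bit, rwx for owner and group, r-x for others.
chmod 2775 "$cache_dir"

# Print the resulting octal mode (GNU stat syntax, as found on Linux).
mode="$(stat -c '%a' "$cache_dir")"
echo "$mode"
```

Whether setgid is appropriate depends on how your server manages groups; treat it as an option to evaluate rather than a required step.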

This step is optional. You can skip it if you are setting up a new DVC project
whose cache directory is not stored in the default location, `.dvc/cache`. If
you did work on your project with DVC previously and you wish to transfer your
cache to the shared cache directory (external to your workspace), you will need
to simply move it from an old cache location to the new one:
## Transfer existing cache (optional)

You can skip this part if you are setting up a new DVC project where the local
<abbr>cache directory</abbr> (`.dvc/cache` by default) hasn't been used.

If you worked on the <abbr>DVC project</abbr> previously and wish to transfer
its existing cache to the shared cache directory, you will simply need to move
its contents from the old location to the new one:

```dvc
$ mv .dvc/cache/* /path/to/dvc-cache
$ mv .dvc/cache/* /home/shared/dvc-cache
```

Now you need to ensure that cache files/directories have appropriate
permissions, so that they could be accessed by your colleagues that are members
of the same group:
Now, ensure that the cached directories and files have appropriate permissions,
so that they can be accessed by your colleagues (assuming their users are
members of the same group):

```dvc
$ sudo find /path/to/dvc-cache -type f -exec chmod 0664 {} \;
$ sudo find /path/to/dvc-cache -type d -exec chmod 0775 {} \;
$ sudo chown -R myuser:ourgroup /path/to/dvc-cache/
$ sudo find /home/shared/dvc-cache -type d -exec chmod 0775 {} \;
$ sudo find /home/shared/dvc-cache -type f -exec chmod 0664 {} \;
$ sudo chown -R myuser:ourgroup /home/shared/dvc-cache/
Comment on lines +39 to +46
Contributor Author

I'm not sure these commands will always work. What if the shared cache dir has some files your user can't access (if someone didn't do this process before, for example)? Wouldn't the command fail?

```
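
After changing permissions as above, a quick check can list any cached files that still lack group read or write access. The sketch below builds a throwaway directory with one deliberately private file. The paths and file names are hypothetical, and on a real cache containing files owned by other users the `find` call would likely need `sudo`, as in the commands above:

```shell
# Sketch only: build a small stand-in for the shared cache.
cache_dir="$(mktemp -d)"
mkdir -p "$cache_dir/ab"
touch "$cache_dir/ab/shared-file" "$cache_dir/ab/private-file"
chmod 0775 "$cache_dir/ab"
chmod 0664 "$cache_dir/ab/shared-file"
chmod 0600 "$cache_dir/ab/private-file"   # simulates a file colleagues cannot use

# List regular files missing group read (040) or group write (020) permission.
offenders="$(find "$cache_dir" -type f ! -perm -060)"
echo "$offenders"
```

An empty result suggests every cached file is usable by the group; any path printed points at a file that still needs its mode or ownership fixed.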

## Configure shared cache
## Configure the external shared cache

Tell DVC to use the directory we've set up above as an shared cache location by
running:
Tell DVC to use the directory we've set up above as the <abbr>cache</abbr> for
your <abbr>project</abbr>:

```dvc
$ dvc config cache.dir /path/to/dvc-cache
$ dvc config cache.dir /home/shared/dvc-cache
```
Comment on lines +49 to 56
Contributor Author

Why do we have all these exact and reproducible command guides in a use case? This section seems more like a user guide to me.

Member

yes, I agree ... most of it should be part of the shared cache article in the UG. And use case should be probably more generic "Optimizing data management" which can cover this one (multiple people working on a single box), k8s scenario - multiple machines + NAS, etc. We are long overdue on this. There were contributors that made an attempt, but didn't get it right.

Contributor Author

OK. I see we have #429 for something like this already. And #986 is also related. Should we try to consolidate them into a single issue and include the comments from this review?

Member

so, these two are mostly about improving the existing case - but they are relevant, up to you to consolidate or not ... there should be tickets related to this from other angles - shared cache on NAS/NFS for example

Member

Some other related links:

https://discuss.dvc.org/t/setup-dvc-to-work-with-shared-data-on-nas-server/180/4
#732
#455
#563

and others ... as a first step, someone should systematize these and suggest a way to structure this properly


And tell DVC to set group permissions on the newly created/downloaded cache
And tell DVC to set group permissions on newly created or downloaded cache
files:

```dvc
$ dvc config cache.shared group
```

Commit changes to `.dvc/config` and push them to your git remote:
> See `dvc config cache` for more information on these config options.

If you're using Git, commit changes to your project's config file (`.dvc/config`
by default):

```dvc
$ git add .dvc/config
$ git commit -m "dvc: shared external cache dir"
$ git commit -m "config external/shared DVC cache"
```

## Examples

You and your colleagues can work in your own separate <abbr>workspaces</abbr> as
usual, and DVC will handle all your data in the most effective way possible.
Let's say you are cleaning up the data:
Let's say you are cleaning up raw data for later stages:

```dvc
$ dvc add raw
$ dvc run -d raw -o clean ./cleanup.py raw clean
# The data is cached in the shared location.
$ git add raw.dvc clean.dvc
$ git commit -m "cleanup raw data"
$ git push
```

Your colleagues can [checkout](/doc/command-reference/checkout) the project
data, and have both `raw` and `clean` data files appear in their workspace
without moving anything manually. After this, they could decide to continue
building this pipeline and process the cleaned up data:
Your colleagues can [checkout](/doc/command-reference/checkout) the
<abbr>project</abbr> data (from the shared <abbr>cache</abbr>), and have both
`raw` and `clean` data files appear in their workspace without moving anything
manually. After this, they could decide to continue building this
[pipeline](/doc/command-reference/pipeline) and process the clean data:

```dvc
$ git pull
$ dvc checkout
# Data is linked from cache to workspace.
$ dvc run -d clean -o processed ./process.py clean processed
$ git add processed.dvc
$ git commit -m "process clean data"
$ git push
```
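
The "linked from cache to workspace" step is why checkout can be fast: when a link type such as hard links is in use, a workspace file is another name for the cached file rather than a copy. Which link types DVC actually uses depends on its `cache.type` setting and on the file system, so the following is only an illustration with plain coreutils, and the directory and file names are made up:

```shell
# Sketch only: one cache and two workspaces inside a throwaway directory.
root="$(mktemp -d)"
mkdir -p "$root/cache" "$root/alice" "$root/bob"
echo "cleaned rows" > "$root/cache/data"

# Each "checkout" adds a directory entry for the same inode; no data is copied.
ln "$root/cache/data" "$root/alice/clean"
ln "$root/cache/data" "$root/bob/clean"

# Compare inode numbers and read the link count (GNU stat syntax).
inode_cache="$(stat -c '%i' "$root/cache/data")"
inode_alice="$(stat -c '%i' "$root/alice/clean")"
inode_bob="$(stat -c '%i' "$root/bob/clean")"
link_count="$(stat -c '%h' "$root/cache/data")"
echo "$inode_cache $inode_alice $inode_bob $link_count"
```

Hard links only work within a single file system, which is one reason a cache on the same server as the workspaces suits this setup.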

And now you can just as easily make their work appear in your workspace by:
And now you can just as easily make their work appear in your workspace with:

```dvc
$ git pull
36 changes: 18 additions & 18 deletions content/docs/user-guide/managing-external-data.md
@@ -4,20 +4,20 @@ There are cases when data is so large, or its processing is organized in a way
that you would like to avoid moving it out of its external/remote location.
Examples include data on a network-attached storage (NAS) drive, processing data
on HDFS, running [Dask](https://dask.org/) via SSH, or having a script that streams data
from S3 to process it. A mechanism for external outputs and
[external dependencies](/doc/user-guide/external-dependencies) provides a way
for DVC to control data externally.
from S3 to process it. External outputs and
[external dependencies](/doc/user-guide/external-dependencies) provide a way for
DVC to control data outside of the <abbr>project</abbr> directory.

## Description

DVC can track files on external storage with `dvc add`, or specify external
files as outputs for [DVC-files](/doc/user-guide/dvc-file-format) created by
`dvc run` (stage files). External outputs are considered part of the <abbr>DVC
project</abbr>. DVC will track changes in them and reflect this in the output of
files as <abbr>outputs</abbr> for [DVC-files](/doc/user-guide/dvc-file-format)
created by `dvc run` (stage files). External outputs are considered part of the
DVC project. DVC will track changes in them and reflect this in the output of
`dvc status`.

Currently, the following types (protocols) of external outputs (and cache) are
supported:
Currently, the following types (protocols) of external outputs (and
<abbr>cache</abbr>) are supported:

- Local files and directories outside of your <abbr>workspace</abbr>
- SSH
@@ -29,22 +29,22 @@ supported:
> `dvc remote`.

In order to specify an external output for a stage file, use the usual `-o` or
`-O` options of the `dvc run` command, but with the external path or URL to the
file in question. For <abbr>cached</abbr> external outputs (`-o`) you will need
to [setup an external cache](/doc/command-reference/config#cache) in the same
remote location. Non-cached external outputs (`-O`) do not require an external
cache to be setup.
`-O` options of `dvc run`, but with the external path or URL to the file in
question. For <abbr>cached</abbr> external outputs (`-o`) you will need to
[set up an external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
in the same external/remote file system first.

> Avoid using the same remote location that you are using for `dvc push`,
> `dvc pull`, `dvc fetch` as external cache for your external outputs, because
> it may cause possible file hash overlaps: The hash value of a data file in
> external storage could collide with that generated locally for another file.
> Avoid using the same location of the
> [remote storage](/doc/command-reference/remote) that you have for `dvc push`
> and `dvc pull` for external outputs or as external cache, because it may cause
> file hash overlaps: The hash value of a data file in external storage could
> collide with the one generated locally for another file.
Comment on lines -38 to +41
Contributor Author

Improved this note but it's still pretty strange. Difficult to understand what it is about. The ambiguity of the term "remote" is giving us problems here. Not sure how to address, maybe just remove the note?

Member

yep, it's very complicated to simplify it :) the whole topic is very advanced and very technical. Let's keep it for now with your revisions and get back to it when we have more time to spend on all these data management problems.


## Examples

For the examples, let's take a look at a [stage](/doc/command-reference/run)
that simply moves a local file to an external location, producing a `data.txt.dvc`
stage file (DVC-file).
DVC-file.

> Note that some of these commands use the `/home/shared` directory, typical in
> Linux distributions.