diff --git a/content/docs/use-cases/shared-development-server.md b/content/docs/use-cases/shared-development-server.md
index 4827163ae4..f6b28056a1 100644
--- a/content/docs/use-cases/shared-development-server.md
+++ b/content/docs/use-cases/shared-development-server.md
@@ -3,102 +3,110 @@
 Some teams may prefer using one single shared machine to run their experiments.
 This allows better resource utilization, such as the ability to use multiple
 GPUs, centralized data storage, etc. With DVC, you can easily setup shared data
-storage on a server accessed by several users, in a way that enables almost
-instantaneous workspace restoration/switching speed for everyone –
-similar to `git checkout` for your code.
+storage on a server accessed by several users or for any other reason, in a way
+that enables almost instantaneous workspace restoration/switching
+speed for everyone – similar to `git checkout` for your code.
 
 ![](/img/shared-server.png)
 
 ## Preparation
 
-Create a shared directory to be used as the cache location for
-everyone's DVC projects, so that all your colleagues can use the
-same project cache:
+Create a directory external to your DVC projects to be used as a
+shared cache location for everyone's projects:
 
 ```dvc
-$ mkdir -p /path/to/dvc-cache
+$ mkdir -p /home/shared/dvc-cache
 ```
 
-You will have to make sure that the directory has proper permissions setup, so
-that all your colleagues can read and write to it, and can access cache files
-written by others. The most straightforward way to do this is to make sure that
-everyone's users are members of the same group, and that your shared cache
-directory is owned by this group, with the aforementioned permissions.
+> The `/home/shared` directory used as example above is typical in Linux
+> distributions.
 
-## Transfer existing cache (Optional)
+Make sure that the directory has proper permissions, so that all your colleagues
+can write to it, and can read cached files written by others. The most
+straightforward way to do this is to make all users members of the same group,
+and have the shared cache directory owned by that group.
 
-This step is optional. You can skip it if you are setting up a new DVC project
-whose cache directory is not stored in the default location, `.dvc/cache`. If
-you did work on your project with DVC previously and you wish to transfer your
-cache to the shared cache directory (external to your workspace), you will need
-to simply move it from an old cache location to the new one:
+## Transfer existing cache (optional)
+
+You can skip this part if you are setting up a new DVC project where the local
+cache directory (`.dvc/cache` by default), hasn't been used.
+
+If you did work on the DVC projects previously and wish to transfer
+its existing cache to the shared cache directory, you will simply need to move
+its contents from the old location to the new one:
 
 ```dvc
-$ mv .dvc/cache/* /path/to/dvc-cache
+$ mv .dvc/cache/* /home/shared/dvc-cache
 ```
 
-Now you need to ensure that cache files/directories have appropriate
-permissions, so that they could be accessed by your colleagues that are members
-of the same group:
+Now, ensure that the cached directories and files have appropriate permissions,
+so that they can be accessed by your colleagues (assuming their users are
+members of the same group):
 
 ```dvc
-$ sudo find /path/to/dvc-cache -type f -exec chmod 0664 {} \;
-$ sudo find /path/to/dvc-cache -type d -exec chmod 0775 {} \;
-$ sudo chown -R myuser:ourgroup /path/to/dvc-cache/
+$ sudo find /home/shared/dvc-cache -type d -exec chmod 0775 {} \;
+$ sudo find /home/shared/dvc-cache -type f -exec chmod 0664 {} \;
+$ sudo chown -R myuser:ourgroup /home/shared/dvc-cache/
 ```
 
-## Configure shared cache
+## Configure the external shared cache
 
-Tell DVC to use the directory we've set up above as an shared cache location by
-running:
+Tell DVC to use the directory we've set up above as the cache for
+your project:
 
 ```dvc
-$ dvc config cache.dir /path/to/dvc-cache
+$ dvc config cache.dir /home/shared/dvc-cache
 ```
 
-And tell DVC to set group permissions on the newly created/downloaded cache
+And tell DVC to set group permissions on newly created or downloaded cache
 files:
 
 ```dvc
 $ dvc config cache.shared group
 ```
 
-Commit changes to `.dvc/config` and push them to your git remote:
+> See `dvc config cache` for more information on these config options.
+
+If you're using Git, commit changes to your project's config file (`.dvc/config`
+by default):
 
 ```dvc
 $ git add .dvc/config
-$ git commit -m "dvc: shared external cache dir"
+$ git commit -m "config external/shared DVC cache"
 ```
 
 ## Examples
 
 You and your colleagues can work in your own separate workspaces as usual, and
 DVC will handle all your data in the most effective way possible.
 
-Let's say you are cleaning up the data:
+Let's say you are cleaning up raw data for later stages:
 
 ```dvc
 $ dvc add raw
 $ dvc run -d raw -o clean ./cleanup.py raw clean
+ # The data is cached in the shared location.
 $ git add raw.dvc clean.dvc
 $ git commit -m "cleanup raw data"
 $ git push
 ```
 
-Your colleagues can [checkout](/doc/command-reference/checkout) the project
-data, and have both `raw` and `clean` data files appear in their workspace
-without moving anything manually. After this, they could decide to continue
-building this pipeline and process the cleaned up data:
+Your colleagues can [checkout](/doc/command-reference/checkout) the
+project data (from the shared cache), and have both
+`raw` and `clean` data files appear in their workspace without moving anything
+manually. After this, they could decide to continue building this
+[pipeline](/doc/command-reference/pipeline) and process the clean data:
 
 ```dvc
 $ git pull
 $ dvc checkout
+ # Data is linked from cache to workspace.
 $ dvc run -d clean -o processed ./process.py clean process
 $ git add processed.dvc
 $ git commit -m "process clean data"
 $ git push
 ```
 
-And now you can just as easily make their work appear in your workspace by:
+And now you can just as easily make their work appear in your workspace with:
 
 ```dvc
 $ git pull
diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md
index 9dbd2e2e0f..49997a8437 100644
--- a/content/docs/user-guide/managing-external-data.md
+++ b/content/docs/user-guide/managing-external-data.md
@@ -4,20 +4,20 @@ There are cases when data is so large, or its processing is organized in a way
 that you would like to avoid moving it out of its external/remote location. For
 example from a network attached storage (NAS) drive, processing data on HDFS,
 running [Dask](https://dask.org/) via SSH, or having a script that streams data
-from S3 to process it. A mechanism for external outputs and
-[external dependencies](/doc/user-guide/external-dependencies) provides a way
-for DVC to control data externally.
+from S3 to process it. External outputs and
+[external dependencies](/doc/user-guide/external-dependencies) provide a way for
+DVC to control data outside of the project directory.
 
 ## Description
 
 DVC can track files on an external storage with `dvc add` or specify external
-files as outputs for [DVC-files](/doc/user-guide/dvc-file-format) created by
-`dvc run` (stage files). External outputs are considered part of the DVC
-project. DVC will track changes in them and reflect this in the output of
+files as outputs for [DVC-files](/doc/user-guide/dvc-file-format)
+created by `dvc run` (stage files). External outputs are considered part of the
+DVC project. DVC will track changes in them and reflect this in the output of
 `dvc status`.
 
-Currently, the following types (protocols) of external outputs (and cache) are
-supported:
+Currently, the following types (protocols) of external outputs (and
+cache) are supported:
 
 - Local files and directories outside of your workspace
 - SSH
@@ -29,22 +29,22 @@ supported:
 > `dvc remote`.
 
 In order to specify an external output for a stage file, use the usual `-o` or
-`-O` options of the `dvc run` command, but with the external path or URL to the
-file in question. For cached external outputs (`-o`) you will need
-to [setup an external cache](/doc/command-reference/config#cache) in the same
-remote location. Non-cached external outputs (`-O`) do not require an external
-cache to be setup.
+`-O` options of `dvc run`, but with the external path or URL to the file in
+question. For cached external outputs (`-o`) you will need to
+[setup an external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
+in the same external/remote file system first.
 
-> Avoid using the same remote location that you are using for `dvc push`,
-> `dvc pull`, `dvc fetch` as external cache for your external outputs, because
-> it may cause possible file hash overlaps: The hash value of a data file in
-> external storage could collide with that generated locally for another file.
+> Avoid using the same location of the
+> [remote storage](/doc/command-reference/remote) that you have for `dvc push`
+> and `dvc pull` for external outputs or as external cache, because it may cause
+> file hash overlaps: The hash value of a data file in external storage could
+> collide with the one generated locally for another file.
 
 ## Examples
 
 For the examples, let's take a look at a [stage](/doc/command-reference/run) that
 simply moves local file to an external location, producing a `data.txt.dvc`
-stage file (DVC-file).
+DVC-file.
 
 > Note that some of these commands use the `/home/shared` directory, typical in
 > Linux distributions.
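
Reviewer note: the permission step in the first file's diff (`find ... -type d -exec chmod 0775`, then `-type f -exec chmod 0664`) can be tried without `sudo` or a real `/home/shared` directory. The following is only a minimal sketch under stated assumptions: the temporary `CACHE_DIR` and the `ab/cdef` file are hypothetical stand-ins for a transferred cache, and the `chown` to a shared group is left out because it depends on the groups that exist on a given server.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-in for /home/shared/dvc-cache, so no sudo is needed.
CACHE_DIR="$(mktemp -d)/dvc-cache"

# Simulate a cache moved over from .dvc/cache with restrictive permissions.
mkdir -p "$CACHE_DIR/ab"
echo "cached data" > "$CACHE_DIR/ab/cdef"
chmod 0700 "$CACHE_DIR" "$CACHE_DIR/ab"
chmod 0600 "$CACHE_DIR/ab/cdef"

# Same permission fix as in the diff: directories first, then files.
find "$CACHE_DIR" -type d -exec chmod 0775 {} \;
find "$CACHE_DIR" -type f -exec chmod 0664 {} \;

# List anything that still has the wrong mode; no output means success.
find "$CACHE_DIR" -type d ! -perm 775
find "$CACHE_DIR" -type f ! -perm 664
```

On a real shared server the same two `find` commands would run against the shared cache path, followed by the `chown -R` from the diff so that the group ownership matches.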