Data sharing scenarios #784

Closed
wants to merge 3 commits into from
23 changes: 23 additions & 0 deletions src/Documentation/sidebar.json
@@ -119,6 +119,29 @@
"label": "Managing External Data",
"slug": "managing-external-data"
},
{
"label": "Data Sharing",
"slug": "data-sharing",
"source": "data-sharing/index.md",
"children": [
{
"label": "Remote DVC Storage",
"slug": "remote-storage"
},
{
"label": "Shared Development Server",
"slug": "shared-server"
},
{
"label": "Mounted DVC Storage",
"slug": "mounted-storage"
},
{
"label": "Synced DVC Storage",
"slug": "synced-storage"
}
]
},
{
"label": "Contributing",
"slug": "contributing",
43 changes: 43 additions & 0 deletions static/docs/user-guide/data-sharing/index.md
@@ -0,0 +1,43 @@
# Data Sharing and Collaboration with DVC

Like Git, DVC facilitates collaboration and data sharing in a distributed
environment. It makes it easy to consistently get all your data files and
directories to any machine, along with the matching source code.

![](/static/img/model-sharing-digram.png)

There are several ways to set up data sharing with DVC. We will discuss the most
common scenarios.

- [Sharing Data Through a Remote DVC Storage](/doc/user-guide/data-sharing/remote-storage)

This is the recommended and most common way to share data. In this case we
set up a [remote storage](/doc/command-reference/remote) on a data storage
provider to store data files online, where others can reach them. Currently
DVC supports Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage,
SSH, HDFS, and other remote locations, and the list is constantly growing.

- [Using Local Storage on a Shared Development Server](/doc/user-guide/data-sharing/shared-server)

Some teams may prefer using a single shared machine to run their experiments.
This allows them to have better resource utilization such as the ability to
use multiple GPUs, etc. In this case we can use a local data storage, which
allows the team to store and share data very efficiently, with no duplication
of data files and instantaneous transfer.

- [Sharing Data Through a Mounted DVC Storage](/doc/user-guide/data-sharing/mounted-storage)

If the data storage server (or provider) has a protocol that is not supported
yet by DVC, but it allows us to mount a remote directory on the local
filesystem, then we can still make a setup for data sharing with DVC. This
case might be useful for example when the data files are located on a
network-attached storage (NAS) and can be accessed through protocols like NFS,
Samba, SSHFS, etc.

- [Sharing Data Through a Synchronized DVC Storage](/doc/user-guide/data-sharing/synced-storage)

There are cloud data storage providers that are not supported yet by DVC. But
this does not mean that we cannot use them to share data with the help of DVC.
If it is possible to synchronize a local directory with a remote one (which is
supported by almost all storage providers), then we are good to go. We can
make a setup that allows us to share DVC data.
115 changes: 115 additions & 0 deletions static/docs/user-guide/data-sharing/mounted-storage.md
@@ -0,0 +1,115 @@
# Sharing Data Through a Mounted DVC Storage
Member

again, see my other comments. There are at least two possibilities - shared cache or shared remote. In case of NAS it's actually beneficial to share cache (and use some regular cloud remote to still do backups).

Contributor Author

> In case of NAS it's actually beneficial to share cache (and use some regular cloud remote to still do backups).

Sharing cache in the case of a NAS may cause problems when we try to use dvc gc. I remember seeing some discussions about this.

Member

Yes, and we even introduced a special flag - to pass multiple projects at once to dvc gc. Gc in DVC is a big pain still but it does not change the fact I mentioned above.

Contributor Author

> we even introduced a special flag - to pass multiple projects at once to dvc gc

The option -p, --projects of dvc gc gets a path to a project (at least this is how I understand the man page, I have never tried it).

In the case of a NAS mounted storage I assume that the collaborating projects are located on different machines, aren't they? So, the option -p, --projects cannot be used in this case.

Member

There are probably still ways to run GC, e.g. clone the projects to a single machine. It's not ideal (nothing about GC is), but it's a maintenance operation vs. the day-to-day workflow that is being optimized with links if you share the cache instead of sharing the remote directly.

Also, I think it is the same problem with your other cases, and in one of them it's about people sharing the same machine, right?

Member (@shcheklein, Nov 20, 2019)

I may extend the tutorial and the user-guide page to explain this optimization as well.

I would first understand the options in terms of organizing data, understand which of them are more general than others, and then try to come up with a couple of sections that explain them in a general way. By general I mean concepts like: is the cache shared or not? Do people use a single machine or not? etc.

Member

I don't see how my initial concern is resolved or addressed here. Please 🙏 , don't resolve them on your own - it makes it extremely hard to do reviews (check and follow up the previously raised concerns).

Member

To be precise - I don't see much value in three (two?) sections that explain different variations of the mounted remote. And to be even more precise - I haven't seen a mounted share remote case. The only benefit I see - unsupported storage type. I would just create a How-to or FAQ or something with a one- or two-paragraph explanation of the options - mount, use rsync/rclone, etc.

Contributor Author

> I don't see how my initial concern is resolved or addressed here.

I have added this page:

and this interactive example:

that explain the case of a mounted cache, which is more efficient if we share data through a NAS (with the caveat of being careful with the command dvc gc).

Member

It does not answer my question unless again I'm missing the whole point of this PR. Could you please elaborate on this:

> To be precise - I don't see much value in three (two?) sections that explain different variations of the mounted remote. And to be even more precise - I haven't seen a mounted share remote case. The only benefit I see - unsupported storage type. I would just create a How-to or FAQ or something with a one- or two-paragraph explanation of the options - mount, use rsync/rclone, etc.


If the data storage server (or provider) has a protocol that is not supported
yet by DVC, but it allows us to mount a remote directory on the local
filesystem, then we can still make a setup for data sharing with DVC.

This case might be useful when the data files are located on a network-attached
storage (NAS), for example, and can be accessed through protocols like NFS,
Samba, SSHFS, etc.

## SSHFS Mounted Storage Example

In this example we will see how to share data with the help of a storage
directory that is mounted through SSHFS. Normally we don't need to do this,
since we can
[use an SSH remote storage](https://katacoda.com/dvc/courses/examples/ssh-storage)
directly. We use it here only as an example, since it is easy to network-mount
a directory with SSHFS. Once you understand how it works, it should be easy to
apply the same approach to other types of mounted storage (like NFS, Samba,
etc.).

> For more detailed instructions check out this
> [interactive example](https://katacoda.com/dvc/courses/examples/mounted-storage).

<p align="center">
<img src="/static/img/user-guide/data-sharing/mounted-storage.png"/>
</p>

### Setup the server

We have to do these configurations on the SSH server:

- Create accounts for each user and add them to groups for accessing the Git
  repository and the DVC storage.
- Create a bare Git repository (for example in `/srv/project.git/`) and an empty
  directory for the DVC storage (for example in `/srv/project.cache/`).
- Grant users read/write access to these directories (through the groups).
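
The last two steps of this list can be sketched as a short script. This is a minimal sketch under stated assumptions, not an official procedure: `SRV_ROOT` is a hypothetical variable (on a real server it would be `/srv`) that defaults to a scratch directory, so that the commands can be tried without root. Creating the accounts and groups is left out because it depends on the system.

```shell
#!/bin/sh
# Sketch only: SRV_ROOT is a hypothetical prefix; use /srv on a real server.
SRV_ROOT="${SRV_ROOT:-$(mktemp -d)}"

# Bare Git repository that members of the owning group can push to.
git init --quiet --bare --shared=group "$SRV_ROOT/project.git"

# Empty directory for the DVC storage. The setgid bit (the leading 2) makes
# new files and subdirectories inherit the group of the directory, and 775
# gives that group write access to the directory itself.
mkdir -p "$SRV_ROOT/project.cache"
chmod 2775 "$SRV_ROOT/project.cache"

# On a real server, once the group exists (requires root), hand both
# directories over to it, for example:
#   chgrp -R <group> "$SRV_ROOT/project.git" "$SRV_ROOT/project.cache"
echo "Created Git repository and DVC storage under $SRV_ROOT"
```

Whether files created later are group-writable still depends on each user's umask, so that is worth checking on the server as well.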

### Setup each user

When accessing an SSH server, we usually want to generate an SSH key pair and
set up the SSH config so that we can log in without a password.

Let's assume that each user can use the private SSH key `~/.ssh/dvc-server` to
access the server without a password, and that we have also added lines like
these to `~/.ssh/config`:

```
Host dvc-server
HostName host01
User user1
IdentityFile ~/.ssh/dvc-server
IdentitiesOnly yes
```

Here `dvc-server` is the name or alias that we can use for our server, `host01`
can actually be the IP or the FQDN of the server, and `user1` is the username of
the first user on the server.
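
As a rough sketch of how a user might arrive at this configuration, the script below appends the entry to the SSH client config and tightens the file permissions. `SSH_DIR` is a hypothetical variable introduced for the illustration; the key generation and installation commands are shown only as comments, since they need the real server.

```shell
#!/bin/sh
# Sketch only: SSH_DIR is a hypothetical override for the usual ~/.ssh.
SSH_DIR="${SSH_DIR:-$HOME/.ssh}"
mkdir -p "$SSH_DIR"
chmod 700 "$SSH_DIR"

# Key generation and installation would normally come first, for example:
#   ssh-keygen -t ed25519 -f "$SSH_DIR/dvc-server"
#   ssh-copy-id -i "$SSH_DIR/dvc-server.pub" user1@host01

# Append the Host entry only once, so the script can be re-run safely.
if ! grep -qs '^Host dvc-server$' "$SSH_DIR/config"; then
    cat >> "$SSH_DIR/config" <<'EOF'
Host dvc-server
    HostName host01
    User user1
    IdentityFile ~/.ssh/dvc-server
    IdentitiesOnly yes
EOF
fi

# The config file should be writable only by its owner.
chmod 600 "$SSH_DIR/config"
```

After that, `ssh dvc-server` should log in without asking for a password.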

### Setup the DVC storage

First of all we have to mount the remote storage directory on a local directory.
With SSHFS (and the SSH configuration from the section above) it is as simple as
this:

```dvc
$ mkdir ~/project.cache
$ sshfs \
dvc-server:/srv/project.cache \
~/project.cache
```

Once it is mounted, the default storage configuration of the project can be done
like this:

```dvc
$ dvc remote add --local --default \
mounted-cache $HOME/project.cache
$ dvc remote list --local
mounted-cache /home/username/project.cache
```

Note that this configuration is specific to each user, so we have used the
`--local` option to save it in `.dvc/config.local`, which is ignored by
Git. This configuration file should now contain something like this:

```
['remote "mounted-cache"']
url = /home/username/project.cache
[core]
remote = mounted-cache
```

### Sharing data

After adding data to the project with `dvc add` or `dvc run`, it is stored in
`.dvc/cache`. We can push both the code changes and the data like this:

```dvc
$ git push
$ dvc push
```

The command `dvc push` copies the cached files from `.dvc/cache/` to
`~/project.cache/`. However, since this is a mounted directory, the cached files
are immediately copied to the server as well, and they become available on the
mounted directories of the other users. So, all the other users have to do in
order to receive the code changes and the data files is this:

```dvc
$ git pull
$ dvc pull
```
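
One thing to watch with this setup: if the SSHFS mount is not active, `~/project.cache` is just an ordinary local directory, and `dvc push` would copy files there without them ever reaching the server. A small guard can catch this. The sketch below assumes that the `mountpoint` utility (util-linux or BusyBox) is available; `storage_is_mounted` and `STORAGE_DIR` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the given directory is a mount point.
storage_is_mounted() {
    mountpoint -q "$1" 2>/dev/null
}

STORAGE_DIR="${STORAGE_DIR:-$HOME/project.cache}"

if storage_is_mounted "$STORAGE_DIR"; then
    # The mount is active, so pushed files end up on the server.
    git push && dvc push
else
    echo "warning: $STORAGE_DIR is not mounted; run sshfs first" >&2
fi
```

The same check makes sense before `dvc pull`, since an unmounted directory would simply look like an empty storage.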
193 changes: 193 additions & 0 deletions static/docs/user-guide/data-sharing/remote-storage.md
@@ -0,0 +1,193 @@
# Sharing Data Through a Remote DVC Storage
Member

it is strange to see that it contains only examples, no explanation of what's happening whatsoever. I would expect it to explain remotes way better - this is a primary purpose of this.

SSH example is too complicated -

  • the same dir for git and dvc on remote is too specific, very uncommon and distracts a lot
  • name of the remote storage should not use cache in it

Contributor Author

> it is strange to see that it contains only examples, no explanation of what's happening whatsoever

It is a modified version of this page: https://dvc.org/doc/use-cases/sharing-data-and-model-files
That one does not have much explanations either and things are explained mostly by the example.

Actually I don't find it feasible to explain a solution without using at least a few DVC commands, and for those commands to make sense they have to be used in the context of an example. So, the description mainly describes the situation, and the solution is described by the examples. The hope is that once readers have understood the solution, they can generalize and adapt it to their own case.

> I would expect it to explain remotes way better - this is a primary purpose of this.

I am planning to explain the details of the remotes (and their types) on another section. This section is about data sharing scenarios, so let's just refer to the remote details, but not include them here.

> the same dir for git and dvc on remote is too specific, very uncommon and distracts a lot

Yes, the Git repository is usually located on GitHub. But this is just an example, an assumption to keep things simple and interactive.

> name of the remote storage should not use cache in it

I tried to keep the analogy with Git. In Git a central bare repository is usually named project.git. So, a central DVC storage/cache is named project.cache.

Member

> I am planning to explain the details of the remotes (and their types) on another section. This section is about data sharing scenarios, so let's just refer to the remote details, but not include them here.

I don't think we need both; that was my point. I'm confused about why we need both. I think the remote section is enough. There should be some "DVC workflow" section that explains the workflow from a high-level perspective.

Contributor Author

> I don't think we need both

I don't think that these UG pages:
https://dvc-org-pr-807.herokuapp.com/doc/user-guide/external-data (from another PR)
can be merged with the pages of this PR. I think they should be separate sections.


This is the recommended and most common way to share data. In this case we
set up a [remote storage](/doc/command-reference/remote) on a data storage
provider to store data files online, where others can reach them. Currently DVC
supports Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage, SSH,
HDFS, and other remote locations, and the list is constantly growing.

## S3 Remote Example

As an example, let's take a look at how you could set up an S3
[remote storage](/doc/command-reference/remote) for a <abbr>DVC project</abbr>,
and push/pull to/from it.

### Create an S3 bucket

If you don't already have one available in your S3 account, follow the
instructions in
[Create a Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html).
As an advanced alternative, you may use the
[`aws s3 mb`](https://docs.aws.amazon.com/cli/latest/reference/s3/mb.html)
command instead.

### Setup DVC remote

To actually configure an S3 remote in the <abbr>project</abbr>, supply the URL
of the bucket where the data should be stored to the `dvc remote add` command.
For example:

```dvc
$ dvc remote add -d myremote s3://mybucket/myproject
Setting 'myremote' as a default remote.
```

> The `-d` (`--default`) option sets `myremote` as the default remote storage
> for this project.

This will add `myremote` to your `.dvc/config`. The `config` file now has a
section like this:

```dvc
['remote "myremote"']
url = s3://mybucket/myproject
[core]
remote = myremote
```

`dvc remote` provides a wide variety of options to configure an S3 bucket. For
more information see `dvc remote modify`.
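
For instance, assuming the bucket lives in the `us-east-2` region (an assumption made only for this illustration), running `dvc remote modify myremote region us-east-2` would be expected to add a `region` entry to the remote's section in `.dvc/config`, roughly like this:

```
['remote "myremote"']
url = s3://mybucket/myproject
region = us-east-2
```

Options that differ between users, such as credentials, are probably better kept out of the shared config, for example by using the `--local` flag.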

Let's commit your changes and push your code:

```dvc
$ git add .dvc/config
$ git commit -m "Configure remote storage"
$ git push
```

### Upload data and code

After adding data to the <abbr>project</abbr> with `dvc run` or other commands,
it should be stored in your local <abbr>cache</abbr>. Upload it to remote
storage with the `dvc push` command:

```dvc
$ dvc push
```

Code and [DVC-files](/doc/user-guide/dvc-file-format) should be committed and
pushed with Git.

### Download code

Please use regular Git commands to download code and DVC-files from your Git
servers. For example:

```dvc
$ git clone https://github.com/myaccount/myproject.git
$ cd myproject
```

or

```dvc
$ git pull
```

### Download data

To download data files for your <abbr>project</abbr>, run:

```dvc
$ dvc pull
```

`dvc pull` will download the missing data files from the default remote storage
configured in the `.dvc/config` file.
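
To check which remote `dvc pull` will use, `dvc remote list` and the `[core]` section of `.dvc/config` are the places to look. As an illustration only, the sketch below reads the default remote from a config file with `awk`. `default_remote` is a hypothetical helper, and the sample file simply reproduces the configuration from the S3 example; it is not how DVC itself parses the file.

```shell
#!/bin/sh
# Hypothetical helper: print the value of "remote" in the [core] section
# of a DVC config file, i.e. the default remote for dvc push and dvc pull.
default_remote() {
    awk '
        # A line starting with "[" opens a new section; remember if it is [core].
        /^\[/ { in_core = ($0 == "[core]") }
        # Inside [core], print the value after "remote =" and stop.
        in_core && /^[ \t]*remote[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")
            print
            exit
        }
    ' "$1"
}

# Sample config file with the content from the S3 example.
CONFIG="$(mktemp)"
cat > "$CONFIG" <<'EOF'
['remote "myremote"']
url = s3://mybucket/myproject
[core]
remote = myremote
EOF

default_remote "$CONFIG"
```

In a real project the argument would be `.dvc/config` (or `.dvc/config.local` for per-user settings).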

## SSH Remote Example

As another example, let's see how to set up an SSH remote storage for a project
and share data through it.

> For more detailed instructions check out this
> [interactive example](https://katacoda.com/dvc/courses/examples/ssh-storage).

In this example we will assume a central data storage server that can be
accessed through SSH by two different users. For the sake of example the
central Git repository will be located on this server too, but in general it can
be anywhere; it doesn't have to be on the same server as the DVC data storage.

<p align="center">
<img src="/static/img/user-guide/data-sharing/ssh-storage.png"/>
</p>

### Setup the server

Usually we need to do these configurations on an SSH server:

- Create accounts for each user and add them to groups for accessing the Git
  repository and the DVC storage.
- Create a bare Git repository (for example in `/srv/project.git/`) and an empty
  directory for the DVC storage (for example in `/srv/project.cache/`).
- Grant users read/write access to these directories (through the groups).

### Setup each user

When accessing an SSH server, we usually want to generate an SSH key pair and
set up the SSH config so that we can log in without a password.

Let's assume that each user can use the private SSH key `~/.ssh/dvc-server` to
access the server without a password, and that we have also added lines like
these to `~/.ssh/config`:

```
Host dvc-server
HostName host01
User user1
IdentityFile ~/.ssh/dvc-server
IdentitiesOnly yes
```

Here `dvc-server` is the name or alias that we can use for our server, `host01`
can actually be the IP or the FQDN of the server, and `user1` is the username of
the first user on the server.

### Setup DVC remote

The configuration of the project with the SSH remote storage can be done with a
command like this:

```dvc
$ dvc remote add --default \
ssh-cache ssh://dvc-server:/srv/project.cache
```

This command will add a default remote configuration to `.dvc/config` that looks
like this:

```
['remote "ssh-cache"']
url = ssh://dvc-server:/srv/project.cache
[core]
remote = ssh-cache
```

Note that this configuration is the same for all the users, so we can add it to
Git in order to share it with the other users:

```dvc
$ git add .dvc/config
$ git commit -m 'Add a SSH remote cache'
$ git push
```

### Sharing data

After adding data to the project with `dvc add` or `dvc run`, it is stored in
`.dvc/cache`. We can upload both the code changes and the data to the server
like this:

```dvc
$ git push
$ dvc push
```

On the other end, we can receive the code changes and data like this:

```dvc
$ git pull
$ dvc pull
```