Update dvc push documentation #203

Merged
merged 4 commits into from
Mar 18, 2019
6 changes: 3 additions & 3 deletions static/docs/commands-reference/pull.md
@@ -88,7 +88,7 @@ for more details.
Determines the files to download by searching the named directory and its
subdirectories for DVC files to download data for. Along with providing a
`target`, or `target` along with `--with-deps` it is yet another way to cut
the scope of DVC files to download files for.
the scope of DVC files to download.

* `-j JOBS`, `--jobs JOBS` - specifies number of jobs to run simultaneously
while downloading files from the remote cache. The effect is to control the
@@ -106,8 +106,8 @@ for more details.

## Examples

Using the `dvc pull` command remote storage to be defined. For an existing
projects is usually already defined and you can use `dvc remote list` to check
To use the `dvc pull` command, remote storage must be defined. For an existing
project, a remote is usually defined and you can use `dvc remote list` to check
existing remotes. As a reminder of how this is done, and to set the context for
the example, let's define an SSH remote with the `dvc remote add` command:

347 changes: 321 additions & 26 deletions static/docs/commands-reference/push.md
@@ -1,42 +1,128 @@
# push

This command pushes all data file caches related to the current Git branch to
the remote storage.
Uploads files and directories from the current branch in the local workspace to
the [remote storage](/doc/commands-reference/remote).
Member

We push data from cache based on DVC files in the working space. For example, (let's double check this), if I run something with --no-commit and then dvc push, data from the working space won't be uploaded to remote. Again, let's confirm and let's come with a better summary.


The command pushes only output of a specific stage if dvc file is specified `dvc push data.zip.dvc`.


See `dvc remote`, `dvc config` and
[remote storages](https://dvc.org/doc/get-started/configure)
for more information on how to configure the remote storage.
## Synopsis

```usage
usage: dvc push [-h] [-q] [-v] [-j JOBS] [-r REMOTE] [-a]
[-T] [-d]
[targets [targets ...]]
usage: dvc push [-h] [-q | -v] [-j JOBS] [--show-checksums]
[-r REMOTE] [-a]
[-T] [-d] [-R]
[targets [targets ...]]

positional arguments:
targets DVC files.

optional arguments:
-h, --help show this help message and exit
-q, --quiet Be quiet.
-v, --verbose Be verbose.
-j JOBS, --jobs JOBS Number of jobs to run simultaneously.
--show-checksums Show checksums instead of file names.
-r REMOTE, --remote REMOTE
Remote repository to push to
-a, --all-branches Push cache for all branches.
-T, --all-tags Push cache for all tags.
-d, --with-deps Push cache for all dependencies of the specified
target.
-R RECURSIVE, --recursive RECURSIVE
Push cache from subdirectories of specified directory.

```

## Description

The `dvc push` command is the twin pair to the `dvc pull` command, and together
Member

I see that a lot of users still don't understand how all these commands work (dvc status -c, dvc pull/push/fetch, etc). Could we think about some explanation similar to what we have in dvc add? It might be helpful to try making examples more detailed - show DVC file content, explain that it will extract checksums from it and will be pushing/pulling only those files to/from cache to/from remote.

they are the means for uploading and downloading data to and from remote storage.
[Data sharing](/doc/use-cases/share-data-and-model-files) across environments
and preserving data versions (input datasets, intermediate results, models,
metrics, etc) remotely (S3, SSH, GCS, etc) are the most common use cases for
these commands.

The `dvc push` command allows one to upload data to remote storage.

Under the hood a few actions are taken:

* The push command by default searches for files from all current DVC stages.
The command-line options listed below will either limit or expand the set of
stages to consult.
* For each file referenced from each selected stage there is a corresponding
entry in the local cache. DVC checks if the file exists, or not, in the remote
cache simply by looking for it using the checksum. From this DVC gathers a
list of files missing from the remote cache.
* DVC uploads the cache files missing from the remote cache, if any, to the remote.
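
The three steps above can be sketched in Python as a set difference over checksums. This is only an illustration of the logic, not DVC's actual implementation: the function name `files_to_upload`, the stage-to-checksum mapping, and the assignment of the sample checksums to stages are all assumptions made for the example.

```python
def files_to_upload(stage_outputs, selected_stages, remote_checksums):
    """Return the checksums that a push would need to upload.

    stage_outputs: dict mapping a stage (DVC file) name to the checksums
        of the files it references (hypothetical structure).
    selected_stages: the stages to consult (all of them by default).
    remote_checksums: checksums already present in the remote cache.
    """
    # Step 1: gather the checksums referenced by the selected stages.
    wanted = set()
    for stage in selected_stages:
        wanted.update(stage_outputs[stage])
    # Step 2: keep only those that the remote cache does not have yet.
    missing = wanted - set(remote_checksums)
    # Step 3 would upload each missing cache file; here we only list them.
    return sorted(missing)


# Hypothetical project with two stages (checksum assignment is made up).
stage_outputs = {
    "data.zip.dvc": ["0bd48000c6a4e359f4b81285abf059b5"],
    "model.p.dvc": [
        "4a8c47036c79c01522e79ac0f518d0f7",
        "02423d88d184649a7157a64f28af5a73",
    ],
}
remote = {
    "0bd48000c6a4e359f4b81285abf059b5",
    "4a8c47036c79c01522e79ac0f518d0f7",
}

to_upload = files_to_upload(stage_outputs, stage_outputs.keys(), remote)
print(to_upload)
```

Passing a single stage name as `selected_stages` models the effect of giving a `targets` argument: only the files referenced by that stage are compared against the remote.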

The `dvc push` command always works with a remote cache, and it is an error if
none is specified either on the command line or in the configuration. If the
`--remote REMOTE` option is not specified, then the default remote, configured
with the `core.remote` config option, is used. See `dvc remote`, `dvc config`
and this [example](/doc/get-started/configure) for more information on how to
configure a remote.

With no arguments, just `dvc push` or `dvc push --remote REMOTE`, it uploads
only the files (or directories) that are new in the local repository to the
remote cache. It will not upload files associated with earlier versions or
branches of the project directory, nor will it upload files which have not
changed.

The command `dvc status -c` can list files that are new in the local cache and
are referenced in the current workspace. It can be used to see what files
`dvc push` would upload.

The `dvc status -c` command can also show files which exist in the remote cache
but do not exist in the local cache. Running `dvc push` does not remove or
modify those files in the remote cache.

If one or more `targets` are specified, DVC only considers the files associated
with those stages. Using the `--with-deps` option DVC tracks dependencies
backward through the pipeline to find data files to push.
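
The backward search performed by `--with-deps` can be pictured as a graph traversal. The sketch below is an assumption about the general idea, not DVC's code: the `depends_on` mapping, the function name `stages_with_deps`, and the three stage names are hypothetical.

```python
def stages_with_deps(targets, depends_on):
    """Collect the target stages plus every stage they depend on.

    depends_on: dict mapping a stage name to the stages whose outputs it
        consumes (a hypothetical representation of the pipeline).
    """
    selected = []
    seen = set()
    stack = list(targets)
    while stack:
        stage = stack.pop()
        if stage in seen:
            continue
        seen.add(stage)
        selected.append(stage)
        # Walk backward: visit the stages this one depends on.
        stack.extend(depends_on.get(stage, []))
    return selected


# Hypothetical three-stage pipeline: prepare -> featurize -> train.
depends_on = {
    "prepare.dvc": [],
    "featurize.dvc": ["prepare.dvc"],
    "train.dvc": ["featurize.dvc"],
}

print(stages_with_deps(["featurize.dvc"], depends_on))
```

Naming the middle stage selects it and the stage before it, while `train.dvc`, which runs later in the pipeline, is left out.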

## Options

* `--show-checksums` - shows checksums instead of file names.

* `-r REMOTE`, `--remote REMOTE` - specifies which remote cache
(see `dvc remote list`) to push to. The value for `REMOTE` is a cache name
defined using the `dvc remote` command. If no `REMOTE` is given, or if no
remotes are defined in the workspace, an error message is printed. If the
option is not specified, then the default remote, configured with the
`core.remote` config option, is used.

* `-a`, `--all-branches` - determines the files to upload by examining files
associated with all branches of the DVC files in the project directory. It's
useful if branches are used to track "checkpoints" of an experiment or
project.

* `-T`, `--all-tags` - the same as `-a`, `--all-branches` but tags are used to
save different experiments or project checkpoints.

* `-d`, `--with-deps` - determines the files to upload by searching backwards
in the pipeline from the named stage(s). The only files which will be
considered are associated with the named stage, and the stages which execute
earlier in the pipeline.

* `-R`, `--recursive` - the `targets` value is expected to be a directory path.
With this option, `dvc push` determines the files to upload by searching the
named directory, and its subdirectories, for DVC files for which to upload
data. Along with providing a `target`, or `target` along with `--with-deps`,
it is yet another way to limit the scope of DVC files to upload.

* `-j JOBS`, `--jobs JOBS` - specifies number of jobs to run simultaneously
while uploading files to the remote cache. The effect is to control the
number of files uploaded simultaneously. Default is `4 * cpu_count()`. For
example with `-j 1` DVC uploads one file at a time, with `-j 2` it uploads
two at a time, and so forth. For SSH remotes default is set to 4.

* `-h`, `--help` - shows the help message and exits.

* `-q`, `--quiet` - does not write anything to standard output. Exits with 0 if
no problems arise, otherwise 1.

* `-v`, `--verbose` - displays detailed tracing information from executing the
`dvc push` command.
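
To make the effect of `-j JOBS` concrete, here is a minimal Python sketch of a bounded worker pool. It is not how DVC implements uploads; `upload_all`, `default_jobs`, and `fake_upload` are hypothetical names, and the only detail taken from the text above is the default of `4 * cpu_count()`.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_jobs():
    # Mirrors the documented default of 4 * cpu_count().
    return 4 * (os.cpu_count() or 1)


def upload_all(checksums, upload_one, jobs=None):
    """Run upload_one for each checksum, at most `jobs` at a time."""
    workers = jobs if jobs is not None else default_jobs()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(upload_one, checksums))


def fake_upload(checksum):
    # Placeholder for a real transfer; it only reports what it was given.
    return "uploaded " + checksum


results = upload_all(["aaa", "bbb", "ccc"], fake_upload, jobs=2)
print(results)
```

With `jobs=1` the pool processes one item at a time, and with `jobs=2` two at a time, which matches the behaviour described for `-j 1` and `-j 2`.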

## Examples

To use the `dvc push` command, remote storage must be defined. For an existing
Member

See the comment above ^^. I think it worth explaining and illustrating it with more details to show state (with tree .) before/after, show DVC file content, show that a referenced file is in cache or not in cache, etc. It's definitely worth explaining at least at one of those example.

Contributor Author

@robogeek robogeek Mar 15, 2019

I have put together an example focusing explicitly on what happens in the cache from dvc push operations. I've pushed it to the pull request so we can discuss whether this form is useful or not.

project, a remote is usually defined and you can use `dvc remote list` to check
existing remotes. As a reminder of how this is done, and to set the context for
the example, let's define an SSH remote with the `dvc remote add` command:

```dvc
$ dvc remote add r1 ssh://_username_@_host_/path/to/dvc/cache/directory
$ dvc remote list
r1 ssh://_username_@_host_/path/to/dvc/cache/directory
```

> DVC supports several protocols for remote storage. For details, see the
[`remote add`](/doc/commands-reference/remote-add) documentation.

Push all data file caches from the current Git branch to the default remote:

```dvc
@@ -53,7 +139,216 @@ Push all data file caches from the current Git branch to the default remote:
```

Push outputs of a specific dvc file:

```dvc
$ dvc push data.zip.dvc

[#################################] 100% data.zip
```

## Examples: With dependencies

Demonstrating the `--with-deps` flag requires a larger example. First, assume
a pipeline has been set up with these stages:

```dvc
$ dvc pipeline show

data/Posts.xml.zip.dvc
Posts.xml.dvc
Posts.tsv.dvc
Posts-test.tsv.dvc
matrix-train.p.dvc
model.p.dvc
Dvcfile
```

The local cache has been modified such that the data files in some of these
stages should be uploaded to the remote cache.

```dvc
$ dvc status --cloud

new: data/model.p
new: data/matrix-test.p
new: data/matrix-train.p
```

One could do a simple `dvc push` to share all the data, but what if you only want
to upload part of the data?

```dvc
$ dvc push --remote r1 --with-deps matrix-train.p.dvc

(1/2): [####################] 100% data/matrix-test.p data/matrix-test.p
(2/2): [####################] 100% data/matrix-train.p data/matrix-train.p

... Do some work based on the partial update

$ dvc push --remote r1 --with-deps model.p.dvc

(1/1): [####################] 100% data/model.p data/model.p

... Push the rest of the data

$ dvc push --remote r1

Everything is up to date.

$ dvc status --cloud

Pipeline is up to date. Nothing to reproduce.
```

With the first `dvc push` we specified a stage in the middle of the pipeline
while using `--with-deps`. This started with the named stage and searched
backwards through the pipeline for data files to upload. Because the stage
Member

Can we do a single space everywhere? :) (btw you have different styles in this document).

Contributor Author

Does anyone read the raw markdown? What people are reading is the rendered markdown as HTML on the website, or else in the github repository. The raw markdown is for editing. Once it is rendered the number of spaces after a period, how lines are wrapped into paragraphs, and so on, all that is disappeared into the rendered HTML.

Member

I consider writing docs (at least with Markdown or Latex) the same as coding. It's easier for everyone to write/code when there is a common style guide. In this specific case - there are some editors that automatically remove extra spaces (especially trailing). So, if it happens that someone creates a PR to fix a simple mistake there are chances we end up with a lot of unnecessary changes all over the file.

Contributor Author

Good points, and I'll keep this in mind. On the other hand consider how documentation editing is different than code editing. Inserting a few words in the middle of a paragraph causes the rest of the paragraph to reflow -- if one is manually adjusting the text to fit into 80 characters per line. Meaning, the one line with inserted words will overflow 80 columns, then every following line in that paragraph probably also overflows, resulting in an excessive diff.

For DVC docs I'm making sure to remove trailing spaces and to fit things into 80 columns. And I've adjusted the settings to insert spaces rather than tabs when hitting TAB.

named `model.p.dvc` occurs later in the pipeline, its data was not uploaded.

Later we ran `dvc push` specifying the stage `model.p.dvc`, and its data was
uploaded. Finally we ran `dvc push` without any targets, followed by
`dvc status --cloud`, to show that all data had been uploaded.

## Examples: What happens in the cache

Let's take a detailed look at what happens to the DVC cache as you run an
experiment in a local workspace and push data to a remote cache. To set the
stage consider having created a workspace that contains some code and data, and
having created a remote cache. In this section we'll show the cache of a very
simple project, but the details of this project do not matter so much as what
happens in the caches as data is pushed.

Some work has been performed in the local workspace, and it contains new data
to upload to the shared remote cache. When running `dvc status --cloud` the
report will list several files in `new` state. By looking in the cache
directories we can see exactly what that means.

```dvc
$ tree .dvc/cache
.dvc/cache
├── 02
│   └── 423d88d184649a7157a64f28af5a73
├── 0b
│   └── d48000c6a4e359f4b81285abf059b5
├── 38
│   └── 64e70211d3bdb367ad1432bfc14c1f.dir
├── 3f
│   └── 957fa0f1bb46534d07f4fc2116d73d
├── 4a
│   └── 8c47036c79c01522e79ac0f518d0f7
├── 5e
│   └── 4a7d0cbe26eda55624439661db925d
├── 6c
│   └── 3074754e3a9b563b62c8f1a38670dc
├── 77
│   └── bea77463abe2b7c6b4d13f00d2c7b4
├── 88
│   └── c3db1c257136090dbb4a7ddf31e678.dir
└── f4

10 directories, 9 files
$ tree ../vault/recursive
../vault/recursive
├── 0b
│   └── d48000c6a4e359f4b81285abf059b5
├── 4a
│   └── 8c47036c79c01522e79ac0f518d0f7
├── 6c
│   └── 3074754e3a9b563b62c8f1a38670dc
├── 88
│   └── c3db1c257136090dbb4a7ddf31e678.dir
└── f4
    └── 7482b18ecca728ba4ae931e5d568fb

5 directories, 5 files
```

The directory `.dvc/cache` is the local cache, while `../vault/recursive` is
the remote cache. This listing clearly shows the local cache has more files in
it than the remote cache. Therefore `new` literally means that new files exist
in the local cache relative to this remote cache.
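
The listings suggest a simple layout: the first two characters of a checksum name a subdirectory and the remaining characters name the file inside it. Under that assumption, the comparison behind `new` can be sketched in Python. The helper names (`cache_path`, `cached_checksums`, `make_cache`) are hypothetical, and the temporary directories only stand in for `.dvc/cache` and the remote.

```python
import os
import tempfile


def cache_path(cache_dir, checksum):
    # The first two characters name the subdirectory, the rest the file.
    return os.path.join(cache_dir, checksum[:2], checksum[2:])


def cached_checksums(cache_dir):
    """Rebuild the set of checksums stored under a cache directory."""
    found = set()
    for prefix in os.listdir(cache_dir):
        subdir = os.path.join(cache_dir, prefix)
        if os.path.isdir(subdir):
            for name in os.listdir(subdir):
                found.add(prefix + name)
    return found


def make_cache(root, name, checksums):
    # Helper for the example: create empty files in the cache layout.
    cache_dir = os.path.join(root, name)
    for checksum in checksums:
        path = cache_path(cache_dir, checksum)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    return cache_dir


with tempfile.TemporaryDirectory() as root:
    local = make_cache(root, "local", [
        "0bd48000c6a4e359f4b81285abf059b5",
        "02423d88d184649a7157a64f28af5a73",
    ])
    remote = make_cache(root, "remote", [
        "0bd48000c6a4e359f4b81285abf059b5",
    ])
    # Entries present locally but absent from the remote are the "new" ones.
    new_files = sorted(cached_checksums(local) - cached_checksums(remote))

print(new_files)
```

The set difference only goes one way here: entries that exist in the remote but not locally do not appear in `new_files`.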

Next we can upload part of the data from the local cache to remote cache using
the command `dvc push --with-deps STAGE-FILE.dvc`. Remember that `--with-deps`
searches backwards from the named stage to locate files to upload, and does not
upload files in subsequent stages.

After doing that we can inspect the remote cache again:

```dvc
$ tree ../vault/recursive
../vault/recursive
├── 0b
│   └── d48000c6a4e359f4b81285abf059b5
├── 38
│   └── 64e70211d3bdb367ad1432bfc14c1f.dir
├── 4a
│   └── 8c47036c79c01522e79ac0f518d0f7
├── 5e
│   └── 4a7d0cbe26eda55624439661db925d
├── 6c
│   └── 3074754e3a9b563b62c8f1a38670dc
├── 77
│   └── bea77463abe2b7c6b4d13f00d2c7b4
├── 88
│   └── c3db1c257136090dbb4a7ddf31e678.dir
└── f4
    └── 7482b18ecca728ba4ae931e5d568fb

8 directories, 8 files
```

The remote cache now has some of the files which had been missing, but not all
of them. Indeed `dvc status --cloud` still lists a couple of files as `new`. We
can see this directly: a couple of files are in the local cache and not in the
remote cache.

After running `dvc push` to cause all files to be uploaded the remote cache
now has all the files:

```dvc
$ tree ../vault/recursive
../vault/recursive
├── 02
│   └── 423d88d184649a7157a64f28af5a73
├── 0b
│   └── d48000c6a4e359f4b81285abf059b5
├── 38
│   └── 64e70211d3bdb367ad1432bfc14c1f.dir
├── 3f
│   └── 957fa0f1bb46534d07f4fc2116d73d
├── 4a
│   └── 8c47036c79c01522e79ac0f518d0f7
├── 5e
│   └── 4a7d0cbe26eda55624439661db925d
├── 6c
│   └── 3074754e3a9b563b62c8f1a38670dc
├── 77
│   └── bea77463abe2b7c6b4d13f00d2c7b4
├── 88
│   └── c3db1c257136090dbb4a7ddf31e678.dir
└── f4
    └── 7482b18ecca728ba4ae931e5d568fb

10 directories, 10 files

$ dvc status --cloud

Pipeline is up to date. Nothing to reproduce.

```

And running `dvc status --cloud` verifies that indeed there are no more files
to upload to the remote cache.

## Examples: Show checksums

Normally the file names are shown, but DVC can display the checksums instead.

```dvc
$ dvc push --remote r1 --show-checksums

(1/3): [####################] 100% 844ef0cd13ff786c686d76bb1627081c
(2/3): [####################] 100% c5409fafe56c3b0d4d4d8d72dcc009c0
(3/3): [####################] 100% a8c5ae04775fcde33bf03b7e59960e18
```