Skip to content

Commit

Permalink
December community gems (#3058)
Browse files Browse the repository at this point in the history
* started December community gems

* added more questions and answers

* added last Q/A

* updated a question; minor style updates

* minor style changes

* Update content/blog/2021-12-21-December-21-community-gems.md

Co-authored-by: David de la Iglesia Castro <[email protected]>

* updated comment link

* updated Ilia question

* updated jan office hours description

* minor clean up

* added video link

* addressed more feedback

* fixed text

Co-authored-by: David de la Iglesia Castro <[email protected]>
  • Loading branch information
flippedcoder and daavoo authored Dec 21, 2021
1 parent fc9aed8 commit 043b92e
Show file tree
Hide file tree
Showing 2 changed files with 225 additions and 0 deletions.
225 changes: 225 additions & 0 deletions content/blog/2021-12-21-December-21-community-gems.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,225 @@
---
title: December '21 Community Gems
date: 2021-12-21
description: >
A roundup of technical Q&A's from the DVC community. This month: comparing
experiments, working with data, working with pipelines, and more.
descriptionLong: >
A roundup of technical Q&A's from the DVC community. This month: comparing
experiments, working with data, working with pipelines, and more.
picture: 2021-12-21/dec-community-gems.png
author: milecia_mcgregor
commentsUrl: https://discuss.dvc.org/t/december-21-community-gems/1001
tags:
- Data Versioning
- DVC Remotes
- DVC API
- DVC Stages
- Community
---

### [I'm using Google Drive as a remote storage and accidentally entered the verification from the wrong Google account. How can I edit that?](https://discord.com/channels/485586884165107732/563406153334128681/908437162150739978)

No problem @fireballpoint1! This happens sometimes.

You should be able to run the following command in your terminal and then
re-enter your credentials.

```dvc
$ rm .dvc/tmp/gdrive-user-credentials.json
```

That should give you a chance to enter the correct credentials when you try to
`dvc pull` again.

### [Can I add a `dvc remote` which refers to NAS by IP so I don't have to mount on every computer?](https://discord.com/channels/485586884165107732/563406153334128681/912667503283564544)

That's a new question for us @Krzysztof Begiedza!

If you enable the SSH service on your NAS, you can configure DVC to use it as an
SSH remote with `dvc remote add`.

There should also be DSM (Synology DiskStation Manager) packages for webdav as
well, if you prefer that over SSH. Just make sure that when you run
`dvc remote add -d storage <URL>`, your remote storage URL looks similar to
this.

```
webdav://<ip>/<path>
```

### [Can you selectively `dvc pull` data files?](https://discord.com/channels/485586884165107732/563406153334128681/913713923667148850)

Great question @Clemens!

You would run `dvc pull <file>` to get the files you want. You could also use
the `--glob` option on `dvc pull` and DVC will only pull the relevant files.

The command for that pull would be similar to this.

```dvc
$ dvc pull path/to/specific/file
```

You could also make a
[data registry](https://dvc.org/doc/use-cases/data-registries) and use
`dvc import` in other projects to get a specific dataset. That way you don't
have to do a granular pull.

### [What is the fastest way to get the specific value of a metric of an experiment based on experiment id?](https://discord.com/channels/485586884165107732/563406153334128681/916328260856590346)

That's a really good question @Kwon-Young!

You can always look through experiment metrics with `dvc exp show` and this
shows you all of the experiments you've run.

To get the metrics for a specific experiment or set of experiments, you'll need
the experiment ids and then you can use the Python API like this example.

```python
from dvc.repo import Repo

dvc = Repo(".") # or Repo("path/to/repo/dir")
metrics = dvc.metrics.show(revs=["exp-name1", "exp-name2", ...])
```

This returns a Python dictionary that contains what gets displayed in
`dvc metrics show --json` except you're able to specify the experiments you
want.

### [Is it possible to run the whole pipeline but only for one element of the `foreach`?](https://discord.com/channels/485586884165107732/563406153334128681/915986804577026088)

Another great question from @vgodie!

If your stages look something like this for example:

```yaml
stages:
cleanups:
foreach: # List of simple values
- raw1
- labels1
- raw2
do:
cmd: clean.py "${item}"
outs:
- ${item}.cln
train:
foreach:
- epochs: 3
thresh: 10
- epochs: 10
thresh: 15
do:
cmd: python train.py ${item.epochs} ${item.thresh}
```
You should try the following command:
```dvc
$ dvc repro cleanups@labels1
```

This will run your whole pipeline, but only with `labels1` in the `cleanups`
stage.

### [Is it possible to pull experiments from the remote without checking out the base commit of those experiments?](https://discord.com/channels/485586884165107732/485596304961962003/910481311905505290)

Thanks for the question @mattlbeck!

You should be able to do this with `dvc exp pull origin exp-name`.

If you have experiments with the same name on different commits, using
`exp-name` won't work since it defaults to selecting the one based on your
current commit if there are duplicate names.

To work around this, you can use the full refname, like
`refs/exps/e7/78ad744e8d0cd59ddqc65d5d698cf102533f85/exp-6cb7`, to specify the
experiments that you want to work with.

### [How should I handle checkpoints in PyTorch Lightning with DVCLive?](https://drive.google.com/file/d/1t0wPowk-PUinNjV4xchrzPZh7xsI8i37/view?usp=sharing)

This is a really good question that came from one of our Office Hours talks!
Thanks [Ilia Sirotkin](https://www.linkedin.com/in/sirily/)!

We have an [open issue](https://github.com/iterative/dvclive/issues/170) we
encourage you to follow for more details and to even contribute!

Python Lightning handles checkpoints differently from other libraries. This
affects the way metrics logging is executed and how models are saved.

You can write a custom callback to control saving everything and track it with
DVC and this is the workaround we suggest. You can implement the
`after_save_checkpoint` method and save the model file.

The way this works is by breaking your training process into small stages. You
should specify the stage’s checkpoint as the output of the stage and set it as a
dependency for the next stage. That way if something breaks, the `dvc repro`
command will resume your experiment from the last stage.

Your pipeline might look something like this:

```yaml
stages:
stage_0:
cmd: python train.py
outs:
- checkpoints/checkpoint_epoch=0.ckpt
next:
foreach:
- prev: 0
next: 1
- prev: 1
next: 2
do:
cmd: python train.py --checkpoint ${item.prev}
deps:
- checkpoints/checkpoint_epoch=${item.prev}.ckpt
outs:
- checkpoints/checkpoint_epoch=${item.next}.ckpt
```
Then you'll need to reuse the `ModelCheckpoint` that is included in
`pytorch_lightning` to capture the checkpoints. Here's a snippet of what that
could look like in your training script:

```python
# set checkpoint path
ckpt_path = os.path.abspath(os.path.join(os.path.dirname(__file__), "checkpoints"))
# checkpoints will be saved to checkpoints/checkpoint_epoch={epoch_number}.ckpt
cp = pl.callbacks.model_checkpoint.ModelCheckpoint(
monitor="train_loss_epoch",
save_top_k=1,
dirpath=ckpt_path,
filename='checkpoint_{epoch}')
```

### [Is there a feature for DVC to only sample and cache a subset of the tracked dataset, e.g. 10000 lines of a large file?](https://discord.com/channels/485586884165107732/485596304961962003/917778575845900340)

Really great question @Abdi!

You should be able to use the streaming capability of the DVC API to achieve
this goal.

Here is an example of a Python script that would do this:

```python
from dvc.api import open as dvcopen
with dvcopen('data',f'{repo_url}') as fd:
for line in fd:
print(line)
```

---

https://media.giphy.com/media/h5Ct5uxV5RfwY/giphy.gif

At our January Office Hours Meetup we will be looking at machine learning
workflows and Neovim-DVC plugin!
[RSVP for the Meetup here](https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282663146/)
to stay up to date with specifics as we get closer to the event!

[Join us in Discord](https://discord.com/invite/dvwXA2N) to get all your DVC and
CML questions answered!
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit 043b92e

Please sign in to comment.