blog: DVC with Jupyter notebooks #96

efiop · 2018-10-21T23:01:26Z

No description provided.

efiop · 2018-10-22T17:08:06Z

colllin · 2019-01-09T16:26:25Z

If you have any quick tips here, I would appreciate them. I typically use notebooks for development and inline visualization, and I'm trying to migrate a project to dvc right now — my first dvc project 🎉. I'm thinking it might be best to develop and debug in the notebook as usual, then when I'm ready to run the notebook end-to-end, use e.g. dvc run -d train.ipynb -o training.html -o checkpoint.pt jupyter nbconvert --to html --execute train.ipynb.

efiop · 2019-01-09T17:00:04Z

Hi @colllin !

I see from the notebook name, train.ipynb, that you are splitting your pipeline into separate steps that you then plan to run using dvc. That is precisely what we usually recommend! You should be all set 🎉 Please don't hesitate to share your experience, we would really appreciate it. 🙂

mlisovyi · 2019-03-09T13:53:30Z

Any progress on such an example?

colllin · 2019-03-09T16:41:30Z

@mlisovyi I’ve been using Jupyter in a pipeline with a command like:

jupyter nbconvert Train.ipynb --clear-output --inplace --execute --ExecutePreprocessor.timeout=-1

This executes the notebook and overwrites it in-place, as if I had opened it in Jupyter and ran the entire notebook manually and saved it. I then commit the resulting notebook to git. I also specify some outputs which are cached: a directory for model checkpoints and a directory for logs. For dependencies, I specify the notebook itself as well as a directory of supporting modules.

I believe the initial command to set it up looked something like

dvc run -d Train.ipynb -d src/ -o checkpoints/ -o logs/ jupyter nbconvert Train.ipynb --clear-output --inplace --execute --ExecutePreprocessor.timeout=-1

You might also need to specify a name for the pipeline step somewhere in that command — I used train.dvc, which I can then execute using dvc repro train.dvc.
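As a rough sketch of the setup described above, the same command line can be assembled programmatically from the dependencies, outputs, and notebook command. The helper name build_dvc_run_command is hypothetical, and only the flags quoted in this comment are used; it makes no claim about how newer DVC versions expect stages to be defined.

```python
import shlex


def build_dvc_run_command(deps, outs, command):
    """Assemble `dvc run -d ... -o ... <command>` as an argument list.

    Hypothetical helper: it only mirrors the flags quoted in the comment above.
    """
    args = ["dvc", "run"]
    for dep in deps:
        args += ["-d", dep]
    for out in outs:
        args += ["-o", out]
    return args + list(command)


# Execute the notebook in place, with no per-cell timeout.
nbconvert_cmd = [
    "jupyter", "nbconvert", "Train.ipynb",
    "--clear-output", "--inplace", "--execute",
    "--ExecutePreprocessor.timeout=-1",
]

dvc_cmd = build_dvc_run_command(
    deps=["Train.ipynb", "src/"],
    outs=["checkpoints/", "logs/"],
    command=nbconvert_cmd,
)

# Print a shell-ready version of the command for inspection.
print(shlex.join(dvc_cmd))
```

Keeping the command as a list makes it easy to pass to subprocess.run without shell quoting issues.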

jorgeorpinel · 2021-09-27T00:24:06Z

Do we envision this as a regular part of our DVC user guide, or as a blog post? Cc WDYT @flippedcoder thanks

jorgeorpinel · 2021-10-12T01:54:33Z

Cc WDYT @flippedcoder thanks

Cc @jendefig

Also cc @dberenbaum — I think we discussed this topic at some point. Do you still have your DVC/Jupyter Notebook examples handy? Thanks

dberenbaum · 2021-10-12T12:36:11Z

There are a couple different ways to use notebooks with DVC.

The comments above are about running a notebook end-to-end as a DVC stage. I think the examples above give some good ideas about how best to do that.

Another way to integrate DVC and notebooks is to use DVC within the notebook. This could either be running DVC commands, like running experiments/stages from within the notebook, or doing some analysis or otherwise using artifacts or info from an existing DVC project. We plan to work on an experiments API in the future, which will probably be a good point at which to have some notebook examples like this.
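One way to picture "running DVC commands from within the notebook" is a thin wrapper around the dvc command-line tool. This is only a sketch under assumptions: run_dvc is a hypothetical helper, it assumes the dvc executable is on the PATH of the notebook kernel, and the runner argument exists purely so the function can be exercised without DVC installed.

```python
import subprocess


def run_dvc(*args, runner=subprocess.run):
    """Run `dvc <args>` and return its standard output as text.

    Hypothetical helper; `runner` defaults to subprocess.run and can be
    swapped for a stub when DVC is not available.
    """
    result = runner(["dvc", *args], check=True, capture_output=True, text=True)
    return result.stdout


# In a notebook cell inside a DVC project, one might then call, for example:
# print(run_dvc("status"))
# print(run_dvc("repro"))
```

With check=True, a failing DVC command raises subprocess.CalledProcessError, so the notebook cell fails visibly rather than continuing silently.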

jendefig · 2021-10-12T12:56:20Z

I would think a best-practices guide for migration would be good for both the docs and a blog post. Maybe show the different ways to do it in a series of blog posts?

daavoo · 2021-10-13T11:14:27Z

For me, the ultimate DVC-Jupyter integration (requiring quite some work) would be to provide users with something like custom IPython magic commands in order to generate DVC stuff. Similar to some of the functions that nbdev provides.

This would be in line with the workflow: 1. hacky prototype on notebook -> 2. move to python scripts -> 3. add DVC for reproducibility. These hypothetical DVC magic commands would help to go from 1 to 3 more easily.

For example (roughly speaking and with no details), given a jupyter cell:

EPOCHS = 10

for epoch in range(EPOCHS):
    print(epoch)

The user would add the magic commands:

%%dvc stage train

%dvc param
EPOCHS = 10

for epoch in range(EPOCHS):
    print(epoch)

And the commands would generate something like a Python script and update the DVC params/stage:

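# train.py
import sys

import yaml

if __name__ == "__main__":
    # The params file path is passed on the command line (see cmd in dvc.yaml).
    with open(sys.argv[1]) as f:
        params = yaml.safe_load(f)
    EPOCHS = params["train"]["EPOCHS"]

    for epoch in range(EPOCHS):
        print(epoch)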

# dvc.yaml
stages:
    train:
        cmd: python train.py params.yaml
        params:
            - train

# params.yaml
train:
    EPOCHS: 10

So the user goes from a Jupyter cell to being able to run dvc exp run -S train.EPOCHS=20
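To make the idea a little more concrete, here is a minimal sketch of the parameter-extraction step such a magic command might perform. Everything here is an assumption for illustration: no such DVC magic exists in this thread, extract_params is a hypothetical name, and the cell source is assumed to have had its magic lines stripped already. It collects top-level NAME = literal assignments and nests them under the stage name, mirroring the params.yaml layout above.

```python
import ast


def extract_params(cell_source, stage):
    """Collect top-level `NAME = <literal>` assignments from a cell.

    Hypothetical helper: returns a dict shaped like the params.yaml above.
    """
    params = {}
    for node in ast.parse(cell_source).body:
        is_simple_assign = (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
        )
        if not is_simple_assign:
            continue
        try:
            value = ast.literal_eval(node.value)
        except (ValueError, TypeError):
            # Not a plain literal (e.g. a function call), so not a parameter.
            continue
        params[node.targets[0].id] = value
    return {stage: params}


cell = "EPOCHS = 10\n\nfor epoch in range(EPOCHS):\n    print(epoch)\n"
print(extract_params(cell, "train"))
```

The resulting dictionary could then be written out as params.yaml with a YAML library; that step is left out to keep the sketch dependency-free.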

iesahin · 2021-10-13T15:07:14Z

That's a very nice idea @daavoo

I also believe that if there could be some kind of dependency resolution among the Jupyter cells, we could define and run the whole pipeline in a notebook.

%%stage params

EPOCHS=10

and another stage

%%stage train

model.train(epochs=EPOCHS)

Defining a pipeline like,

%%pipeline my-exp

%%depend train params

and running the experiment like

%%exp my-exp

one should be able to mimic most of the pipeline features. Later, it's possible to create DVC-files from these definitions by creating code files, params.yaml, etc.
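The dependency resolution among cells could be sketched as a topological sort over named cells. This is an illustration only: run_stages is a hypothetical helper, cells are represented as plain source strings rather than real notebook cells, and a simple list stands in for model.train so the example runs on its own. It relies on graphlib from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_stages(cells, depends_on):
    """Execute named cells in dependency order within one shared namespace.

    `cells` maps stage name -> source code; `depends_on` maps stage name ->
    set of stage names that must run first. Hypothetical helper.
    """
    order = list(TopologicalSorter(depends_on).static_order())
    namespace = {}
    for name in order:
        exec(cells[name], namespace)
    return order, namespace


cells = {
    "params": "EPOCHS = 10",
    # Stand-in for model.train(epochs=EPOCHS).
    "train": "history = list(range(EPOCHS))",
}
depends_on = {"train": {"params"}, "params": set()}

order, namespace = run_stages(cells, depends_on)
print(order)
print(namespace["history"])
```

TopologicalSorter raises graphlib.CycleError if the declared dependencies form a cycle, which is the kind of check a pipeline definition inside a notebook would need anyway.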

dberenbaum · 2021-10-13T17:09:37Z

Can we move this to https://github.com/iterative/dvc/discussions? We can keep this ticket to document patterns like #96 (comment), but the discussion is now moving towards new feature ideas. Also related to the above suggestions: iterative/dvc#6011.

jorgeorpinel · 2021-10-18T23:13:53Z

@daavoo @iesahin I agree with @dberenbaum the feature suggestions are great but should be in the core repo please 🙂

Is there a recommendation/decision as to writing docs or a blog based on current features? Thanks

casperdcl · 2022-01-25T15:09:17Z

I would strongly recommend taking a look at https://github.com/nteract/papermill which integrates quite nicely with DVC :)

Essentially substitute python script.py with papermill notebook.ipynb. There are also lots of ways to play around with params, deps & outputs.
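As a hedged sketch of that substitution, the command a DVC stage would run can be built like this. The helper name papermill_command and the output notebook name are assumptions for illustration; the papermill CLI form used here is papermill INPUT OUTPUT with -p NAME VALUE for each parameter.

```python
def papermill_command(input_nb, output_nb, parameters):
    """Build a `papermill` command line that injects the given parameters.

    Hypothetical helper; each parameter becomes a `-p NAME VALUE` pair.
    """
    cmd = ["papermill", input_nb, output_nb]
    for name, value in parameters.items():
        cmd += ["-p", name, str(value)]
    return cmd


# Where a stage would otherwise run `python script.py`, it could run this:
stage_cmd = papermill_command("notebook.ipynb", "output.ipynb", {"EPOCHS": 20})
print(" ".join(stage_cmd))
```

The input notebook would then be listed as a stage dependency and the executed output notebook as a stage output.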

jorgeorpinel · 2022-08-04T20:55:41Z

Ping @jendefig 🙂 (I think you were looking for ideas; well, this is the oldest open ticket in this repo)

jendefig · 2022-08-08T20:06:56Z

@jorge Thanks! @flippedcoder is finishing up one on this now here. Not sure she is familiar with Papermill @casperdcl. With what you know could you take a look and see if there are any significant advantages over the approach being used now?

casperdcl · 2022-08-10T16:21:07Z

done

jorgeorpinel · 2022-11-04T19:41:25Z

Does https://iterative.ai/blog/jupyter-notebook-dvc-pipeline close this? WDYT @dberenbaum @jendefig

Cc @RCdeWit is there an issue for the planned blog follow-up? (Getting to an actual pipeline)

Thanks

jendefig · 2022-11-04T20:50:07Z


@jorgeorpinel I would think that until the follow-up one to the papermill one is done this isn't quite closed. But plans for the next one are already in our backlog, so it won't be lost.

shcheklein · 2023-02-26T18:13:09Z

I think we can close this for now. No need to track it here as a separate issue.

efiop added the A: docs Area: user documentation (gatsby-theme-iterative) label Oct 21, 2018

shcheklein added the help wanted Contributors especially welcome label Nov 27, 2018

shcheklein added the use-cases label Mar 25, 2019

shcheklein changed the title ~~docs: add "Jupyter notebook" article to "Use Cases"~~ add "Jupyter notebook" article to "Use Cases" Mar 25, 2019

shcheklein added the type: enhancement Something is not clear, small updates, improvement suggestions label Mar 25, 2019

This was referenced May 5, 2019

Add Jupyter notebook article in use case #292

Closed

GSoD'19 Applicant - Tapasweni Pathak #299

Closed

dashohoxha mentioned this issue Oct 25, 2019

user-guide: restructure #745

Closed

10 tasks

shcheklein removed the use-cases label Nov 13, 2019

shcheklein changed the title ~~add "Jupyter notebook" article to "Use Cases"~~ add "Jupyter notebook" article Nov 13, 2019

jorgeorpinel changed the title ~~add "Jupyter notebook" article~~ guide/blog: add "Jupyter notebook" article Sep 27, 2021

jorgeorpinel changed the title ~~guide/blog: add "Jupyter notebook" article~~ guide or blog? Add "Jupyter notebook" article Sep 27, 2021

jorgeorpinel changed the title ~~guide or blog? Add "Jupyter notebook" article~~ blog: DVC with Jupyter notebooks Oct 12, 2021

jorgeorpinel added A: docs Area: user documentation (gatsby-theme-iterative) and removed A: docs Area: user documentation (gatsby-theme-iterative) labels Oct 12, 2021

iesahin added the C: blog TEMPORARY Content of /blog label Oct 21, 2021

jorgeorpinel added status: stale You've been groomed! and removed type: enhancement Something is not clear, small updates, improvement suggestions help wanted Contributors especially welcome labels Aug 4, 2022

jorgeorpinel mentioned this issue Nov 4, 2022

guide: mature/pretty project checklist #4099

Closed

2 tasks

shcheklein closed this as completed Feb 26, 2023
