---
title: Converting a Jupyter Notebook to a DVC Project
date: 2022-07-28
description: >
  Working with notebooks is common in machine learning. That's why we're
  covering some tools that make it easy to do more with a complex project.
descriptionLong: >
  Once you've run some experiments in a Jupyter notebook, you know that you
  can't easily save and compare each experiment. If you're using the Jupyter
  VS Code extension, we can show you how to make those experiments
  reproducible with the addition of the DVC VS Code extension.
picture: 2022-07-28/jupyter-to-dvc.png
pictureComment: Using the DVC VS Code Extension with a Jupyter Notebook
author: milecia_mcgregor
commentsUrl: https://discuss.dvc.org/t/syncing-data-to-aws-s3/1192
tags:
- MLOps
- DVC
- Git
- VS Code
- Jupyter Notebooks
---

For many machine learning engineers, the starting point of a project is a
Jupyter notebook. This is fine for running a few experiments, but there comes a
point where you need to scale the project to accommodate hundreds or even
thousands more experiments. These experiments for your model will include
different hyperparameter values, different code, and potentially different
resources. It will be important to track the experiments you run so that when
you find an exceptional model, you'll be able to reproduce it and get it ready
for production.

In this tutorial, we're going to start a project with a Jupyter notebook in VS
Code. Then we'll convert it to a DVC pipeline to make the experiments
reproducible and use the DVC VS Code extension to run new experiments and
compare them all.
[Here's the project](https://github.com/iterative/stale-model-example/tree/jupyter-to-dvc)
we'll be working with.

## Start training with the notebook

Many times you'll start a machine learning (ML) project with a few cells in a
notebook just to test out some ideas. You might have a simple notebook where
you set some hyperparameters, load your data, train a model, evaluate its
metrics, and then save the model. That's what we're doing in the
`bicycle_experiments.ipynb` file.

```ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bike experiment notebook"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pickle\n",
    "import sys\n",
    "\n",
    "import numpy as np\n",
    "import yaml\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "import sklearn.metrics as metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get params"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Works\")\n",
    "\n",
    "params = yaml.safe_load(open(\"params.yaml\"))[\"train\"]\n",
    "\n",
    "input = \"./data/\"\n",
    "output = \"./models/model.pkl\"\n",
    "\n",
    "seed = params[\"seed\"]\n",
    "n_est = params[\"n_est\"]\n",
    "min_split = params[\"min_split\"]"
   ]
  },
  ...
```
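
The `Get params` cell reads its hyperparameters from a `params.yaml` file
sitting next to the notebook. Judging from the values in the experiments table
later in this post, it looks something like this:

```yaml
train:
  seed: 20210428
  n_est: 300
  min_split: 75
```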

We have all of the cells in place, so we can start running experiments. This
is usually fine for training models for a while, but eventually you'll reach a
point where you're powering through experiments for the day and want to
compare metrics across them. You might also end up with a great model without
knowing which code or data produced it.

That makes reproducing the experiment impossible and you're left with a great
model you may not be able to use in production. Once you reach the point where
you are trying to reproduce models or compare metrics from multiple experiments,
it might make sense to look at a data versioning and model experiment tracking
tool like DVC.

## Refactor the Jupyter notebook to use DVC

We're going to take the existing Jupyter notebook and break the cells out into
files and stages that DVC tracks for you. First, we'll create a `train.py` file
to handle the model training stage of the experiment. This file will have the
`Get params`, `Load training data`, `Train model`, and `Save model` cells from
the earlier notebook.
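
Since neither the notebook nor the finished scripts are shown in full here,
here's a minimal sketch of what `train.py` could look like once those cells
are moved over. The command-line arguments mirror the `train` stage in the
`dvc.yaml` shown later, and the assumption that `train.pkl` holds a pickled
`(features, labels)` pair is only for illustration:

```python
import os
import pickle
import sys

import yaml
from sklearn.ensemble import RandomForestClassifier

# Get params (from the "Get params" cell)
params = yaml.safe_load(open("params.yaml"))["train"]

# Paths come from the command line so DVC can pass them in
input_dir = sys.argv[1]
output_path = sys.argv[2]

seed = params["seed"]
n_est = params["n_est"]
min_split = params["min_split"]

# Load training data ("Load training data" cell); the (features, labels)
# layout of train.pkl is an assumption for this sketch
with open(os.path.join(input_dir, "train.pkl"), "rb") as fd:
    X_train, y_train = pickle.load(fd)

# Train model ("Train model" cell)
clf = RandomForestClassifier(
    n_estimators=n_est, min_samples_split=min_split, random_state=seed
)
clf.fit(X_train, y_train)

# Save model ("Save model" cell)
os.makedirs(os.path.dirname(output_path), exist_ok=True)
with open(output_path, "wb") as fd:
    pickle.dump(clf, fd)
```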

Next, we'll make an `evaluate.py` file that will take a saved model and get the
metrics for how well it performs. This file will have the `Set test variables`,
`Load model and test data`, `Get model predictions`,
`Calculate model performance metrics`, and `Save model performance metrics`
notebook cells.
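
And here's a comparable sketch of `evaluate.py`. The argument order follows
the `evaluate` stage in `dvc.yaml` below, the metric and plot file names match
what the pipeline expects, and the `(features, labels)` layout of `test.pkl`
is again an assumption:

```python
import json
import os
import pickle
import sys

import sklearn.metrics as metrics

# Set test variables; argument order matches the evaluate stage in dvc.yaml
model_path, data_dir, scores_file, prc_file, roc_file = sys.argv[1:6]

# Load model and test data
with open(model_path, "rb") as fd:
    model = pickle.load(fd)
with open(os.path.join(data_dir, "test.pkl"), "rb") as fd:
    X_test, y_test = pickle.load(fd)

# Get model predictions (probability of the positive class)
predictions = model.predict_proba(X_test)[:, 1]

# Calculate model performance metrics
precision, recall, prc_thresholds = metrics.precision_recall_curve(
    y_test, predictions
)
fpr, tpr, roc_thresholds = metrics.roc_curve(y_test, predictions)
avg_prec = metrics.average_precision_score(y_test, predictions)
roc_auc = metrics.roc_auc_score(y_test, predictions)

# Save model performance metrics in the files DVC reads for metrics and plots
with open(scores_file, "w") as fd:
    json.dump({"avg_prec": avg_prec, "roc_auc": roc_auc}, fd, indent=4)

with open(prc_file, "w") as fd:
    json.dump(
        {
            "prc": [
                {"precision": p, "recall": r, "threshold": t}
                for p, r, t in zip(precision, recall, prc_thresholds)
            ]
        },
        fd,
        indent=4,
    )

with open(roc_file, "w") as fd:
    json.dump(
        {
            "roc": [
                {"fpr": f, "tpr": t, "threshold": th}
                for f, t, th in zip(fpr, tpr, roc_thresholds)
            ]
        },
        fd,
        indent=4,
    )
```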

That's it! Now you have all of the steps that you executed in your Jupyter
notebook in a couple of files that you can easily edit and track across all of
your experiments. This is a great time to commit these changes to your Git repo
with the following commands:

```cli
$ git add train.py evaluate.py
$ git commit -m "converted notebook to DVC project"
```
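
If the repository isn't already a DVC project, this is also the moment to
initialize it (a one-time setup step):

```cli
$ dvc init
$ git commit -m "initialize DVC"
```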

## Look at metrics from experiments

Now we can create a DVC pipeline that executes these scripts to get the metrics
for your experiments. If you look in the project's `dvc.yaml`, you'll see the
stages we execute on an experiment run.

```yaml
stages:
  train:
    cmd: python src/train.py ./data/ ./models/model.pkl
    deps:
      - ./data/train.pkl
      - ./src/train.py
    params:
      - train.seed
      - train.n_est
      - train.min_split
    outs:
      - ./models/model.pkl
  evaluate:
    cmd:
      python ./src/evaluate.py ./models/model.pkl ./data scores.json prc.json
      roc.json
    deps:
      - ./data
      - ./models/model.pkl
      - ./src/evaluate.py
    metrics:
      - scores.json:
          cache: false
    plots:
      - prc.json:
          cache: false
          x: recall
          y: precision
      - roc.json:
          cache: false
          x: fpr
          y: tpr
```
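
If you'd rather build this file from the command line than write the YAML by
hand, `dvc stage add` can generate equivalent entries. Here's roughly what
that could look like for the `train` stage (check `dvc stage add --help` for
the full set of options in your DVC version):

```cli
$ dvc stage add -n train \
                -d ./data/train.pkl -d ./src/train.py \
                -p train.seed,train.n_est,train.min_split \
                -o ./models/model.pkl \
                python src/train.py ./data/ ./models/model.pkl
```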

The pipeline runs everything in the same order that the Jupyter notebook did,
but with a trackable structure. Now when you run `dvc exp run` to conduct an
experiment,
you can check out your metrics with either the CLI command `dvc exp show` or
with
[the DVC VS Code extension](https://marketplace.visualstudio.com/items?itemName=Iterative.dvc).

```dvctable
─────────────────────────────────────────────────────────────────────────────────────────────────────────────>
neutral:**Experiment** neutral:**Created** metric:**avg_prec** metric:**roc_auc** param:**train.seed** param:**train.n_est** param:**train.min_split** >
────────────────────────────────────────────────────────────────────────────────────────────────────────────>
**workspace** **-** **0.76681** **0.38867** **20210428** **300** **75** >
**jupyter-to-dvc** **Jul 18, 2022** **0.76681** **0.38867** **20210428** **300** **75** >
└── 4a070a7 [exp-b8925] Jul 18, 2022 0.76681 0.38867 20210428 300 75 >
────────────────────────────────────────────────────────────────────────────────────────────────────────────>
```

_with CLI tool_

![metrics in DVC VS Code extension](/uploads/images/2022-07-28/dvc-exp-in-VS Code.png)

_with DVC VS Code extension_
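
To fill this table with more experiments, you can run the pipeline again with
different hyperparameter values, for example (the new `n_est` value here is
only an illustration):

```cli
$ dvc exp run --set-param train.n_est=400
$ dvc exp show
```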

## Conclusion

In this post, we covered how to convert your Jupyter notebook into a DVC
project. When your project gets to the point where you need to go back to old
experiments, it's probably time to consider using something more advanced than
Jupyter notebooks alone. Keeping track of data versions across experiments,
along with the code that was used to run them, can get messy quickly, so it's
good to know about tools that can make it easier for you.