blog: Converting a Jupyter Notebook to a DVC Project #3624

Closed
wants to merge 21 commits into from
Changes from 18 commits
283 changes: 283 additions & 0 deletions content/blog/2022-07-28-switching-to-dvc-from-jupyter-vscode.md
@@ -0,0 +1,283 @@
---
title: Converting a Jupyter Notebook to a DVC Project
date: 2022-07-28
description: >
  Working with notebooks is common in machine learning. That's why we're
  covering some tools that make it easy to do more with a complex project.
descriptionLong: >
  Once you've run some experiments in a Jupyter notebook, you know that you
  can't save each experiment. Now, if you're using the Jupyter VS Code
  extension, we can show you how to make those experiments reproducible with the
  addition of the DVC VS Code extension.
picture: 2022-07-28/jupyter-to-dvc.png
pictureComment: Using the DVC VS Code Extension with a Jupyter Notebook
Contributor

Suggested change
pictureComment: Using the DVC VS Code Extension with a Jupyter Notebook
pictureComment: Using the DVC Extension for VS Code with a Jupyter Notebook

author: milecia_mcgregor
commentsUrl: https://discuss.dvc.org/t/syncing-data-to-aws-s3/1192
tags:
- MLOps
- DVC
- Git
- VS Code
- Jupyter Notebooks
---

For many machine learning engineers, the starting point of a project is a
Jupyter notebook. This is fine for running a few experiments, but there comes a
point where you need to scale the project to accommodate hundreds or even
thousands more experiments. These experiments for your model will include
different hyperparameter values, different code, and potentially different
resources. It will be important to track the experiments you run so that when
you find an exceptional model, you'll be able to reproduce it and get it ready
for production.

In this tutorial, we're going to start a project with a Jupyter notebook in VS
Code. Then we'll convert it to a DVC pipeline to make reproducible experiments
and use the DVC VS Code extension to run new experiments and see how to compare
them all.
[Here's the project](https://github.com/iterative/stale-model-example/tree/jupyter-to-dvc)
jendefig marked this conversation as resolved.
we'll be working with.

## Start training with the notebook

Many times you'll start a machine learning (ML) project with a few cells in a
notebook just to test out some thoughts you have. So you might have a notebook
where you set some hyperparameters, load your data, train a model, and evaluate
its metrics. Then you might add more cells to save the model, run comparisons
with other models, have different versions of the same cells, or anything else
because you had more thoughts you want to test out. That's similar to what we're
doing in the `bicycle_experiments.ipynb` file.

![Jupyter notebook cells](/uploads/images/2022-07-28/jupyter-notebook.png)
Contributor

Does this need to be a screenshot, or could it be a code snippet?

Contributor Author

I can do a code snippet. I wasn't quite sure if it was better to visually show the cells or show the code.

Contributor

Sorry, but I don't understand why we would put a code snippet when the purpose of the post is to go from a Jupyter Notebook to the yaml files and this paragraph is referring to a Notebook. Can someone explain to me what I'm missing?

Contributor

I think there was confusion here, and I assume @RCdeWit did not intend to show the underlying .ipynb file contents! It would be great if we could show the notebook but still be able to copy and paste the code, but I guess maybe a screen shot is best unless there is a way to embed the notebook or some portion of it here.

Contributor Author

@flippedcoder flippedcoder Aug 12, 2022

Ok, cool. I also got confused here. 😅 I put the screenshot back and removed the .ipynb code, but I don't think there's a way to embed the notebook here.

Contributor

What do you think about starting with the parameters inside the notebook instead of reading them from params.yaml? I know it's harder to then convert to DVC, but it feels a bit unrealistic to be starting in a notebook but already have a parameters file configured.

Contributor

It looks like this was not addressed. And I'm finding that for some reason when the notebook opens, all the cells are in Markdown. I thought this was a problem in VS Code, but when I open in Jupyter in a browser it's the same. @RCdeWit we are going to have to figure out how to fix this notebook to make it more representative of real life.


We have all of the cells in place so we can start running experiments. This is
usually fine for training models for a while. Then it turns into a situation
where you have cells all over the place and some aren't useful after a certain
point, but they stay in the notebook, adding clutter and noise.

Eventually, you'll likely find a great model with your notebook experiments, but
you have no idea which cells you ran or which data was used to train this model.

That makes reproducing the experiment impossible and you're left with a great
model you may not be able to use in production. Once you reach the point where
you are trying to reproduce models or compare metrics from multiple experiments,
it might make sense to look at a data versioning and model experiment tracking
tool like DVC.

## Refactor the Jupyter notebook to Python scripts

We're going to take the existing Jupyter notebook and break the cells out into
files and stages that DVC tracks for you. First, we'll create a `train.py` file
to handle the model training stage of the experiment. This file will have the
`Get params`, `Load training data`, `Train model`, and `Save model` cells from
the earlier notebook.

```python
# train.py

import os
import pickle5 as pickle
import sys

import yaml
from sklearn.ensemble import RandomForestClassifier

# Read the hyperparameters for the train stage from params.yaml.
with open("params.yaml") as fd:
    params = yaml.safe_load(fd)["train"]

# Command-line arguments: the data directory and the path for the saved model.
input_dir = sys.argv[1]
output_path = sys.argv[2]
seed = params["seed"]
n_est = params["n_est"]
min_split = params["min_split"]

# Load the training data.
with open(os.path.join(input_dir, "train.pkl"), "rb") as fd:
    matrix = pickle.load(fd)

# Columns 1-10 hold the features and column 11 holds the labels.
labels = matrix.iloc[:, 11].values
x = matrix.iloc[:, 1:11].values

# Train the model.
clf = RandomForestClassifier(
    n_estimators=n_est, min_samples_split=min_split, n_jobs=2, random_state=seed
)

clf.fit(x, labels)

# Save the model.
with open(output_path, "wb") as fd:
    pickle.dump(clf, fd)
```
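If you'd rather pull the code out of the notebook programmatically than copy
each cell by hand, the sketch below shows one way to do it. It relies only on
the fact that an `.ipynb` file is JSON with a top-level `cells` list. The
`code_cells` helper and the example cell contents are made up for illustration
and are not part of the project.

```python
import json


def code_cells(notebook: dict) -> list[str]:
    """Return the source of every code cell, in notebook order."""
    cells = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip Markdown and raw cells
        source = cell.get("source", "")
        # The notebook format stores source as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        cells.append(source)
    return cells


# A tiny stand-in for bicycle_experiments.ipynb; the cell contents are invented.
example_notebook = json.loads("""
{
  "cells": [
    {"cell_type": "markdown", "source": ["# Get params"]},
    {"cell_type": "code", "source": ["seed = 1\\n", "n_est = 10"]},
    {"cell_type": "code", "source": "print(seed)"}
  ],
  "nbformat": 4,
  "nbformat_minor": 5
}
""")

extracted = code_cells(example_notebook)
print("\n\n".join(extracted))
```

For a real notebook you would load the file with `json.load` and then decide
which of the extracted cells belong in `train.py` and which in `evaluate.py`.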

Next, we'll make an `evaluate.py` file that will take a saved model and get the
metrics for how well it performs. This file will have the `Set test variables`,
`Load model and test data`, `Get model predictions`,
`Calculate model performance metrics`, and `Save model performance metrics`
Contributor

Similar to params.yaml above, it seems unlikely that someone is already saving the model and especially the performance metrics to a file if they are working in a notebook.

Contributor

This one also needs to be addressed in the notebook.

notebook cells.

```python
# evaluate.py

import json
import math
import os
import pickle5 as pickle
import sys

import sklearn.metrics as metrics
import numpy as np

# Command-line arguments: model file, data directory, and three output files.
model_file = sys.argv[1]
test_file = os.path.join(sys.argv[2], "test.pkl")
scores_file = sys.argv[3]
prc_file = sys.argv[4]
roc_file = sys.argv[5]

# Load the model and the test data.
with open(model_file, "rb") as fd:
    model = pickle.load(fd)

with open(test_file, "rb") as fd:
    matrix = pickle.load(fd)

x = matrix.iloc[:, 1:11].values

# Replace missing values with 0 before predicting.
cleaned_x = np.where(np.isnan(x), 0, x)
labels_pred = model.predict(cleaned_x)

# Probability of the positive class for each row.
predictions_by_class = model.predict_proba(cleaned_x)
predictions = predictions_by_class[:, 1]

print(predictions)

# NOTE: as written, the metrics below are scored against labels_pred, the
# model's own predictions, not against ground-truth labels from the test data.
precision, recall, prc_thresholds = metrics.precision_recall_curve(
    labels_pred, predictions, pos_label=1
)

fpr, tpr, roc_thresholds = metrics.roc_curve(labels_pred, predictions, pos_label=1)

avg_prec = metrics.average_precision_score(labels_pred, predictions)
roc_auc = metrics.roc_auc_score(labels_pred, predictions)

# Keep at most about 1000 points of the precision-recall curve.
nth_point = math.ceil(len(prc_thresholds) / 1000)
prc_points = list(zip(precision, recall, prc_thresholds))[::nth_point]

# Save the summary metrics.
with open(scores_file, "w") as fd:
    json.dump({"avg_prec": avg_prec, "roc_auc": roc_auc}, fd, indent=4)

# Save the precision-recall curve.
with open(prc_file, "w") as fd:
    json.dump(
        {
            "prc": [
                {"precision": p, "recall": r, "threshold": t}
                for p, r, t in prc_points
            ]
        },
        fd,
        indent=4,
    )

# Save the ROC curve.
with open(roc_file, "w") as fd:
    json.dump(
        {
            "roc": [
                {"fpr": fp, "tpr": tp, "threshold": t}
                for fp, tp, t in zip(fpr, tpr, roc_thresholds)
            ]
        },
        fd,
        indent=4,
    )
```
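The `nth_point` step in `evaluate.py` is easy to skim past, so here is the same
idea in isolation: take every n-th point so that a long curve shrinks to at most
1000 entries. This is a minimal sketch. The `downsample` helper is a
hypothetical name, the list of integers stands in for real curve points, and the
empty-list guard is an addition that the script itself does not have.

```python
import math


def downsample(points: list, max_points: int = 1000) -> list:
    """Keep every n-th point so that at most max_points remain."""
    if not points:
        return []  # avoids a slice step of zero
    stride = math.ceil(len(points) / max_points)
    return points[::stride]


# Stand-in data: 2500 integers instead of (precision, recall, threshold) tuples.
curve = list(range(2500))
sampled = downsample(curve)
print(len(curve), "->", len(sampled))
```

With 2500 points the stride is 3, so every third point is kept. Curves that
already have 1000 points or fewer are returned unchanged.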

Now you have all of the steps that you executed in your Jupyter notebook in a
couple of files that you can easily edit and track across all of your
experiments. This is a great time to commit these changes to your Git repo with
the following commands:

```cli
$ git add train.py evaluate.py
$ git commit -m "converted notebook to Python"
```

## Create the DVC pipeline

Now we can create a DVC pipeline that executes these scripts to record the code,
data, and metrics for each of your experiments. If you look in the project's
`dvc.yaml`, you'll see the stages we execute on an experiment run.

```yaml
stages:
  train:
    cmd: python src/train.py ./data/ ./models/model.pkl
    deps:
      - ./data/train.pkl
      - ./src/train.py
    params:
      - train.seed
      - train.n_est
      - train.min_split
    outs:
      - ./models/model.pkl
  evaluate:
    cmd:
      python ./src/evaluate.py ./models/model.pkl ./data scores.json prc.json
      roc.json
    deps:
      - ./data
      - ./models/model.pkl
      - ./src/evaluate.py
    metrics:
      - scores.json:
          cache: false
```

The `stages` tell DVC which steps you want to execute and what should happen in
each step. Usually, you'll execute a script or a command in each stage that may
link to the next stage in the pipeline via the `outs`. We only have 2 stages in
this pipeline: a `train` stage that handles the model training and outputs the
model and an `evaluate` stage that takes the model and stores some metrics about
it.
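To see why `train` has to run before `evaluate`, it can help to treat the
pipeline as a small graph: a stage depends on another stage whenever one of its
`deps` is that stage's output. The sketch below illustrates that idea with
Python's standard `graphlib` module. It is not how DVC is implemented, the
`execution_order` helper is a hypothetical name, and `scores.json` is treated
as an ordinary output for simplicity.

```python
from graphlib import TopologicalSorter

# deps and outs copied from dvc.yaml, with the leading "./" dropped.
# The metrics file is listed under "outs" here to keep the sketch small.
stages = {
    "train": {
        "deps": ["data/train.pkl", "src/train.py"],
        "outs": ["models/model.pkl"],
    },
    "evaluate": {
        "deps": ["data", "models/model.pkl", "src/evaluate.py"],
        "outs": ["scores.json"],
    },
}


def execution_order(stages: dict) -> list[str]:
    """Order stages so that each one runs after the stages it depends on."""
    # Map every output path to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # For each stage, collect the stages that produce one of its deps.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(stages))
```

Because `models/model.pkl` appears in the `outs` of `train` and in the `deps` of
`evaluate`, the training stage always comes first.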

Each of these stages has a `cmd` that executes the Python scripts we wrote with
the required arguments. They both have defined dependencies in `deps` that let
DVC know what needs to be available for a stage to execute before it starts
Contributor

To make it even more explicit, maybe add something along the lines here of

"As you can see, for example, the training stage is listed as a requirement for the evaluation stage. This ensures that the latter will only start once the first has been completed."

Contributor

And maybe as an admonition somewhere saying that when using dvc exp run only the stages downstream from your changes are triggered?

running. The `train` stage has some `params` that represent the hyperparameter
values we want to use in the current experiment. This is how DVC is able to
track the values used in each experiment.
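The entries under `params` are dotted keys into `params.yaml`, so `train.seed`
means "the `seed` value inside the `train` section". The sketch below shows that
lookup on a plain dictionary. The values are the ones listed in the experiments
table in this post, and the `lookup` helper is a hypothetical name used only for
illustration.

```python
from functools import reduce

# What params.yaml would contain once parsed; values taken from the
# experiments table in this post.
params = {"train": {"seed": 20210428, "n_est": 300, "min_split": 75}}


def lookup(params: dict, dotted_key: str):
    """Resolve a dotted key such as "train.seed" in a nested dictionary."""
    return reduce(lambda node, key: node[key], dotted_key.split("."), params)


tracked_keys = ["train.seed", "train.n_est", "train.min_split"]
tracked = {key: lookup(params, key) for key in tracked_keys}
print(tracked)
```

Only the keys listed in `dvc.yaml` are tracked for a stage, so other values in
the same file would not show up for `train` unless you add them to `params`.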

The `train` stage also defines `outs`, which tells DVC where the stage writes
the trained model so that the file can be tracked. Meanwhile, the `evaluate`
stage has a `metrics` section that defines what DVC will use for metrics when
we're ready to compare experiments.
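Linking stages through `deps` and `outs` also means that a change only needs to
rerun the stages downstream of it. The sketch below is a deliberately simplified
model of that behavior, not DVC's implementation. It ignores `params`, assumes
that a stage that reruns always produces changed outputs, and expects the stages
to be listed in execution order. The helper names are hypothetical.

```python
# deps and outs copied from dvc.yaml, with the leading "./" dropped.
stages = {
    "train": {
        "deps": ["data/train.pkl", "src/train.py"],
        "outs": ["models/model.pkl"],
    },
    "evaluate": {
        "deps": ["data", "models/model.pkl", "src/evaluate.py"],
        "outs": ["scores.json"],
    },
}


def is_affected(dep: str, changed: set[str]) -> bool:
    """A dep is affected if it changed or is a directory containing a change."""
    return any(path == dep or path.startswith(dep + "/") for path in changed)


def stages_to_rerun(stages: dict, changed_paths: set[str]) -> list[str]:
    """Return the stages that need to run again, in pipeline order."""
    dirty = set(changed_paths)
    rerun = []
    for name, spec in stages.items():
        if any(is_affected(dep, dirty) for dep in spec["deps"]):
            rerun.append(name)
            # Assumption: a rerun stage changes everything it writes.
            dirty.update(spec["outs"])
    return rerun


print(stages_to_rerun(stages, {"src/evaluate.py"}))
print(stages_to_rerun(stages, {"src/train.py"}))
```

Editing only `src/evaluate.py` leaves the trained model untouched, so just the
`evaluate` stage is selected, while editing `src/train.py` selects both stages.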

The pipeline runs the same steps in the same order as the Jupyter notebook, but
with a trackable structure, since we're now executing Python scripts. When you run
`dvc exp run` to conduct an experiment, you can check out your metrics with
either the CLI command `dvc exp show` or with
[the DVC VS Code extension](https://marketplace.visualstudio.com/items?itemName=Iterative.dvc).

```dvctable
─────────────────────────────────────────────────────────────────────────────────────────────────────────────>
neutral:**Experiment** neutral:**Created** metric:**avg_prec** metric:**roc_auc** param:**train.seed** param:**train.n_est** param:**train.min_split** >
────────────────────────────────────────────────────────────────────────────────────────────────────────────>
**workspace** **-** **0.76681** **0.38867** **20210428** **300** **75** >
**jupyter-to-dvc** **Jul 18, 2022** **0.76681** **0.38867** **20210428** **300** **75** >
└── 4a070a7 [exp-b8925] Jul 18, 2022 0.76681 0.38867 20210428 300 75 >
────────────────────────────────────────────────────────────────────────────────────────────────────────────>
```

_with CLI tool_

![metrics in DVC VS Code extension](/uploads/images/2022-07-28/dvc-exp-in-vscode.png)

_with DVC VS Code extension_

You can also run experiments directly using the DVC VS Code extension.

![run experiments in DVC VS Code extension](/uploads/images/2022-07-28/experiments-in-extension.png)

dberenbaum marked this conversation as resolved.
## Conclusion

In this post, we covered how to convert your Jupyter notebook into a DVC
project. When your project gets to the point where you need to go back to old
experiments, it's probably time to consider using something more advanced than
Jupyter notebooks. Keeping track of data versions across experiments, along with
the code that was used to run them, can get messy quickly, so it's good to know
about tools that can make it easier for you. If you want to learn more about
experiment reproducibility and how to handle that with DVC, you should check out
our in-depth [Iterative tools course](https://learn.iterative.ai/)!