---
title: Converting a Jupyter Notebook to a DVC Project
date: 2022-07-28
description: >
  Working with notebooks is common in machine learning. That's why we're
  covering some tools that make it easy to do more with a complex project.
descriptionLong: >
  Once you've run some experiments in a Jupyter notebook, you know that you
  can't save each experiment. Now, if you're using the Jupyter VS Code
  extension, we can show you how to make those experiments reproducible with
  the addition of the DVC VS Code extension.
picture: 2022-07-28/jupyter-to-dvc.png
pictureComment: Using the DVC VS Code Extension with a Jupyter Notebook
author: milecia_mcgregor
commentsUrl: https://discuss.dvc.org/t/syncing-data-to-aws-s3/1192
tags:
  - MLOps
  - DVC
  - Git
  - VS Code
  - Jupyter Notebooks
---

For many machine learning engineers, the starting point of a project is a
Jupyter notebook. This is fine for running a few experiments, but there comes a
point where you need to scale the project to accommodate hundreds or even
thousands more experiments. These experiments for your model will include
different hyperparameter values, different code, and potentially different
resources. It will be important to track the experiments you run so that when
you find an exceptional model, you'll be able to reproduce it and get it ready
for production.

In this tutorial, we're going to start a project with a Jupyter notebook in VS
Code. Then we'll convert it to a DVC pipeline to make experiments reproducible
and use the DVC VS Code extension to run new experiments and compare them all.
[Here's the project](https://github.com/iterative/stale-model-example/tree/jupyter-to-dvc)
we'll be working with.

## Start training with the notebook

Many times you'll start a machine learning (ML) project with a few cells in a
notebook just to test out some thoughts you have. So you might have a notebook
where you set some hyperparameters, load your data, train a model, and evaluate
its metrics. Then you might add more cells to save the model, run comparisons
with other models, or keep different versions of the same cells as new ideas
come to you. That's similar to what we're doing in the
`bicycle_experiments.ipynb` file.

![Jupyter notebook cells](/uploads/images/2022-07-28/jupyter-notebook.png)
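
In code form, the cells look roughly like this (a condensed sketch based on the
scripts we'll write below, not a verbatim copy of the notebook):

```python
# Cell: Get params
import yaml

params = yaml.safe_load(open("params.yaml"))["train"]

# Cell: Load training data
import pickle5 as pickle

with open("data/train.pkl", "rb") as fd:
    matrix = pickle.load(fd)

# Cell: Train model
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=params["n_est"],
    min_samples_split=params["min_split"],
    random_state=params["seed"],
)
clf.fit(matrix.iloc[:, 1:11].values, matrix.iloc[:, 11].values)
```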

We have all of the cells in place so we can start running experiments. This is
usually fine for training models for a while. Then it turns into a situation
where you have cells all over the place and some aren't useful after a certain
point, but they stay in the notebook, adding clutter and noise.

Eventually, you'll likely find a great model with your notebook experiments,
but you'll have no idea which cells you ran or which data was used to train it.

That makes reproducing the experiment impossible, and you're left with a great
model you may not be able to use in production. Once you reach the point where
you are trying to reproduce models or compare metrics from multiple
experiments, it might make sense to look at a data versioning and model
experiment tracking tool like DVC.

## Refactor the Jupyter notebook to Python scripts

We're going to take the existing Jupyter notebook and break the cells out into
files and stages that DVC tracks for you. First, we'll create a `train.py` file
to handle the model training stage of the experiment. This file will have the
`Get params`, `Load training data`, `Train model`, and `Save model` cells from
the earlier notebook.

```python
# train.py

import os
import pickle5 as pickle
import sys

import yaml
from sklearn.ensemble import RandomForestClassifier

# Get params: read the training hyperparameters from params.yaml
params = yaml.safe_load(open("params.yaml"))["train"]

# The data directory and the model output path come in as CLI arguments
input = sys.argv[1]
output = sys.argv[2]
seed = params["seed"]
n_est = params["n_est"]
min_split = params["min_split"]

# Load training data
with open(os.path.join(input, "train.pkl"), "rb") as fd:
    matrix = pickle.load(fd)

labels = matrix.iloc[:, 11].values
x = matrix.iloc[:, 1:11].values

# Train model
clf = RandomForestClassifier(
    n_estimators=n_est, min_samples_split=min_split, n_jobs=2, random_state=seed
)

clf.fit(x, labels)

# Save model
with open(output, "wb") as fd:
    pickle.dump(clf, fd)
```
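
For this to work, there needs to be a `params.yaml` file at the root of the
project. Based on how `train.py` reads it and the hyperparameter values you'll
see in the experiment table later in this post, it looks something like this:

```yaml
# params.yaml
train:
  seed: 20210428
  n_est: 300
  min_split: 75
```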

Next, we'll make an `evaluate.py` file that takes a saved model and gets the
metrics for how well it performs. This file will have the `Set test variables`,
`Load model and test data`, `Get model predictions`,
`Calculate model performance metrics`, and `Save model performance metrics`
notebook cells.

```python
# evaluate.py

import json
import math
import os
import pickle5 as pickle
import sys

import numpy as np
import sklearn.metrics as metrics

# Set test variables: file paths come in as CLI arguments
model_file = sys.argv[1]
test_file = os.path.join(sys.argv[2], "test.pkl")
scores_file = sys.argv[3]
prc_file = sys.argv[4]
roc_file = sys.argv[5]

# Load model and test data
with open(model_file, "rb") as fd:
    model = pickle.load(fd)

with open(test_file, "rb") as fd:
    matrix = pickle.load(fd)

x = matrix.iloc[:, 1:11].values

# Get model predictions, replacing missing feature values with 0 first
cleaned_x = np.where(np.isnan(x), 0, x)
labels_pred = model.predict(cleaned_x)

predictions_by_class = model.predict_proba(cleaned_x)
predictions = predictions_by_class[:, 1]

# Calculate model performance metrics
precision, recall, prc_thresholds = metrics.precision_recall_curve(
    labels_pred, predictions, pos_label=1
)
fpr, tpr, roc_thresholds = metrics.roc_curve(
    labels_pred, predictions, pos_label=1
)

avg_prec = metrics.average_precision_score(labels_pred, predictions)
roc_auc = metrics.roc_auc_score(labels_pred, predictions)

# Keep at most 1000 evenly spaced points from the precision-recall curve
nth_point = math.ceil(len(prc_thresholds) / 1000)
prc_points = list(zip(precision, recall, prc_thresholds))[::nth_point]

# Save model performance metrics
with open(scores_file, "w") as fd:
    json.dump({"avg_prec": avg_prec, "roc_auc": roc_auc}, fd, indent=4)

with open(prc_file, "w") as fd:
    json.dump(
        {
            "prc": [
                {"precision": p, "recall": r, "threshold": t}
                for p, r, t in prc_points
            ]
        },
        fd,
        indent=4,
    )

with open(roc_file, "w") as fd:
    json.dump(
        {
            "roc": [
                {"fpr": fp, "tpr": tp, "threshold": t}
                for fp, tp, t in zip(fpr, tpr, roc_thresholds)
            ]
        },
        fd,
        indent=4,
    )
```
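
After a run, `scores.json` holds the two summary metrics that DVC will compare
across experiments. With the model from the experiment shown later in this
post, its contents would look like this:

```json
{
    "avg_prec": 0.76681,
    "roc_auc": 0.38867
}
```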

Now you have all of the steps that you executed in your Jupyter notebook in a
couple of files that you can easily edit and track across all of your
experiments. This is a great time to commit these changes to your Git repo with
the following commands:

```cli
$ git add train.py evaluate.py
$ git commit -m "converted notebook to Python"
```

## Create the DVC pipeline

Now we can create a DVC pipeline that executes these scripts to record the
code, data, and metrics for each of your experiments. If you look in the
project's `dvc.yaml`, you'll see the stages we execute on an experiment run.

```yaml
stages:
  train:
    cmd: python src/train.py ./data/ ./models/model.pkl
    deps:
      - ./data/train.pkl
      - ./src/train.py
    params:
      - train.seed
      - train.n_est
      - train.min_split
    outs:
      - ./models/model.pkl
  evaluate:
    cmd:
      python ./src/evaluate.py ./models/model.pkl ./data scores.json prc.json
      roc.json
    deps:
      - ./data
      - ./models/model.pkl
      - ./src/evaluate.py
    metrics:
      - scores.json:
          cache: false
```

The `stages` tell DVC which steps you want to execute and what should happen in
each step. Usually, you'll execute a script or a command in each stage that may
link to the next stage in the pipeline via the `outs`. We only have two stages
in this pipeline: a `train` stage that handles the model training and outputs
the model and an `evaluate` stage that takes the model and stores some metrics
about it.
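
If you want to check how DVC has wired these stages together, the `dvc dag`
command prints the pipeline graph. For this two-stage pipeline, the output
looks roughly like this:

```cli
$ dvc dag
+-------+
| train |
+-------+
    *
    *
    *
+----------+
| evaluate |
+----------+
```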

Each of these stages has a `cmd` that executes the Python scripts we wrote with
the required arguments. They both have dependencies defined in `deps` that let
DVC know what needs to be available before a stage starts running. For example,
`./models/model.pkl` is an output of the `train` stage and a dependency of the
`evaluate` stage, so the evaluation will only run once training has completed.
The `train` stage has some `params` that represent the hyperparameter values we
want to use in the current experiment. This is how DVC is able to track the
values used in each experiment.
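
Because those hyperparameters are declared as `params`, you can override them
for a single experiment without editing any files. For example (the new value
here is just for illustration):

```cli
$ dvc exp run --set-param train.n_est=400
```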

The `train` stage also has `outs` defined, which tells DVC to take the model
generated at the end of the experiment and save it to this location. Meanwhile,
the `evaluate` stage has a `metrics` section that defines what DVC will use for
metrics when we're ready to compare experiments.

This runs everything in the same order that the Jupyter notebook did, with a
trackable structure since we're executing Python scripts now. When you run
`dvc exp run` to conduct an experiment, you can check out your metrics with
either the CLI command `dvc exp show` or with
[the DVC VS Code extension](https://marketplace.visualstudio.com/items?itemName=Iterative.dvc).

```dvctable
 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  neutral:**Experiment**          neutral:**Created**     metric:**avg_prec**   metric:**roc_auc**   param:**train.seed**   param:**train.n_est**   param:**train.min_split**
 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  **workspace**                   **-**                   **0.76681**           **0.38867**          **20210428**           **300**                 **75**
  **jupyter-to-dvc**              **Jul 18, 2022**        **0.76681**           **0.38867**          **20210428**           **300**                 **75**
  └── 4a070a7 [exp-b8925]         Jul 18, 2022            0.76681               0.38867              20210428               300                     75
 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
```

_with the CLI tool_

![metrics in DVC VS Code extension](/uploads/images/2022-07-28/dvc-exp-in-vscode.png)

_with the DVC VS Code extension_

You can also run experiments directly using the DVC VS Code extension.

![run experiments in DVC VS Code extension](/uploads/images/2022-07-28/experiments-in-extension.png)

## Conclusion

In this post, we covered how to convert your Jupyter notebook into a DVC
project. When your project gets to the point where you need to go back to old
experiments, it's probably time to consider using something more advanced than
Jupyter notebooks. Keeping track of data versions across experiments, along
with the code that was used to run them, can get messy quickly, so it's good to
know about tools that can make it easier for you. If you want to learn more
about experiment reproducibility and how to handle that with DVC, you should
check out our in-depth [Iterative tools course](https://learn.iterative.ai/)!