Add mlem get started #109

Merged
merged 26 commits into from
Jul 21, 2022
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -45,7 +45,7 @@ repos:
       - id: isort
         args: [--profile=black, -l=99]
   - repo: https://github.com/ambv/black
-    rev: 21.5b1
+    rev: 22.3.0
     hooks:
       - id: black
         args: [-l, '99']
2 changes: 2 additions & 0 deletions example-mlem-get-started/.gitattributes
@@ -0,0 +1,2 @@
*.dvc linguist-language=YAML
dvc.lock linguist-language=YAML
4 changes: 4 additions & 0 deletions example-mlem-get-started/.gitignore
@@ -0,0 +1,4 @@
# Custom
*.zip
/tmp
build/
37 changes: 37 additions & 0 deletions example-mlem-get-started/code/README.md
@@ -0,0 +1,37 @@
# MLEM Get Started

This is an auto-generated repository for use in the MLEM
[Get Started](https://mlem.ai/doc/get-started) guide, a quick step-by-step
introduction to basic MLEM concepts.

🐛 Please report any issues found in this project here -
[example-repos-dev](https://github.com/iterative/example-repos-dev).

## Installation

Python 3.6+ is required to run code from this repo.

```console
$ git clone https://github.com/iterative/example-mlem-get-started
$ cd example-mlem-get-started
```

Now let's install the requirements. But before we do that, we **strongly**
recommend creating a virtual environment with a tool such as
[virtualenv](https://virtualenv.pypa.io/en/stable/):

```console
$ virtualenv -p python3 .env
$ source .env/bin/activate
$ pip install -r src/requirements.txt
```

## Existing stages

The Git tags in this project reflect the sequence of actions that are run in
the MLEM [get started](https://mlem.ai/doc/get-started) guide. Feel free to
check out any of them and experiment with MLEM commands in a ready-made
playground.

[comment]: <> (TODO)
- `0-git-init`: Empty Git repository initialized.
18 changes: 18 additions & 0 deletions example-mlem-get-started/code/src/evaluate.py
@@ -0,0 +1,18 @@
import json

from mlem.api import apply
from mlem.core.metadata import load
from sklearn import metrics


def main():
    y_pred = apply("rf", "test_x.csv", method="predict_proba")
    y_true = load("test_y.csv")
    roc_auc = metrics.roc_auc_score(y_true, y_pred, multi_class="ovr")

    with open("metrics.json", "w") as fd:
        json.dump({"roc_auc": roc_auc}, fd, indent=4)


if __name__ == "__main__":
    main()
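
For readers unfamiliar with the metric, the one-vs-rest ROC AUC that `evaluate.py` requests via `multi_class="ovr"` can be sketched in plain Python. This is an illustrative reimplementation, not scikit-learn's code: it assumes macro averaging over classes and uses the pairwise-ranking definition of AUC. The helper names and the toy data are hypothetical.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for is_pos, s in zip(labels, scores) if is_pos]
    neg = [s for is_pos, s in zip(labels, scores) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ovr_macro_auc(y_true, proba):
    """Average the per-class AUC, treating each class in turn as 'positive'."""
    n_classes = len(proba[0])
    aucs = []
    for c in range(n_classes):
        labels = [y == c for y in y_true]
        scores = [row[c] for row in proba]
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / n_classes


# Toy data (hypothetical): three classes, one probability row per sample.
y_true = [0, 1, 2, 1]
proba = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.2, 0.3],
]
score = ovr_macro_auc(y_true, proba)
```

In this toy case classes 0 and 2 are ranked perfectly, while the last sample of class 1 gets a low class-1 probability, which pulls the averaged score below 1.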
16 changes: 16 additions & 0 deletions example-mlem-get-started/code/src/prepare.py
@@ -0,0 +1,16 @@
from mlem.api import save
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def main():
    data, y = load_iris(return_X_y=True, as_frame=True)
    data["target"] = y
    train_data, test_data = train_test_split(data, random_state=42)
    save(train_data, "train.csv")
    save(test_data.drop("target", axis=1), "test_x.csv")
    save(test_data[["target"]], "test_y.csv")


if __name__ == "__main__":
    main()
4 changes: 4 additions & 0 deletions example-mlem-get-started/code/src/requirements.txt
@@ -0,0 +1,4 @@
pandas
scikit-learn
scipy
dvc[azure]
24 changes: 24 additions & 0 deletions example-mlem-get-started/code/src/train.py
@@ -0,0 +1,24 @@
from mlem.api import load, save
from sklearn.ensemble import RandomForestClassifier


def main():
    df = load("train.csv")
    data = df.drop("target", axis=1)
    rf = RandomForestClassifier(
        n_jobs=2,
        random_state=42,
    )
    rf.fit(data, df.target)

    save(
        rf,
        "rf",
        tmp_sample_data=data,
        tags=["random-forest", "classifier"],
        description="Random Forest Classifier",
    )


if __name__ == "__main__":
    main()
156 changes: 156 additions & 0 deletions example-mlem-get-started/generate.sh
@@ -0,0 +1,156 @@
#!/bin/bash
# See https://mlem.ai/doc/get-started

# Setup script env:
# e Exit immediately if a command exits with a non-zero exit status.
# u Treat unset variables as an error when substituting.
# x Print commands and their arguments as they are executed.
set -eux

HERE="$( cd "$(dirname "$0")" ; pwd -P )"
REPO_NAME="example-mlem-get-started"

BUILD_PATH="$HERE/build"

# Create the build dir first so that pushd cannot fail under `set -e`.
mkdir -p $BUILD_PATH
pushd $BUILD_PATH
if [ ! -d "$BUILD_PATH/.venv" ]; then
  virtualenv -p python3 .venv
  export VIRTUAL_ENV_DISABLE_PROMPT=true
  source .venv/bin/activate
  echo '.venv/' > .gitignore
  pip install gitpython
  pip install "git+https://github.com/iterative/mlem#egg=mlem[all]" --use-deprecated=legacy-resolver
  pip install -r $HERE/code/src/requirements.txt
fi
popd

source $BUILD_PATH/.venv/bin/activate

REPO_PATH="$HERE/build/$REPO_NAME"

if [ -d "$REPO_PATH" ]; then
  echo "Repo $REPO_PATH already exists, please remove it first."
  exit 1
fi

TOTAL_TAGS=15
STEP_TIME=100000
BEGIN_TIME=$(( $(date +%s) - ( ${TOTAL_TAGS} * ${STEP_TIME}) ))
export TAG_TIME=${BEGIN_TIME}
export GIT_AUTHOR_DATE=${TAG_TIME}
export GIT_COMMITTER_DATE=${TAG_TIME}
tick(){
  export TAG_TIME=$(( ${TAG_TIME} + ${STEP_TIME} ))
  export GIT_AUTHOR_DATE=${TAG_TIME}
  export GIT_COMMITTER_DATE=${TAG_TIME}
}

export GIT_AUTHOR_NAME="Mikhail Sveshnikov"
export GIT_AUTHOR_EMAIL="[email protected]"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME"
export GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"



mkdir -p $REPO_PATH
pushd $REPO_PATH

git init -b main
cp $HERE/code/README.md .
cp $HERE/.gitattributes .
git add .
tick
git commit -m "Initialize Git repository"
git tag -a "0-git-init" -m "Git initialized."

mlem init
tick
git add .mlem/config.yaml
git commit -m "Initialize MLEM project"
git tag -a "1-mlem-init" -m "MLEM initialized."


cp $HERE/code/src/requirements.txt .
cp $HERE/code/src/prepare.py .

Member: If you do it similarly to the DVC get-started, you need to bundle this code and share it from S3 so that people can curl or wget it.

Contributor Author: Why not get it from raw.github.com?

Member: raw.github.com works as well; I don't see any particular problems with that. I would bundle everything as a single tar, though (or maybe GitHub has a way to download revisions as a tar with curl, I don't know).

Contributor Author: [comment body not captured]

Member: It contains some extra files that are not needed initially, though?

Contributor Author: It contains the whole repo, yes. But I don't see any point in maintaining them separately, since they are very small and simple and will also be available in the docs themselves.

Member: It's not about the size, it's about the workflow. In the dvc.org get started you would have a command:

wget code.zip
unzip code.zip
git add -a -m "add code"

https://dvc.org/doc/start/data-pipelines#expand-to-download-example-code

Including any extra files there would confuse readers and ruin the flow of the document.

So, your call here: if you want something similar to DVC, then consider creating a clean code package.

Contributor Author: I will consider this, thanks.

Member (@shcheklein, Apr 21, 2022): Thinking about it a second time, I remember now why I (most likely) didn't use GitHub: I wanted to keep the command as short and nice as possible, e.g. wget https://code.dvc.org/get-started/code.zip
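
The "clean code package" idea from this thread could be sketched as follows. This is only an illustration, not something the PR implements: the function name `build_code_zip` and the file list are assumptions, and the demo runs against a throwaway directory with placeholder files.

```python
import tempfile
import zipfile
from pathlib import Path


def build_code_zip(src_dir, archive_path, names):
    """Write only the listed files from src_dir into a flat zip archive."""
    src_dir = Path(src_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            # arcname drops the directory prefix so files unzip into the cwd.
            zf.write(src_dir / name, arcname=name)
    return archive_path


# Demo in a temporary directory with placeholder file contents.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    for name in ("prepare.py", "train.py", "evaluate.py", "notes.txt"):
        (src / name).write_text("# placeholder\n")

    wanted = ["prepare.py", "train.py", "evaluate.py"]  # notes.txt is left out
    archive = build_code_zip(src, Path(tmp) / "code.zip", wanted)
    with zipfile.ZipFile(archive) as zf:
        packaged = sorted(zf.namelist())
```

Listing the files explicitly keeps extras out of the archive, which is the concern raised above about the flow of the guide.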

git add .
python prepare.py
git add .mlem/dataset
tick
git commit -m "Create data stage"
git tag -a "2-prepare-stage" -m "Data created."


cp $HERE/code/src/train.py .
python train.py
git add .mlem/model train.py
tick
git commit -m "Create train stage"
git tag -a "3-train-stage" -m "Model trained."


cp $HERE/code/src/evaluate.py .
python evaluate.py
git add metrics.json evaluate.py
tick
git commit -m "Evaluate model"
git tag -a "4-eval-stage" -m "Metrics calculated"

git rm -r --cached .mlem/model/rf .mlem/dataset/train.csv .mlem/dataset/test_x.csv .mlem/dataset/test_y.csv
git commit -m "stop tracking data"

dvc init
dvc remote add myremote azure://example-mlem --default
mlem config set default_storage.type dvc
python train.py
dvc add .mlem/model/rf .mlem/dataset/train.csv .mlem/dataset/test_x.csv .mlem/dataset/test_y.csv

Member: It's not very common for DVC to track each file independently (especially datasets). Is this intended?

Contributor Author: There are .mlem files in this dir. Can we do something like echo *.mlem > .dvcignore && dvc add data?

Contributor Author (@mike0sv, Apr 20, 2022): The correct pattern turned out to be /**/?*.mlem. By the way, does dvcignore support ! negation?

Contributor Author: This does not seem to work: we can't dvc add .mlem/dataset/ because we lose the .mlem files, even though they are ignored. So I'm going with dvc add .mlem/dataset/*.csv. It would be great to have some way to "add everything except dvc-ignored".

Member: Regarding "does dvcignore support ! negation?": yes, it does support it. It should be working; otherwise it's a bug, to my mind.

Contributor Author (@mike0sv, Apr 20, 2022):

$ cat .dvcignore
/**/*.mlem
!/.mlem/
$ dvc check-ignore .mlem -d
.dvcignore:2:!/.mlem/   .mlem

I was trying to make an exception for the .mlem dir in the root, but this is what I got.

Member: Question: why do you need an exception like this? (curious) Does the same combination work for gitignore?

Contributor Author (@mike0sv, Apr 20, 2022): My goal is to exclude all *.mlem files, but not the .mlem dir in the repo root. But a simple /**/*.mlem adds the .mlem dir to the ignore list. With Git this exclusion works.
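
The pattern problem discussed in this thread can be reproduced with Python's `fnmatch`, where `*` may match an empty string and a leading dot gets no special treatment. This is only a sketch on the last path component: it does not model DVC's or Git's full ignore semantics (`**`, anchoring, `!` negation), and the file names are examples.

```python
from fnmatch import fnmatchcase

names = [".mlem", "rf.mlem", "train.csv.mlem", "train.csv"]

# `*` may match nothing, so `*.mlem` also catches the `.mlem` directory name.
star = [n for n in names if fnmatchcase(n, "*.mlem")]

# `?*` requires at least one character before `.mlem`, sparing the directory.
qstar = [n for n in names if fnmatchcase(n, "?*.mlem")]
```

Under these assumptions the `?*.mlem` form needs no negation rule at all, which matches the pattern the author reports settling on earlier in the thread.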

git add .mlem
tick
git commit -m "Switch to dvc storage"
git tag -a "5-switch-to-dvc" -m "Switched to DVC"
dvc push

popd

unset TAG_TIME
unset GIT_AUTHOR_DATE
unset GIT_COMMITTER_DATE
unset GIT_AUTHOR_NAME
unset GIT_AUTHOR_EMAIL
unset GIT_COMMITTER_NAME
unset GIT_COMMITTER_EMAIL

echo "`cat <<EOF-

The Git repo generated by this script is intended to be published on
https://github.com/iterative/example-mlem-get-started. Make sure the GitHub repo
exists first and that you have appropriate write permissions.

To create it with https://cli.github.com/, run:

gh repo create iterative/example-mlem-get-started --public \
-d "Get Started MLEM project" -h "https://mlem.ai/doc/get-started"

Run these commands to force push it:

cd build/example-mlem-get-started
git remote add origin https://github.com/iterative/example-mlem-get-started
git push --force origin main
git push --force origin --tags
cd ../../

Run these to drop and then rewrite the experiment references on the repo:

git ls-remote origin "refs/exps/*" | awk '{print \\$2}' | xargs -n 1 git push -d origin
dvc exp list --all --names-only | xargs -n 1 dvc exp push origin


Finally, return to the directory where you started:

cd ../..

You may remove the generated repo with:

rm -fR build

`"
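
The commit-date bookkeeping in `generate.sh` (`TOTAL_TAGS`, `STEP_TIME`, `BEGIN_TIME` and `tick`) can be sketched in Python to show which timestamps the tagged commits receive. The function name `commit_times` and the fixed `now` value are assumptions made for a reproducible illustration; the script itself uses the current time from `date +%s`.

```python
import time


def commit_times(total_tags, step_seconds, n_commits, now=None):
    """Return the timestamps that successive tick calls would produce."""
    now = int(time.time()) if now is None else now
    tag_time = now - total_tags * step_seconds  # BEGIN_TIME in the script
    times = []
    for _ in range(n_commits):
        tag_time += step_seconds  # what tick() does before each commit
        times.append(tag_time)
    return times


# Fixed `now` so the result is reproducible; six ticks cover tags 0 through 5.
times = commit_times(total_tags=15, step_seconds=100000, n_commits=6, now=2_000_000_000)
```

Because `TOTAL_TAGS` is larger than the number of ticks, every generated commit is dated in the past, spaced `STEP_TIME` seconds apart.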