This repository has been archived by the owner on Oct 16, 2024. It is now read-only.

GTO docs #199

Merged
merged 42 commits into from
Nov 23, 2022
Changes from 12 commits
Commits
45d5377
first drafts
aguschin Oct 24, 2022
5ba326e
yarn format
aguschin Oct 24, 2022
6a33a05
remove mlem copy-paste parts
aguschin Oct 24, 2022
faa1b53
write use-cases index page. add MR page in use-cases
aguschin Oct 24, 2022
f2f1527
updating limitations
aguschin Oct 24, 2022
217c675
add use-cases (forgot to commit), add todos to user-guide
aguschin Oct 24, 2022
7895da1
fix sidebar
aguschin Nov 1, 2022
803e7cf
Apply suggestions from code review
aguschin Nov 1, 2022
826f278
process feedback
aguschin Nov 1, 2022
e3b90bd
add Git tags format
aguschin Nov 1, 2022
eff9755
refine Git tags format
aguschin Nov 1, 2022
65f66ee
write page: artifacts.yaml metafile
aguschin Nov 1, 2022
129b17f
user-guide/{artifacts,config,downstream}
aguschin Nov 2, 2022
c596dc9
fix lint
aguschin Nov 2, 2022
9f42ace
fixes on feedback
aguschin Nov 2, 2022
931dac1
Apply suggestions from code review
aguschin Nov 10, 2022
76dd044
command-reference index page
aguschin Nov 10, 2022
c4485ae
merge main
aguschin Nov 10, 2022
f2e7af0
first version of command reference
aguschin Nov 10, 2022
0f5b833
lightweight fixes
aguschin Nov 10, 2022
9eab27f
update annotate, assign, check-ref
aguschin Nov 11, 2022
65c9f51
write all other commands
aguschin Nov 11, 2022
e5a75b3
Apply suggestions from code review
aguschin Nov 17, 2022
570b77a
return register to GS
aguschin Nov 17, 2022
d747e56
simplify the docs structure, hide 'why gto?'
aguschin Nov 17, 2022
21930be
shorten command-reference/index
aguschin Nov 17, 2022
0b25aec
hide next steps in gs
aguschin Nov 17, 2022
68f2b98
reveal why-gto section
aguschin Nov 17, 2022
cee3374
Merge branch 'main' into gto-docs
aguschin Nov 17, 2022
0c0931e
Update theme to 0.1.24 for args linker fix
rogermparent Nov 17, 2022
7da4618
Merge pull request #227 from iterative/gto-docs-theme-0.1.24
yathomasi Nov 18, 2022
f374516
fix links
aguschin Nov 21, 2022
6ddfb70
fix lint
aguschin Nov 21, 2022
0f70d82
remove some sections from sidebar
aguschin Nov 21, 2022
98a06c4
fix dvc link
aguschin Nov 21, 2022
44d9e2f
Merge branch 'main' into gto-docs
aguschin Nov 22, 2022
2fb378e
reset package.json and yarn.lock
aguschin Nov 22, 2022
d8fec09
fix some feedback about CI/CD in GS
aguschin Nov 22, 2022
892c1f7
fix some feedback about CI/CD in GS
aguschin Nov 22, 2022
2ab8a9b
Update content/docs/gto/get-started.md
jorgeorpinel Nov 22, 2022
8f9b6a3
Apply suggestions from code review
aguschin Nov 23, 2022
c7ae5ff
yarn format
aguschin Nov 23, 2022
10 changes: 10 additions & 0 deletions content/docs/gto/command-reference/index.md
@@ -0,0 +1,10 @@
# Using GTO Commands

GTO is a command line tool. Here, we provide the specifications, complete
Contributor Author

@aguschin aguschin Oct 25, 2022

Thanks @francesco086. This is a good point of view. I'm opening a thread based on your comment to keep the discussion about it in a single place.

Minor: If you check out DVC docs, you'll see there are docs for Studio and DVClive. We can put this "GTO documentation" there or here (I used mlem.ai cause it was easier for me). Or to a separate website, like iterative.ai/doc maybe? Not sure. We need some place to keep GTO docs anyway.

Major: explaining how to build a registry with DVC+GTO+MLEM. Good question where to put that. In this PR you can see I was going to put answers in /doc/gto/user-guide. I guess the Tutorial format would be the best for this, and we could add it to each product involved under Use Cases (e.g. here it can be next or instead of "Pure MLEM Model registry"):
image

The other option is to create a GS with this - but that would be way too heavy for Get Started. I guess a Tutorial or blog post serves the purpose better.

Another place to have this is Model Registry page in Studio docs. But, not sure yet how UI (Studio) and CLI (GTO+DVC+MLEM Tutorial) could co-exist here. Maybe cross-links are a better approach than having this in Studio docs.

Again, good topic to think about 🤔 We also leave CML out of the picture above, it also can be a part of a MR...

@tapadipti, have you had any discussion about setting up a DVC+GTO+MLEM Tutorial to complement Studio docs? Looks like it's much needed, but I can't see that we ever created something like that.

Contributor

@jorgeorpinel jorgeorpinel Oct 26, 2022

I think it's a good place to put GTO docs if we want docs beyond a CMD/API ref (otherwise we could do with a README and possibly a site like https://docs.iterative.ai/dvc-task/reference/dvc_task/).

> Major: explaining how to build a registry with DVC+GTO+MLEM. Good question where to put that...
> Tutorial format would be the best for this

We mention it very high-level in https://mlem.ai/doc/use-cases/model-registry now. And there's the https://iterative.ai/model-registry solution page separately. I'm not sure how much we want to go into the details of this 3-way integration. May be a good blog topic indeed. Let's create a separate issue to discuss that, though?


https://iterative.ai/model-registry should have links to all relevant docs pages. But since the docs can't reside there, Studio docs look like the next best place to me for explaining how to build a registry with DVC+GTO+MLEM. We could create a Use cases section. But depending on how much and what content we need, a blog post may also suffice. And docs specific to the GTO cli should definitely be separate.

> If you check out DVC docs, you'll see there are docs for Studio and DVClive.

This is to be changed. We will host Studio docs separately in its own docs site (like CML) - although we don't have dates for this yet.

Contributor Author

Ok, so I'm trying to draft that blog post - please see https://www.notion.so/iterative/Tutorial-Model-Registry-in-Git-with-DVC-MLEM-and-GTO-af124368ce9f4523a568a7e1875c7af3 - high-level feedback would be appreciated.


Thanks @aguschin. I've left some comments in the draft blog post.

descriptions, and comprehensive usage examples for different `gto` commands.

For a list of all commands, type `gto -h`

## Typical GTO workflow

...
171 changes: 171 additions & 0 deletions content/docs/gto/get-started.md
@@ -0,0 +1,171 @@
---
description:
'Learn how you can use GTO to create Artifact Registry in Git repository'
---

# Get Started

GTO helps you build an Artifact Registry on top of a Git repository (with a
special case of Machine Learning Model Registry). You can register relevant
versions of your files (e.g. ML model releases) and assign them to different
deployment environments (testing, shadow, production, etc.). Git-native
mechanisms are used, so you can automate the delivery of your ML project with
CI/CD, and adopt a GitOps approach in general.

This Get Started walks you through basic GTO concepts and the actions you are
likely to perform in an Artifact Registry.

## Showing the current state

Assuming GTO is already [installed](/doc/gto/install) in your active Python
environment, let's clone the example repo:

```cli
$ git clone https://github.com/iterative/example-gto
$ cd example-gto
```

This repo represents a simple example of a Machine Learning Model Registry.
Let's review it:

```cli
$ gto show
╒══════════╤══════════╤════════╤═════════╤════════════╕
│ name     │ latest   │ #dev   │ #prod   │ #staging   │
╞══════════╪══════════╪════════╪═════════╪════════════╡
│ churn    │ v3.1.1   │ v3.1.1 │ v3.0.0  │ v3.1.0     │
│ segment  │ v0.4.1   │ v0.4.1 │ -       │ -          │
│ cv-class │ v0.1.13  │ -      │ -       │ -          │
╘══════════╧══════════╧════════╧═════════╧════════════╛
```

Contributor

👍🏼 👍🏼 👍🏼

I kind of like that we start by showing the end-result! It's a good way to deliver the value proposition quickly in here (main purpose of this doc).

Here we have 3 models: `churn`, `segment` and `cv-class`. The latest version of
each is shown in the `latest` column. The latest version is the one with the
greatest [SemVer](https://semver.org) number.
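
To make the "greatest SemVer" rule concrete, here is a minimal Python sketch of such a selection. It is an illustration only, not GTO's implementation: the `semver_key` and `latest` helpers are hypothetical names, and the sketch assumes plain `vMAJOR.MINOR.PATCH` versions without pre-release or build suffixes.

```python
import re

# Assumption: versions look like "v1.2.3" (no pre-release or build metadata).
SEMVER = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def semver_key(version):
    """Turn 'v0.1.13' into (0, 1, 13) so versions compare numerically."""
    match = SEMVER.match(version)
    if match is None:
        raise ValueError(f"not a vMAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def latest(versions):
    """Return the version with the greatest (major, minor, patch) tuple."""
    return max(versions, key=semver_key)


print(latest(["v0.1.2", "v0.1.13", "v0.1.9"]))  # v0.1.13
```

Comparing the parts as integers matters: compared as plain strings, `v0.1.9` would sort above `v0.1.13`.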

Model versions can be promoted to different stages. Here we have 3 of them:
`dev`, `prod` and `staging`. When no version of a model has been promoted to a
stage, we see `-` in the field.

## Registering versions and assigning stages

GTO can [register versions](/doc/gto/command-reference/register) of artifacts
and [assign stages to them](/doc/gto/command-reference/assign). Both features
work in a similar way, so let's walk through only one of them here.

Let's assume the version `v0.1.13` of `cv-class` looks very promising, and now
we want to promote it to `dev` to test it:

```cli
$ gto assign cv-class --version v0.1.13 --stage dev
Created git tag 'cv-class#dev#1' that assigns stage to version 'v0.1.13'
To push the changes upstream, run:
git push origin cv-class#dev#1
```

GTO created a Git tag with a special format that contains the instruction to
assign a stage to a version. We could push it to the Git remote to start CI, but
let's first make sure that it changed our Registry.
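
As an illustration of what such a tag encodes, the sketch below splits tag names into their parts. It is not GTO's own parser: `TagInfo` and `parse_tag` are hypothetical names, the `name#stage#N` pattern is taken from the output above, and the `name@vX.Y.Z` pattern for version tags and the reading of the trailing number as a counter are assumptions.

```python
import re
from typing import NamedTuple, Optional


class TagInfo(NamedTuple):
    artifact: str
    event: str                # "registration" or "assignment"
    version: Optional[str]    # set for registration tags only
    stage: Optional[str]      # set for assignment tags only
    counter: Optional[int]    # trailing number of an assignment tag (assumed meaning)


# Assumed tag formats: "name@v1.2.3" registers a version,
# "name#stage#N" assigns a stage.
VERSION_TAG = re.compile(r"^(?P<artifact>[^@#]+)@(?P<version>v\d+\.\d+\.\d+)$")
STAGE_TAG = re.compile(r"^(?P<artifact>[^@#]+)#(?P<stage>[^@#]+)#(?P<counter>\d+)$")


def parse_tag(tag):
    """Return a TagInfo for a registry-style tag, or None for any other tag."""
    match = VERSION_TAG.match(tag)
    if match:
        return TagInfo(match["artifact"], "registration", match["version"], None, None)
    match = STAGE_TAG.match(tag)
    if match:
        return TagInfo(match["artifact"], "assignment", None, match["stage"],
                       int(match["counter"]))
    return None


print(parse_tag("cv-class#dev#1"))
```

Note that a stage tag names the artifact and the stage but not the version, so the version has to be resolved from the commit the tag points to (an assumption about how GTO links the two).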

```cli
$ gto show
╒══════════╤══════════╤═════════╤═════════╤════════════╕
│ name     │ latest   │ #dev    │ #prod   │ #staging   │
╞══════════╪══════════╪═════════╪═════════╪════════════╡
│ churn    │ v3.1.1   │ v3.1.1  │ v3.0.0  │ v3.1.0     │
│ segment  │ v0.4.1   │ v0.4.1  │ -       │ -          │
│ cv-class │ v0.1.13  │ v0.1.13 │ -       │ -          │
╘══════════╧══════════╧═════════╧═════════╧════════════╛
```

The `gto show` output confirms our expectation.

## Acting downstream

The power of using Git tags to register versions and assign stages is simple: we
can act upon them in a well-known way, in CI/CD.

To see how it works, let's fork the
[example-gto repo](https://github.com/iterative/example-gto/fork) and push the
tag we just created to GitHub. For CI/CD to start, you'll need to enable
workflows on the "Actions" page of your fork.

<details>

### Step-by-step instruction

Fork the repo first. Make sure you uncheck "Copy the `main` branch only" to copy
Git tags as well:
<img width="877" alt="image" src="https://user-images.githubusercontent.com/6797716/199275275-439335f4-6f54-4cd7-910d-fc29ad3c095c.png">

Then enable workflows in your repo, for a Git tag to trigger CI:
<img width="869" alt="image" src="https://user-images.githubusercontent.com/6797716/199272682-dfd628bf-9599-4e85-a623-bf4a10c3d7e1.png">

</details>

Let's do the same thing we did locally, but for your remote repo. Don't forget
to replace the URL:

```cli
$ gto assign cv-class --version v0.1.13 --stage dev \
    --repo https://github.com/aguschin/example-gto
Created git tag 'cv-class#dev#1' that assigns stage to version 'v0.1.13'
Running `git push origin cv-class#dev#1`
Successfully pushed git tag cv-class#dev#1 on remote.
```

Now the CI/CD pipeline should start, and its output should show what was
detected: version `v0.1.13` of the `cv-class` artifact was assigned to the `dev`
stage. Using this information, the step `Deploy (act on assigning a new stage)`
was executed (while `Publish (act on registering a new version)` was skipped):

<details>

### CI/CD execution example

<img width="875" alt="image" src="https://user-images.githubusercontent.com/6797716/199276636-bf996ad3-7d9c-4100-9f3c-6444730e4d19.png">

If you want to see more CI examples, check out
[the example-repo](https://github.com/iterative/example-gto/actions).

</details>
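
The decision the workflow makes (deploy on a stage assignment, publish on a version registration) can be sketched in a few lines of Python. This is a hypothetical helper, not the example repo's actual workflow code: `ci_action` is an invented name, the tag patterns are assumed to be `name#stage#N` and `name@vX.Y.Z`, and `GITHUB_REF` is the variable GitHub Actions sets to `refs/tags/<tag>` for a tag push.

```python
import os

TAG_PREFIX = "refs/tags/"


def ci_action(git_ref):
    """Map a pushed Git ref to the CI step that should run.

    Assumed conventions: a tag containing '#' assigns a stage (deploy),
    a tag containing '@' registers a version (publish).
    """
    if not git_ref.startswith(TAG_PREFIX):
        return "skip"  # branch pushes and other refs are ignored
    tag = git_ref[len(TAG_PREFIX):]
    if "#" in tag:
        return "deploy"
    if "@" in tag:
        return "publish"
    return "skip"


if __name__ == "__main__":
    # Prints "skip" when run outside CI, because GITHUB_REF is not set.
    print(ci_action(os.environ.get("GITHUB_REF", "")))
```

A real pipeline would more likely rely on GTO itself (for example the `check-ref` command from the command reference) than on string matching; the sketch only shows the shape of the decision.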

## Next steps

Thanks for completing this Get Started!

- If you want to learn how to specify an artifact's metadata, such as `path`,
  `type` and `description`, check out the [User Guide](/doc/gto/user-guide).
- If you want to learn about using DVC to keep your artifact binaries in remote
  storage, check out the [DVC docs](https://dvc.org/doc).
- If you want to learn more about Studio, check out the
  [Studio docs](https://dvc.org/doc/studio).
- If you want to learn about using MLEM to deploy your model upon GTO stage
  assignments, check out the [MLEM docs](/doc/).

<!-- Adding a new artifact

We just saw how to commit a new ML model to the repo. It's saved under
`models/awesome.pkl`. Let's register the very first version of it.

```cli
$ gto register awesome
Created git tag 'awesome@v0.0.1' that registers version
To push the changes upstream, run:
git push origin awesome@v0.0.1
```

Nice! Let's see the registry state now:

```cli
$ gto show
╒══════════╤══════════╤════════╤═════════╤════════════╕
│ name     │ latest   │ #dev   │ #prod   │ #staging   │
╞══════════╪══════════╪════════╪═════════╪════════════╡
│ churn    │ v3.1.1   │ v3.1.1 │ v3.0.0  │ v3.1.0     │
│ segment  │ v0.4.1   │ v0.4.1 │ -       │ -          │
│ cv-class │ v0.1.13  │ -      │ -       │ -          │
│ awesome  │ v0.0.1   │ -      │ -       │ -          │
╘══════════╧══════════╧════════╧═════════╧════════════╛
``` -->
39 changes: 39 additions & 0 deletions content/docs/gto/index.md
@@ -0,0 +1,39 @@
# GTO Documentation

**GTO** is a tool for creating an Artifact Registry in your Git repository. One
of the special cases we would like to highlight is creating a **Machine Learning
Model Registry**.

Such a registry serves as a centralized place to store and operationalize your
artifacts along with their metadata; manage model life-cycle, versions &
releases, and easily automate tests and deployments using GitOps.

<cards>

<card href="/doc/gto/get-started" heading="Get Started">
A step-by-step introduction to basic GTO features
</card>

<card href="/doc/gto/user-guide" heading="User Guide">
Study the detailed inner workings of GTO in its user guide.
</card>

<card href="/doc/gto/use-cases" heading="Use Cases">
Non-exhaustive list of scenarios GTO can help with
</card>

<card href="/doc/gto/command-reference" heading="Command Reference">
See all of GTO's commands
</card>

</cards>

✅ Please join our [community](https://dvc.org/community) or use the
[support](https://dvc.org/support) channels if you have any questions or need
specific help. We are very responsive ⚡.

✅ Check out our [GitHub repository](https://github.com/iterative/gto) and give
us a ⭐ if you like the project!

✅ Contribute to GTO [on GitHub](https://github.com/iterative/gto) or help us
improve this [documentation](https://github.com/iterative/mlem.ai) 🙏.
33 changes: 33 additions & 0 deletions content/docs/gto/install.md
@@ -0,0 +1,33 @@
# Installation

To create an Artifact Registry with GTO, you only need a Git repo and GTO
package installed. There's no need to set up any services or databases, compared
to many other Model Registry offerings.
Contributor

@jorgeorpinel jorgeorpinel Nov 23, 2022

Suggested change
- To create an Artifact Registry with GTO, you only need a Git repo and GTO
- package installed. There's no need to set up any services or databases, compared
- to many other Model Registry offerings.
+ You'll need [Python](https://www.python.org/) to install GTO, and
+ [Git](https://git-scm.com/) to use it.

Contributor Author

Not clear now why you need DB/Services at all - if we talk about GTO installation, let's remove all mentions of MR.

Contributor

I removed the part about DBs because it didn't seem too relevant to mention in the installation page, but it may make sense in other docs.

Not sure I understood your suggestion wrt MR mentions.


To check whether GTO is installed in your environment, run `which gto`. To check
which version is installed, run `gto --version`.
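
The same two checks can be done from Python, which may be handy in setup scripts. This is a small sketch using only the standard library; `gto_install_info` is a hypothetical helper name, and it assumes the package is distributed under the name `gto` (as in `pip install gto`).

```python
import shutil
from importlib import metadata


def gto_install_info():
    """Report where the `gto` executable is and which package version is installed.

    Both values are None when GTO is not available in the current environment.
    """
    try:
        version = metadata.version("gto")  # assumed distribution name
    except metadata.PackageNotFoundError:
        version = None
    return {"executable": shutil.which("gto"), "version": version}


print(gto_install_info())
```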

## Install as a Python library

GTO is a Python library. It works on any OS. You can install it with a package
manager like [pip](https://pypi.org/project/pip/) or
[Conda](https://docs.conda.io/en/latest/), or as a Python
[requirement](https://pip.pypa.io/en/latest/user_guide/#requirements-files).

<admon type="info">

We **strongly** recommend creating a [virtual environment] or using [pipx] to
encapsulate your local environment.

[virtual environment]: https://python.readthedocs.io/en/stable/library/venv.html
[pipx]:
https://packaging.python.org/guides/installing-stand-alone-command-line-tools/

</admon>

```cli
$ pip install gto
```

This will install the `gto` command-line interface (CLI) and make the Python API
available for use in code.
72 changes: 72 additions & 0 deletions content/docs/gto/use-cases/index.md
@@ -0,0 +1,72 @@
# Use Cases

**GTO** is a tool for creating an Artifact Registry in your Git repository. One
of the special cases we would like to highlight is creating a
[Machine Learning Model Registry](/doc/gto/use-cases/model-registry).

Such a registry serves as a centralized place to store and operationalize your
artifacts along with their metadata; manage model life-cycle, versions &
releases, and easily automate tests and deployments using GitOps.

Usually, Artifact Registry usage follows these three steps:

- **Registry**. Track new artifacts and their versions for releases and
  significant changes. Usually this is needed for keeping track of lineage.
- **Lifecycle Management**. Create actionable stages for versions, marking the
  status of an artifact or its readiness to be consumed by a specific
  environment.
- **Downstream Usage**. Signal CI/CD automation or other downstream systems to
  act upon these new versions and lifecycle updates.

GTO helps you achieve all of them in a [GitOps](https://www.gitops.tech) way. If
you would like to see an example, please follow
[Get Started](/doc/gto/get-started).

## Why GTO?

In Software Engineering, Git is the heart of the software system. The code is
committed to Git, and CI/CD triggers on new commits to perform the necessary
downstream actions. Approaches such as [GitOps](https://www.gitops.tech) have
made huge steps towards automating development cycles, reducing errors and
helping maintain productive software development.

Artifact Registries (and Model Registries in particular) usually introduce a
separate service or infrastructure, as well as a new set of APIs to integrate
with. This often means maintaining two different systems, which is a
significant overhead. For example, if you work in Machine Learning, you often
need two teams (Data Science specialists and Software Engineers), each
responsible for maintaining their part of the system.

![](https://i.imgur.com/GTcrytE.png)

GTO builds the registry on top of a Git repository, using Git tags to register
versions and assign stages, and an `artifacts.yaml` file to keep metadata about
artifacts, such as `path`, `type`, and `description`. If your artifact
development is built around Git, you won't need to introduce new things for your
team to manage.

One example (although specific to Model Registries) demonstrates this problem of
handling two worlds at the same time particularly well. When you train your
Machine Learning models, you have to know what code and data were used to do it.
If the Model Registry lives in a separate system, you (or the code you've
written) have to record the code and data snapshots (or just a Git commit
hexsha). If you forget to record the hexsha when you register a new model
version in the Model Registry, or record an incorrect hexsha, no one can
reproduce your training process. Keeping track of both models and their versions
in Git solves that problem.

![](https://i.imgur.com/gViAnOu.png)

## Limitations

There are a few limitations to the GTO approach to building an Artifact
Registry:

- You shouldn't commit artifact binaries to Git itself. You should use Git LFS,
  DVC, or a similar tool.
- Some teams develop artifacts (models) in a single monorepo, others in many
  separate repositories. Since GTO operates on Git tags and files in a single
  Git repository, it can't handle multiple repositories at once.
- GTO is a command-line and Python API tool. That makes it friendly for
  engineers, although less technical folks may need a visual UI.

If you hit the last two limitations, you may find
[Studio](https://dvc.org/doc/studio) useful.
62 changes: 62 additions & 0 deletions content/docs/gto/use-cases/model-registry.md
@@ -0,0 +1,62 @@
# Machine Learning Model Registry

A **model registry** is a tool to catalog ML models and their versions. Models
from your data science projects can be discovered, tested, shared, deployed, and
audited from there. [DVC](https://github.com/iterative/dvc), GTO, and [MLEM]
enable these capabilities on top of Git, so you can stick to an existing
software engineering stack. No more divide between ML engineering and
operations!

[mlem]: /doc

ML model registries give your team key capabilities:

- Collect and organize model [versions] from different sources effectively,
preserving their data provenance and lineage information.
- Share metadata including [metrics and plots][mp] to help use and evaluate
models.
- A standard interface to access all your ML artifacts, from early-stage
[experiments] to production-ready models.
Contributor

This makes me very curious, have I missed something? Given a mlem model, I will know from which experiment it comes from? how? you store the dvc experiment reference as metadata? Is it explained somewhere?

Contributor Author

You're right, there is no such thing as a link from the .mlem metafile to the exact experiment. I guess the idea behind these words is that you have a Git repo, and given a commit, you can get your ML experiment (DVC) and model metadata (MLEM) and a signal of what is production-ready (GTO). Does it make sense now?

Contributor

Ah ok, thanks for the clarification!

Perhaps this is a point to clarify somewhere. When I think of such tools I personally always imagine to have a separate repo where I store e.g. the mlem models. Why? Mainly because at a certain point I would like to train as part of ci/cd and avoid the creation of new commits to the repo itself as part of the ci/cd. I know it's possible via cml, but I would prefer to avoid the need altogether.

Perhaps it's only me...

- Deploy specific models on different environments (dev, shadow, prod, etc.)
without touching the applications that consume them.
- For security, control who can manage models, and audit their usage trails.

[versions]: https://dvc.org/doc/use-cases/versioning-data-and-model-files
Contributor

Semantic versioning is the accepted way to version code. How should artifacts be versioned?
I have been asked this by a Data Scientist some time ago. Given that everyone is free to do whatever he wants, perhaps giving a hint is not bad...?

I formulated a reasonable convention for models, not sure if it could be of any use:

Patch

Model as a black-box is as before, it only outputs different numbers.

Typical scenario 1: model has been trained with more recent data
Typical scenario 2: changed hyper-parameters

Minor

May want to take advantage of additional outputs or additional functionalities

Typical scenario 1: model now has predict_proba() in addition to predict()
Typical scenario 2: model now outputs a json with an additional field confidence_interval, in addition to predicted_values

Major

Need to re-visit the code that calls the model to serve it (breaking change)

Typical scenario 1: model APIs have changed
Typical scenario 2: model expects different input data format
Typical scenario 3: model relies on different libraries, need to re-build the venv (or even the OS-level libraries)

Contributor Author

@aguschin aguschin Nov 10, 2022

Good idea! We can turn this advice into a page in User Guide, e.g. "Semantic versioning for ML models". Not sure should it belong to Studio docs or GTO docs...
@jorgeorpinel, WDYT?

Contributor

It's a discussion for https://mlem.ai/doc/use-cases/model-registry I think. No need to repeat the use case page in here (it's already in the same site). Link to it from GTO docs as needed instead?

Contributor Author

Don't think it's for Use Cases - they're too high-level and these are details. IMHO

Contributor

@jorgeorpinel jorgeorpinel Nov 22, 2022

Right. I meant I didn't think this file we're commenting on was needed here (gone now).

On ML SemVer, Idk. Seems opinionated to give such a precise recommendation. @francesco086 I encourage you to make a separate PR to contribute this in some existing or new page though, then the team can review it and decide.

It would probably belong in GTO docs (merging this PR soon). That's what we use to annotate artifact versions right?

Contributor Author

I've created an issue to address this #231

> That's what we use to annotate artifact versions right?

Yes

[mp]: https://dvc.org/doc/start/metrics-parameters-plots
[experiments]: https://dvc.org/doc/user-guide/experiment-management

Many of these benefits are built into DVC: Your [modeling process] and
[performance data][mp] become **codified** in Git-based <abbr>DVC
repositories</abbr>, making it possible to reproduce and manage models with
standard Git workflows (along with code). Large model files are stored
separately and efficiently, and can be pushed to [remote storage] -- a scalable
access point for [sharing].

<admon type="info">

See also [Data Registry](https://dvc.org/doc/use-cases/data-registry).

</admon>

To make a Git-native registry (on top of DVC or not), one option is to use GTO
(Git Tag Ops). It tags ML model releases and promotions, and links them to
artifacts in the repo using versioned annotations. This creates abstractions for
your models, which lets you **manage their lifecycle** freely and directly from
Git.

And to **productionize** the models, you can save and build them with the [MLEM]
Python API or CLI, which automagically captures all the context needed to
distribute them. It can store model files on the cloud (by itself or with DVC),
list and transfer them within locations, wrap them as a local REST server, or
even containerize and deploy them to cloud providers!

This ecosystem of tools from [Iterative](https://iterative.ai/) brings your ML
process into [GitOps]. This means you can manage and deliver ML models with
software engineering methods such as continuous integration (CI/CD), which can
sync with the state of the artifacts in your registry.

[modeling process]: https://dvc.org/doc/start/data-pipelines
[remote storage]: https://dvc.org/doc/command-reference/remote
[sharing]: https://dvc.org/doc/start/data-and-model-access
[via cml]: https://cml.dev/doc/cml-with-dvc
[gitops]: https://www.gitops.tech/