This repository has been archived by the owner on Oct 16, 2024. It is now read-only.
GTO docs #199
Thanks @francesco086. This is a good point of view. I'm opening a thread based on your comment to keep the discussion about it in a single place.
Minor: If you check out the DVC docs, you'll see there are docs for Studio and DVCLive. We can put this "GTO documentation" there or here (I used mlem.ai because it was easier for me). Or on a separate website, like iterative.ai/doc maybe? Not sure. We need some place to keep GTO docs anyway.
Major: explaining how to build a registry with DVC+GTO+MLEM. Good question where to put that. In this PR you can see I was going to put answers in /doc/gto/user-guide. I guess the Tutorial format would be the best for this, and we could add it to each product involved under Use Cases (e.g. here it could go next to, or instead of, "Pure MLEM Model registry").
The other option is to create a Get Started (GS) with this, but that would be way too heavy for Get Started. I guess a Tutorial or blog post serves the purpose better.
Another place to have this is the Model Registry page in Studio docs. But I'm not sure yet how UI (Studio) and CLI (GTO+DVC+MLEM Tutorial) could co-exist there. Maybe cross-links are a better approach than having this in Studio docs.
Again, good topic to think about 🤔 We also leave CML out of the picture above; it can also be a part of an MR...
@tapadipti, have you had any discussion about setting up a DVC+GTO+MLEM Tutorial to complement Studio docs? Looks like it's much needed, but I can't see that we ever created something like that.
I think it's a good place to put GTO docs if we want docs beyond a CMD/API ref (otherwise we could do with a README and possibly a site like https://docs.iterative.ai/dvc-task/reference/dvc_task/).
We mention it very high-level in https://mlem.ai/doc/use-cases/model-registry now. And there's the https://iterative.ai/model-registry solution page separately. I'm not sure how much we want to go into the details of this 3-way integration. May be a good blog topic indeed. Let's create a separate issue to discuss that, though?
https://iterative.ai/model-registry should have links to all relevant docs pages. But since the docs can't reside there, Studio docs look like the next best place to me for explaining how to build a registry with DVC+GTO+MLEM. We could create a Use cases section. But depending on how much and what content we need, a blog post may also suffice. And docs specific to the GTO CLI should definitely be separate.
This is to be changed. We will host Studio docs separately in its own docs site (like CML), although we don't have dates for this yet.
Ok, so I'm trying to draft that blog post - please see https://www.notion.so/iterative/Tutorial-Model-Registry-in-Git-with-DVC-MLEM-and-GTO-af124368ce9f4523a568a7e1875c7af3 - high-level feedback would be appreciated.
Thanks @aguschin. I've left some comments in the draft blog post.
👍🏼 👍🏼 👍🏼
I kind of like that we start by showing the end-result! It's a good way to deliver the value proposition quickly in here (main purpose of this doc).
Not clear now why you need DB/Services at all - if we talk about GTO installation, let's remove all mentions of MR.
I removed the part about DBs because it didn't seem too relevant to mention in the installation page, but it may make sense in other docs.
Not sure I understood your suggestion wrt MR mentions.
This makes me very curious: have I missed something? Given an MLEM model, will I know which experiment it comes from? How? Do you store the DVC experiment reference as metadata? Is it explained somewhere?
You're right, there is no such thing as a link from the .mlem metafile to the exact experiment. I guess the idea behind these words is that you have a Git repo, and given a commit, you can get your ML experiment (DVC), the model metadata (MLEM), and a signal of what is production-ready (GTO). Does it make sense now?
Ah ok, thanks for the clarification!
Perhaps this is a point to clarify somewhere. When I think of such tools, I personally always imagine having a separate repo where I store e.g. the MLEM models. Why? Mainly because at a certain point I would like to train as part of CI/CD and avoid creating new commits in the repo itself as part of the CI/CD. I know it's possible via CML, but I would prefer to avoid the need altogether.
Perhaps it's only me...
Semantic versioning is the accepted way to version code. How should artifacts be versioned?
A Data Scientist asked me this some time ago. Given that everyone is free to do whatever they want, perhaps giving a hint is not a bad idea...?
I formulated a reasonable convention for models, not sure if it could be of any use:
Patch
The model as a black box behaves as before; it only outputs different numbers.
Typical scenario 1: model has been trained with more recent data
Typical scenario 2: changed hyper-parameters
Minor
Callers may want to take advantage of additional outputs or additional functionality.
Typical scenario 1: model now has predict_proba() in addition to predict()
Typical scenario 2: model now outputs a JSON with an additional field confidence_interval, in addition to predicted_values
Major
Need to revisit the code that calls the model to serve it (breaking change).
Typical scenario 1: model APIs have changed
Typical scenario 2: model expects a different input data format
Typical scenario 3: model relies on different libraries, need to rebuild the venv (or even the OS-level libraries)
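To make the convention above concrete, here is a minimal Python sketch that maps the three kinds of change to a version bump. Everything in it is illustrative: the names Change and bump are hypothetical, the "v" prefix on version strings is an assumption, and none of it is part of GTO, MLEM, or DVC.

```python
from enum import Enum


class Change(Enum):
    """Kinds of model change, following the convention proposed above (hypothetical names)."""
    PATCH = "patch"  # same interface, different numbers (new data, new hyper-parameters)
    MINOR = "minor"  # additive: extra outputs or methods, existing callers keep working
    MAJOR = "major"  # breaking: changed API, input format, or runtime dependencies


def bump(version: str, change: Change) -> str:
    """Return the next semantic version for a model, given the kind of change."""
    # Assumption: versions look like "v1.2.3"; a missing "v" prefix is tolerated.
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    if change is Change.MAJOR:
        return f"v{major + 1}.0.0"
    if change is Change.MINOR:
        return f"v{major}.{minor + 1}.0"
    return f"v{major}.{minor}.{patch + 1}"
```

For example, retraining on newer data would be bump("v1.2.3", Change.PATCH), adding predict_proba() would be Change.MINOR, and changing the input format would be Change.MAJOR.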
Good idea! We can turn this advice into a page in the User Guide, e.g. "Semantic versioning for ML models". Not sure whether it should belong to Studio docs or GTO docs...
@jorgeorpinel, WDYT?
It's a discussion for https://mlem.ai/doc/use-cases/model-registry I think. No need to repeat the use case page in here (it's already in the same site). Link to it from GTO docs as needed instead?
Don't think it's for Use Cases - they're too high-level and these are details. IMHO
Right. I meant I didn't think this file we're commenting on was needed here (gone now).
On ML SemVer, Idk. Seems opinionated to give such a precise recommendation. @francesco086 I encourage you to make a separate PR to contribute this in some existing or new page though, then the team can review it and decide.
It would probably belong in GTO docs (merging this PR soon). That's what we use to annotate artifact versions, right?
I've created an issue to address this: #231
Yes