From 9f3d85f2ed8b93bedb490901c4ad8dc1f1ee2ed9 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Perez Date: Tue, 24 May 2022 02:45:09 -0600 Subject: [PATCH 1/7] nav update --- content/docs/sidebar.json | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index da9464b1..647ba84f 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -74,16 +74,16 @@ "label": "Basic concepts", "source": "user-guide/basic-concepts.md" }, - { - "slug": "datasets", - "label": "Working with datasets", - "source": "user-guide/datasets.md" - }, { "slug": "project-structure", "label": "Project structure", "source": "user-guide/project-structure.md" }, + { + "slug": "datasets", + "label": "Working with datasets", + "source": "user-guide/datasets.md" + }, { "slug": "remote-repos", "label": "Working with repositories and remote objects", From 9b60b8c1fca3a94b6ae98619d8f2cf3ea2a6cf44 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Perez Date: Tue, 24 May 2022 02:48:21 -0600 Subject: [PATCH 2/7] remoge datasets and remote-repos guides --- content/docs/sidebar.json | 10 -- content/docs/user-guide/datasets.md | 113 --------------------- content/docs/user-guide/remote-repos.md | 128 ------------------------ 3 files changed, 251 deletions(-) delete mode 100644 content/docs/user-guide/datasets.md delete mode 100644 content/docs/user-guide/remote-repos.md diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 647ba84f..f2c7a376 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -79,16 +79,6 @@ "label": "Project structure", "source": "user-guide/project-structure.md" }, - { - "slug": "datasets", - "label": "Working with datasets", - "source": "user-guide/datasets.md" - }, - { - "slug": "remote-repos", - "label": "Working with repositories and remote objects", - "source": "user-guide/remote-repos.md" - }, { "slug": "configuration", "label": "Configuration", diff --git a/content/docs/user-guide/datasets.md b/content/docs/user-guide/datasets.md deleted file mode 100644 index 9dfbb512..00000000 --- a/content/docs/user-guide/datasets.md +++ /dev/null @@ -1,113 +0,0 @@ -# Working with datasets - -## Getting the data - -The first step is to get some data. For this tutorial, we’ll just generate it. -Let's take a look at this python script: - -```py -# prepare.py -from mlem.api import save -from sklearn.datasets import load_iris -from sklearn.model_selection import train_test_split - -def main(): - data, y = load_iris(return_X_y=True, as_frame=True) - data["target"] = y - train_data, test_data = train_test_split(data, random_state=42) - save(train_data, "train.csv") - save(test_data.drop("target", axis=1), "test_x.csv") - save(test_data[["target"]], "test_y.csv") - -if __name__ == "__main__": - main() -``` - -Here we load the well-known iris dataset with sklearn, and then save parts of it -with MLEM. For now, we just save them locally and push them to Git later. - -Let's execute this script and see what was produced: - -```cli -$ python prepare.py -$ tree .mlem/dataset/ -.mlem/dataset/ -├── test_x.csv -├── test_x.csv.mlem -├── test_y.csv -├── test_y.csv.mlem -├── train.csv -└── train.csv.mlem -``` - -What we see here is that every DataFrame was saved along with some metadata -about it. 
Let's see one example: - -```cli -$ head -5 .mlem/dataset/train.csv -,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target -4,5.0,3.6,1.4,0.2,0 -32,5.2,4.1,1.5,0.1,0 -142,5.8,2.7,5.1,1.9,2 -85,6.0,3.4,4.5,1.6,1 -``` - -
- -### `$ cat .mlem/dataset/train.csv.mlem` - -```yaml -artifacts: - data: - hash: add43029d2b464d0884a7d3105ef0652 - size: 2459 - uri: train.csv -object_type: dataset -reader: - dataset_type: - columns: - - '' - - sepal length (cm) - - sepal width (cm) - - petal length (cm) - - petal width (cm) - - target - dtypes: - - int64 - - float64 - - float64 - - float64 - - float64 - - int64 - index_cols: - - '' - type: dataframe - format: csv - type: pandas -requirements: - - module: pandas - version: 1.4.2 -``` - -
We can see what was saved here: the dataset schema and the requirements on the
libraries that were used to save it. That doesn't mean you can't read
`train.csv` any other way, but if you use MLEM to load it, MLEM will know that
it needs pandas to do so.

Note that we didn't specify whether the saved dataset was a `pd.DataFrame`,
`np.array` or `tf.Tensor`. MLEM figures that out for you, and this handy magic
extends to ML models 👋
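For example, loading the dataset back with MLEM gives us a pandas DataFrame
again, without having to say so anywhere. Here's a minimal sketch (assuming it
runs from the repository root where the data was saved):

```py
from mlem.api import load

train = load("train.csv")  # MLEM reads the metafile and loads the CSV with pandas
print(type(train))         # <class 'pandas.core.frame.DataFrame'>
print(train.shape)
```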
- -### ⛳ [Data prepared](https://github.com/iterative/example-mlem-get-started/tree/2-prepare) - -```cli -$ git add .mlem -$ git commit -m "Added data" -$ git diff 2-prepare -``` - -
diff --git a/content/docs/user-guide/remote-repos.md b/content/docs/user-guide/remote-repos.md deleted file mode 100644 index 5715da97..00000000 --- a/content/docs/user-guide/remote-repos.md +++ /dev/null @@ -1,128 +0,0 @@ -# Working with repositories and remote objects - -
### 🧳 Requirements

We need to install DVC since the model binaries in the remote example repo are
stored in a cloud remote with DVC’s help. In another section we’ll show how
MLEM works with DVC in more detail.

`pip install dvc[s3]`
- -## Listing objects - -Since we've saved the data and the model in the repository, let's list them: - -```cli -$ mlem ls -``` - -```yaml -Datasets: - - test_x.csv - - test_y.csv - - train.csv -Models: - - rf -``` - -Note that we are actually listing models and data which is saved in the -repository we're in. - -But what if they are stored in a remote Git repository, and we don't want to -clone it? MLEM can also work with remote repositories: - -```cli -$ mlem ls https://github.com/iterative/example-mlem-get-started --type model -``` - -```yaml -Models: - - rf -``` - -We also can use URL addresses to load models from remote repositories directly: - -```py -from mlem.api import load - -model = load("https://github.com/iterative/example-mlem-get-started/rf") -# or -model = load( - "rf", - repo="https://github.com/iterative/example-mlem-get-started", - rev="main" -) -``` - -If we just want to download the model to a local disk to use it later, we can -run `clone` command - -```cli -$ mlem clone https://github.com/iterative/example-mlem-get-started/rf ml_model -``` - -The other way to do it is to run - -```cli -$ mlem clone rf --repo https://github.com/iterative/example-mlem-get-started --rev main ml_model -``` - -
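Either way, the cloned copy can then be loaded and used locally like any other
MLEM model. A minimal sketch (assuming the cloned object is the scikit-learn
model from the example repo, so its native `predict` is available):

```py
from mlem.api import load

# the local copy created by `mlem clone` above
model = load("ml_model")

# grab a dataset straight from the example repo to try the model on
test_x = load(
    "test_x.csv",
    repo="https://github.com/iterative/example-mlem-get-started",
    rev="main",
)

print(model.predict(test_x))
```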
- -### 💡 Expand to use your own repo - -We use [example repo](https://github.com/iterative/example-mlem-get-started) in -the commands, but you can create your own repo and use it if you want. - -To push your models and datasets to the repo, add them to Git and commit - -```cli -$ git add .mlem *.py -$ git commit -am "committing mlem objects and code" -$ git push -``` - -
- -## Cloud remotes - -If you don’t have the need to version your models, but you want to store your -objects in some remote location, you can use MLEM with any cloud/remote -supported by -[fsspec](https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations), -e.g. s3. - -To do so, use paths with corresponding file system protocol and path like -`s3:///` - -```cli -$ mlem init s3://example-mlem-get-started -$ mlem clone rf s3://example-mlem-get-started/rf -⏳️ Loading meta from .mlem/model/rf.mlem -🐏 Cloning .mlem/model/rf.mlem -💾 Saving model to s3://example-mlem-get-started/.mlem/model/rf.mlem -``` - -Now you can load this model via API or use it in CLI commands just like if it -was local: - -```py -from mlem.api import load -model = load("rf", repo="s3://example-mlem-get-started") -``` - -```cli -$ mlem apply rf --repo s3://example-mlem-get-started test_x.csv --json -[1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2, 0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0] -``` - -TL;DR: we've just - -1. Listed all MLEM models in the Git repo, -2. Loaded model from Git repo directly, -3. Initialized MLEM in remote bucket and worked with just like with a regular - folder. From 2f85b77869dbeac2ac0f93a1134178d72f987fca Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Perez Date: Tue, 24 May 2022 03:12:42 -0600 Subject: [PATCH 3/7] term: .mlem/ dir vs .mlem file --- content/docs/api-reference/import_object.md | 4 ++-- content/docs/api-reference/init.md | 2 +- content/docs/command-reference/import.md | 6 +++--- content/docs/command-reference/init.md | 2 +- content/docs/get-started/saving.md | 4 ++-- content/docs/use-cases/dvc.md | 7 +++---- content/docs/user-guide/basic-concepts.md | 8 ++++---- content/docs/user-guide/project-structure.md | 3 ++- 8 files changed, 18 insertions(+), 18 deletions(-) diff --git a/content/docs/api-reference/import_object.md b/content/docs/api-reference/import_object.md index a9a320c9..1d8bba92 100644 --- a/content/docs/api-reference/import_object.md +++ b/content/docs/api-reference/import_object.md @@ -58,8 +58,8 @@ command. 'pandas']. Defaults to auto-infer. - `copy_data` (optional) - Whether to create a copy of file in target location or just link existing file. Defaults to True. -- `external` (optional) - Save result not in `.mlem`, but directly in repo -- `index` (optional) - Whether to index output in `.mlem` directory +- `external` (optional) - Save result directly in repo (not in `.mlem/`) +- `index` (optional) - Whether to index output in `.mlem/` directory ## Exceptions diff --git a/content/docs/api-reference/init.md b/content/docs/api-reference/init.md index d4ec2ce5..207c75c0 100644 --- a/content/docs/api-reference/init.md +++ b/content/docs/api-reference/init.md @@ -1,6 +1,6 @@ # mlem.api.init() -Creates `.mlem/` directory in `path` +Creates and populates the `.mlem/` directory in `path`. ```py def init(path: str = ".") -> None diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index 9dc47a72..3d9f4c3d 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -15,9 +15,9 @@ TARGET Path to save MLEM object [required] ## Description Use `import` on an existing datasets or model files (or directories) to -auto-generate the necessary MLEM metadata (`.mlem`) files for them. This is -useful to quickly make existing datasets and model files compatible with MLEM, -which can then be used in future operations such as `mlem apply`. 
+auto-generate the necessary MLEM metadata (`.mlem` extension) files for them. +This is useful to quickly make existing datasets and model files compatible with +MLEM, which can then be used in future operations such as `mlem apply`. This command provides a quick and easy alternative to writing python code to load those models/datasets into object for subsequent usage in MLEM context. diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index 0c11cdf6..2de742f9 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -13,7 +13,7 @@ arguments: [PATH] Target path to workspace ## Description The `init` command (without given `path`) defaults to the current directory for -the path argument. This creates a `.mlem` directory and an empty `config.yaml` +the path argument. This creates a `.mlem/` directory and an empty `config.yaml` file inside it. Although we recommend using MLEM within a Git repository to track changes using diff --git a/content/docs/get-started/saving.md b/content/docs/get-started/saving.md index 98c23aec..47536a9c 100644 --- a/content/docs/get-started/saving.md +++ b/content/docs/get-started/saving.md @@ -56,8 +56,8 @@ $ tree .mlem/model/ > reference. What we see here is that model was saved along with some metadata about it: `rf` -containing the model binary and `.mlem` file containing metadata. Let's take a -look at it: +containing the model binary and a `.mlem` file containing its metadata. Let's +take a look at it:
diff --git a/content/docs/use-cases/dvc.md b/content/docs/use-cases/dvc.md index d768633f..cd28d434 100644 --- a/content/docs/use-cases/dvc.md +++ b/content/docs/use-cases/dvc.md @@ -49,7 +49,7 @@ $ mlem config set default_storage.type dvc ``` Also, let’s add `.mlem` files to `.dvcignore` so that metafiles are ignored by -DVC +DVC. ```cli $ echo "/**/?*.mlem" > .dvcignore @@ -90,9 +90,8 @@ can process your data and train your model. You may be already training your ML models in them and what to start using MLEM to save those models. MLEM could be easily plug in into existing DVC pipelines. If you already added -`.mlem` files to `.dvcignore`, you are good to go for most of the cases. Since -DVC will ignore `.mlem` files, you don't need to add them as outputs and mark -them with `cache: false`. +`.mlem` files to `.dvcignore`, you are good to go for most of the cases (no need +to make them into `cache: false` outputs). It becomes a bit more complicated when you need to add them as outputs, because you want to use them as inputs to next stages. The case may be when model binary diff --git a/content/docs/user-guide/basic-concepts.md b/content/docs/user-guide/basic-concepts.md index affae667..5c7ef3d0 100644 --- a/content/docs/user-guide/basic-concepts.md +++ b/content/docs/user-guide/basic-concepts.md @@ -12,13 +12,13 @@ datasets and other types you can read about below. > Also, MLEM Objects can be created with > [`mlem create`](/doc/command-reference/create) CLI command -MLEM Objects are saved as `.mlem` files in `yaml` format. Sometimes they can +MLEM Objects are saved as `.mlem` files in YAML format. Sometimes they can have other files attached to them, in that case we call `.mlem` file as a "metadata file" or "metafile" and all the other files we call "artifacts". -Typically, if **MLEM Object** have only one artifact, it will have the same name -without `.mlem` extension, for example `model.mlem` + `model`, or `data.csv` + -`data.csv.mlem`. +Typically, if **MLEM Object** have only one artifact, it will have the same file +name without `.mlem` extension, for example `model.mlem` and `model`, or +`data.csv` and `data.csv.mlem`. If **MLEM Object** have multiple artifacts, they will be stored in a directory with the same name, for example `model.mlem` + `model/data.pkl` + diff --git a/content/docs/user-guide/project-structure.md b/content/docs/user-guide/project-structure.md index bf77d1e2..f035fb78 100644 --- a/content/docs/user-guide/project-structure.md +++ b/content/docs/user-guide/project-structure.md @@ -8,7 +8,8 @@ To create one, use [`mlem init`](/doc/command-reference/init) or `config.yaml` (see [Configuration](/doc/user-guide/configuration)). > Some API and CLI commands like `mlem ls` and `mlem config` require this -> execution context. But in general, MLEM can work with `.mlem` files anywhere. +> execution context. But in general, MLEM can work with `.mlem` metafiles +> anywhere. A common place to initialize MLEM is a data science Git repository. 
_MLEM repositories_ help you better structure and easily address existing data From ff2c49ca3fc5ed1296bb252fe731c080f29bcb2d Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Perez Date: Tue, 24 May 2022 03:20:44 -0600 Subject: [PATCH 4/7] Lint --- content/docs/api-reference/import_object.md | 2 +- content/docs/user-guide/basic-concepts.md | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/content/docs/api-reference/import_object.md b/content/docs/api-reference/import_object.md index 1d8bba92..796fce2a 100644 --- a/content/docs/api-reference/import_object.md +++ b/content/docs/api-reference/import_object.md @@ -58,7 +58,7 @@ command. 'pandas']. Defaults to auto-infer. - `copy_data` (optional) - Whether to create a copy of file in target location or just link existing file. Defaults to True. -- `external` (optional) - Save result directly in repo (not in `.mlem/`) +- `external` (optional) - Save result directly in repo (not in `.mlem/`) - `index` (optional) - Whether to index output in `.mlem/` directory ## Exceptions diff --git a/content/docs/user-guide/basic-concepts.md b/content/docs/user-guide/basic-concepts.md index 5c7ef3d0..91b1d724 100644 --- a/content/docs/user-guide/basic-concepts.md +++ b/content/docs/user-guide/basic-concepts.md @@ -12,9 +12,9 @@ datasets and other types you can read about below. > Also, MLEM Objects can be created with > [`mlem create`](/doc/command-reference/create) CLI command -MLEM Objects are saved as `.mlem` files in YAML format. Sometimes they can -have other files attached to them, in that case we call `.mlem` file as a -"metadata file" or "metafile" and all the other files we call "artifacts". +MLEM Objects are saved as `.mlem` files in YAML format. Sometimes they can have +other files attached to them, in that case we call `.mlem` file as a "metadata +file" or "metafile" and all the other files we call "artifacts". Typically, if **MLEM Object** have only one artifact, it will have the same file name without `.mlem` extension, for example `model.mlem` and `model`, or From c88802f4f6804e44b2b29719f9cb9d787b3adecd Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Perez Date: Tue, 24 May 2022 03:43:36 -0600 Subject: [PATCH 5/7] term: metadata vs. metafile --- content/docs/api-reference/save.md | 2 +- content/docs/command-reference/create.md | 13 ++++---- .../docs/command-reference/deploy/index.md | 2 +- content/docs/command-reference/import.md | 10 +++--- content/docs/command-reference/pprint.md | 4 +-- content/docs/get-started/saving.md | 6 ++-- content/docs/use-cases/dvc.md | 32 ++++++++++--------- content/docs/user-guide/basic-concepts.md | 6 ++-- content/docs/user-guide/mlem-abcs.md | 2 +- 9 files changed, 39 insertions(+), 38 deletions(-) diff --git a/content/docs/api-reference/save.md b/content/docs/api-reference/save.md index 4422e47c..b9eb4a32 100644 --- a/content/docs/api-reference/save.md +++ b/content/docs/api-reference/save.md @@ -40,7 +40,7 @@ systems (eg: `S3`). 
The function returns and saves the object as a - `repo` (optional) - path to mlem repo - `sample_data` (optional) - If the object is a model or function, you can provide input data sample, so MLEM will include it's schema in the model's - metadata + metafile - `fs` (optional) - FileSystem for the `path` argument - `index` (optional) - Whether to add object to mlem repo index - `external` (optional) - if obj is saved to repo, whether to put it outside of diff --git a/content/docs/command-reference/create.md b/content/docs/command-reference/create.md index 6d15c131..ebbab9e7 100644 --- a/content/docs/command-reference/create.md +++ b/content/docs/command-reference/create.md @@ -1,7 +1,7 @@ # create Creates a new [MLEM Object](/doc/user-guide/basic-concepts#mlem-objects) -metafile from conf args and config files. +metafile from config args and config files. ## Synopsis @@ -16,10 +16,9 @@ PATH Where to save object [required] ## Description -Metadata files (with `.mlem` file extension) can be created for +`.mlem` metafiles can be created for [MLEM Objects](/doc/user-guide/basic-concepts#mlem-objects) using this command. -This is particularly useful in filling up configuration values for environments -and deployments. +This is particularly useful for configuring environments and deployments. Each MLEM Object, along with its subtype (which represents a particular implementation), will accept different configuration arguments. The list of @@ -38,10 +37,10 @@ check out the last example [here](/doc/command-reference/types#examples) ## Examples -Create an environment metafile with a config key +Create an environment object metafile with a config key: ```cli -# Fetch all config arguments which can be passed for a heroku env +# Fetch all available config args for a heroku env $ mlem types env heroku [not required] api_key: str = None @@ -49,7 +48,7 @@ $ mlem types env heroku $ mlem create env heroku production --conf api_key="mlem_heroku_staging" 💾 Saving env to .mlem/env/staging.mlem -# print the contents of the saved metafile for the heroku env +# Print the contents of the new heroku env metafile $ cat .mlem/env/staging.mlem api_key: mlem_heroku_staging object_type: env diff --git a/content/docs/command-reference/deploy/index.md b/content/docs/command-reference/deploy/index.md index 74bee919..ea2b473c 100644 --- a/content/docs/command-reference/deploy/index.md +++ b/content/docs/command-reference/deploy/index.md @@ -25,7 +25,7 @@ serving a specific model, using a specific environment definition, and running on a target platform. MLEM deployments allow `applying` methods and even whole datasets on models. -Each model lists its supported methods in its metafile, and those are +Each model lists its supported methods in its `.mlem` metafile, and those are automatically used by MLEM to wire and expose endpoints on the application server upon deployment. Applying datasets on the deployment is a very handy shortcut of bulk inferring data on the served model. diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index 3d9f4c3d..70d8d815 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -1,6 +1,6 @@ # import -Create a MLEM model or dataset metadata from a file/directory. +Create a `.mlem` metafile for a model or dataset in any file or directory. 
## Synopsis @@ -14,10 +14,10 @@ TARGET Path to save MLEM object [required] ## Description -Use `import` on an existing datasets or model files (or directories) to -auto-generate the necessary MLEM metadata (`.mlem` extension) files for them. -This is useful to quickly make existing datasets and model files compatible with -MLEM, which can then be used in future operations such as `mlem apply`. +Use `import` on an existing datasets or model files (or directories) to generate +the necessary `.mlem` metafiles for them. This is useful to quickly make +existing datasets and model files compatible with MLEM, which can then be used +in future operations such as `mlem apply`. This command provides a quick and easy alternative to writing python code to load those models/datasets into object for subsequent usage in MLEM context. diff --git a/content/docs/command-reference/pprint.md b/content/docs/command-reference/pprint.md index abce41b0..0cb70bd6 100644 --- a/content/docs/command-reference/pprint.md +++ b/content/docs/command-reference/pprint.md @@ -15,8 +15,8 @@ arguments: PATH Path to object [required] ## Description All MLEM objects can be printed to view their metadata. This includes generic -metadata information such as requirements, type of object, hash, size, as well -as object specific information such as `methods` for a `model` or `reader` for a +information such as requirements, type of object, hash, size, as well as object +specific information such as `methods` for a `model` or `reader` for a `dataset`. Since only one specific object is printed, a `PATH` to the specific MLEM object diff --git a/content/docs/get-started/saving.md b/content/docs/get-started/saving.md index 47536a9c..62a4e159 100644 --- a/content/docs/get-started/saving.md +++ b/content/docs/get-started/saving.md @@ -55,9 +55,9 @@ $ tree .mlem/model/ > changed, see [project structure](/doc/user-guide/project-structure) for > reference. -What we see here is that model was saved along with some metadata about it: `rf` -containing the model binary and a `.mlem` file containing its metadata. Let's -take a look at it: +The model was saved along with some metadata about it: `rf` containing the model +binary and a `.mlem` metafile containing information about it. Let's take a look +at it:
diff --git a/content/docs/use-cases/dvc.md b/content/docs/use-cases/dvc.md index cd28d434..7bfb9195 100644 --- a/content/docs/use-cases/dvc.md +++ b/content/docs/use-cases/dvc.md @@ -66,15 +66,18 @@ $ git rm -r --cached .mlem/ $ python train.py ``` -Finally, let’s add new metafiles to Git and artifacts to DVC respectively, -commit and push them +Finally, let’s add and commit new metafiles to Git and artifacts to DVC, +respectively: ```cli $ dvc add .mlem/model/rf .mlem/dataset/*.csv $ git add .mlem $ git commit -m "Switch to dvc storage" +... + $ dvc push -r myremote $ git push +... ``` Now, you can load MLEM objects from your repo even though there are no actual @@ -89,17 +92,16 @@ DVC pipelines are the useful DVC mechanism to build data pipelines, in which you can process your data and train your model. You may be already training your ML models in them and what to start using MLEM to save those models. -MLEM could be easily plug in into existing DVC pipelines. If you already added -`.mlem` files to `.dvcignore`, you are good to go for most of the cases (no need -to make them into `cache: false` outputs). +MLEM can be easily plugged into existing DVC pipelines. If you already added +`.mlem` files to `.dvcignore`, you are good to go for most of the cases. -It becomes a bit more complicated when you need to add them as outputs, because -you want to use them as inputs to next stages. The case may be when model binary -doesn't change for you, but model metadata does. That may happen if you change -things like model description or labels. +It becomes a bit more complicated when you need to add them as inputs to +pipeline stages. For example, when a model binary doesn't change, but its +metadata (e.g. model description or labels) does. things like model description +or labels. To work with that, you'll need to remove `.mlem` files from `.dvcignore` and -mark your outputs in DVC Pipeline with `cache: false`. +make them `cache: false` outputs in the pipeline. ## Example @@ -117,7 +119,8 @@ stages: ``` Next step would be to start saving your models with MLEM. Since MLEM saves both -**binary** and **metadata** you need to have both of them in DVC pipeline: +the binary and corresponding `.mlem` metafile, you need to have both of them in +the DVC pipeline: ```yaml # dvc.yaml @@ -132,9 +135,8 @@ stages: cache: false ``` -Since binary was already captured before, we don't need to add anything for it. -For metadata, we've added two rows to capture it and specify `cache: false` -since we want the metadata to be committed to Git, and not be pushed to DVC -remote. +The binary was already in, so there's no need to add it again. For the metafile, +we've added two rows and specify `cache: false` to track it with DVC while +storing it in Git. Now MLEM is ready to be used in your DVC pipeline! diff --git a/content/docs/user-guide/basic-concepts.md b/content/docs/user-guide/basic-concepts.md index 91b1d724..c67907fa 100644 --- a/content/docs/user-guide/basic-concepts.md +++ b/content/docs/user-guide/basic-concepts.md @@ -12,9 +12,9 @@ datasets and other types you can read about below. > Also, MLEM Objects can be created with > [`mlem create`](/doc/command-reference/create) CLI command -MLEM Objects are saved as `.mlem` files in YAML format. Sometimes they can have -other files attached to them, in that case we call `.mlem` file as a "metadata -file" or "metafile" and all the other files we call "artifacts". +MLEM Objects are saved as special _metafiles_ in YAML format with the `.mlem` +extension. 
These may or may not have _artifacts_ (other files or directories) +associated. Typically, if **MLEM Object** have only one artifact, it will have the same file name without `.mlem` extension, for example `model.mlem` and `model`, or diff --git a/content/docs/user-guide/mlem-abcs.md b/content/docs/user-guide/mlem-abcs.md index 074733cd..62cb4a76 100644 --- a/content/docs/user-guide/mlem-abcs.md +++ b/content/docs/user-guide/mlem-abcs.md @@ -123,7 +123,7 @@ will be pickled, and NN will be saved using `torch_io` ## DatasetType -Hold metadata about dataset, like type, dimensions, column names etc. +Holds metadata about dataset, like type, dimensions, column names etc. **Base class**: `mlem.core.dataset_type.DatasetType` From 6e601dc8a10dd10ab9155d0abb6b9e9d36017ea5 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 25 May 2022 13:43:13 -0500 Subject: [PATCH 6/7] Update content/docs/use-cases/dvc.md --- content/docs/use-cases/dvc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/use-cases/dvc.md b/content/docs/use-cases/dvc.md index 7bfb9195..01e05914 100644 --- a/content/docs/use-cases/dvc.md +++ b/content/docs/use-cases/dvc.md @@ -93,7 +93,7 @@ can process your data and train your model. You may be already training your ML models in them and what to start using MLEM to save those models. MLEM can be easily plugged into existing DVC pipelines. If you already added -`.mlem` files to `.dvcignore`, you are good to go for most of the cases. +`.mlem` files to `.dvcignore`, you are good to go in most cases. It becomes a bit more complicated when you need to add them as inputs to pipeline stages. For example, when a model binary doesn't change, but its From 63d58b22acdb94a7e358f9afab6e473ba75d0268 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 25 May 2022 14:09:26 -0500 Subject: [PATCH 7/7] cases: simplify DVC integration matching https://github.com/iterative/mlem.ai/pull/79#discussion_r882017743 --- content/docs/use-cases/dvc.md | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/content/docs/use-cases/dvc.md b/content/docs/use-cases/dvc.md index 01e05914..bfd0ac71 100644 --- a/content/docs/use-cases/dvc.md +++ b/content/docs/use-cases/dvc.md @@ -93,15 +93,9 @@ can process your data and train your model. You may be already training your ML models in them and what to start using MLEM to save those models. MLEM can be easily plugged into existing DVC pipelines. If you already added -`.mlem` files to `.dvcignore`, you are good to go in most cases. - -It becomes a bit more complicated when you need to add them as inputs to -pipeline stages. For example, when a model binary doesn't change, but its -metadata (e.g. model description or labels) does. things like model description -or labels. - -To work with that, you'll need to remove `.mlem` files from `.dvcignore` and -make them `cache: false` outputs in the pipeline. +`.mlem` files to `.dvcignore`, you are good to go. Otherwise you'll need to +mark `.mlem` files as `cache: false` [outputs] of a pipelines stage. +[outputs]: https://dvc.org/doc/user-guide/project-structure/pipelines-files#output-subfields ## Example