This repository has been archived by the owner on Sep 13, 2023. It is now read-only.

MLEM index and .mlem/ folder #381

Closed
aguschin opened this issue Aug 17, 2022 · 14 comments · Fixed by #395
Labels
design MLEM design question and decisions .mlem/ Everything inside .mlem folder ❓ type: question A question about MLEM

Comments

@aguschin
Contributor

Feedback from @daavoo

Why store everything in .mlem? How can I choose the output folder?
I discovered external in project structure page, perhaps worth explicit mention?

This brings us back to the discussion of having an index in MLEM. I know @shcheklein and @mike0sv had conversations about this. So, have you decided something on this? From what I recall:

  1. In GS we should delete executing $ mlem init to untie MLEM usage from MLEM project (we have a ticket for it in mlem.ai repo)
  2. To support mlem ls without .mlem/ we need to implement repo parsing (just like DVC I suppose)
@aguschin aguschin added .mlem/ Everything inside .mlem folder ❓ type: question A question about MLEM labels Aug 17, 2022
@shcheklein
Member

I think we should not be doing mlem ls. It feels that GTO (or DVC?) should handle the indexing / discovery part. MLEM should stay focused on two things - packaging models + deploying them. What are the implications if we decide to drop the index?

@mike0sv
Contributor

mike0sv commented Aug 21, 2022

Interestingly, rn ls is the most used cli command :)
I think we should revisit this question after we do more work on gto-mlem integration

@dmpetrov
Member

ls is the most used command because it is a way to procrastinate when I don't know what to do 😄 I'd not consider this a strong signal.

The .mlem dir goes a bit against the principle of codification, because .mlem introduces a special internal structure (like a database), while codification assumes all the artifacts are user-visible and actionable. If you have a visible artifact then you won't need mlem ls: a simple ls is enough to procrastinate.

I'd suggest removing the dir along with the ls command. If you find it useful, it can be replaced by something like find, but without the internal structure.

@aguschin
Contributor Author

IMO, ls/find may be useful when you come across a new repo with MLEM models and want to discover all of them. Or if you use an S3 bucket as intermediate storage right before deploy and want to list the MLEM models there.

Also, what are the reasons for DVC to have dvc ls?

@aguschin
Contributor Author

related to #247

@shcheklein
Member

Also, what are the reasons for DVC to have dvc ls

DVC has a data management layer, and ls serves as an extension of the regular ls to see the files that are not pulled yet. Feels like a completely different use case.

And even in that case DVC still doesn't have an index, for good or bad.

@mike0sv
Contributor

mike0sv commented Sep 1, 2022

MLEM also has a concept of a project, which is tied to the .mlem dir, ls, config.yaml, and indexing.
At first I thought that if we get rid of the .mlem dir, we would need to get rid of projects.
That is very complex, since they are at the very core of the mlem code. Also, users would need to reference objects relatively within one repo, e.g. a deployment configuration in <repo>/deployments/aaa/bbb.mlem would have a field model: ../../models/aaa/bbb. And it would fail if we run it from another dir.
But we can retain projects by switching their logic from "a dir that has a .mlem dir" to "a dir that has .mlem.yaml" (ex-.mlem/config.yaml).
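
To illustrate the relative-reference problem, here is a minimal sketch using only the standard library. It assumes references are plain POSIX-style paths; the helper name resolve and the /repo layout are hypothetical and are not MLEM code. It only shows why a ../../ reference depends on the base directory it is resolved against, while a project-relative reference does not.

```python
import posixpath


def resolve(base: str, ref: str) -> str:
    """Resolve a reference against a base directory (hypothetical helper)."""
    return posixpath.normpath(posixpath.join(base, ref))


relative_ref = "../../models/aaa/bbb"

# Resolved against the directory that holds deployments/aaa/bbb.mlem: works.
print(resolve("/repo/deployments/aaa", relative_ref))

# Resolved against the repo root (e.g. the command was run from there):
# the reference escapes the repo and points at the wrong place.
print(resolve("/repo", relative_ref))

# A project-relative reference resolves the same way from any working dir,
# because the base is always the project root.
print(resolve("/repo", "models/aaa/bbb"))
```

With a project root available, the reference can always be written relative to that root, which is what keeping projects would preserve.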

@mike0sv
Copy link
Contributor

mike0sv commented Sep 12, 2022

To summarize a bit
What we have now:
An MLEM project is any fsspec location (local, GitHub, S3, etc.) that has a .mlem directory. This directory also contains config.yaml, which sets project-wide configuration (everything has defaults, so it can be empty).
If you don't have a project: save(path, obj) will save artifacts to path (as a file if there is a single artifact, as a dir if there are multiple) and metadata to the path.mlem file. load(path) (and consequently everything that uses load, like links or args to CLI commands) will look for a .mlem file at path (or at path.mlem if path does not end with .mlem).
If you have a project:
Both save and load will look for a .mlem dir recursively up the tree (so if you save to /a/b/c/d and /a/b is a project, it is the same as saving to c/d with project=/a/b).
When saving to a project, by default objects will not be saved to path, but to <project_root>/.mlem/<object_type>/path instead. E.g. if you save a model as aaa/bbb in a curdir that is the project root, the actual files will be at ./.mlem/model/aaa/bbb and ./.mlem/model/aaa/bbb.mlem.
Having a project also enables indexing. There are different index implementations (more on that a bit later).
save also has 2 additional boolean options: external and index. external=True allows you to save objects to path instead of .mlem/<object_type>/path. index=False disables indexing for this object. By default external=False, index=True (the actual logic is a bit more complex, depending on whether a project exists). You can also change the defaults for those options in config.yaml.
Now to load(path). If path is under a project root, it will be "split" into the project and a relative path inside the project (let's call it relpath, so path=join(project, relpath)). You can also provide the project option to load; in that case relpath is just path.
load will look for both <root>/relpath and <root>/.mlem/<object_type>/relpath, so even objects saved with external=True can be accessed. Of course, if relpath already has .mlem/<object_type>, load will only look there. So you can use any combination of save(path, external=any) and load(path), even though the actual files might not be in path but under the .mlem dir instead.
One important feature of this logic is that if you load a link from a remote project and its path is not "absolute", mlem will treat it as relative to the same project. E.g. when you load https://github.com/a/b/link.mlem with path=c inside, mlem will load https://github.com/a/b/c.mlem and not the local ./c.mlem file.
Now to the index. An index can 1) add objects to itself and 2) list the objects inside. There are multiple implementations of it (and users can also provide their own), but the default one is LinkIndex. Listing is implemented by going through the .mlem directory and looking for all .mlem/<object_type>/*.mlem files. Indexing does nothing if the object is saved to .mlem/<object_type>, because it is "indexed" automatically. If external=True, index=True, a link object will be placed at <root>/.mlem/<object_type>/relpath.mlem that references the object in <root>/relpath.
Another index implementation is FileIndex, which just keeps a list of saved objects inside a .mlem/index.yaml file.
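
The path mapping described above can be sketched roughly as follows. This is an illustration only, not MLEM's actual code: storage_path and load_candidates are hypothetical names, paths are taken relative to the project root, and the .mlem metadata suffix and the index option are left out for brevity.

```python
from pathlib import PurePosixPath
from typing import List

MLEM_DIR = ".mlem"  # directory name taken from the description above


def storage_path(relpath: str, object_type: str, external: bool = False) -> PurePosixPath:
    """Where an object saved as `relpath` lands, relative to the project root."""
    rel = PurePosixPath(relpath)
    if external:
        # external=True: save directly at the requested path.
        return rel
    # Default: redirect under .mlem/<object_type>/.
    return PurePosixPath(MLEM_DIR) / object_type / rel


def load_candidates(relpath: str, object_type: str) -> List[PurePosixPath]:
    """Locations a load would check, in order, relative to the project root."""
    rel = PurePosixPath(relpath)
    if rel.parts[:2] == (MLEM_DIR, object_type):
        # The path already points inside .mlem/<object_type>: look only there.
        return [rel]
    return [rel, PurePosixPath(MLEM_DIR) / object_type / rel]


print(storage_path("aaa/bbb", "model"))                 # .mlem/model/aaa/bbb
print(storage_path("aaa/bbb", "model", external=True))  # aaa/bbb
print(load_candidates("aaa/bbb", "model"))
```

This is the "implicit path swapping" in question: the path passed to save and the path where the files end up differ unless external=True.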

What I changed in PR above:
<root>/.mlem/config.yaml is moved to <root>/.mlem.yaml. MLEM Project is now determined by having this file instead of .mlem directory.
There are no more external and index options. Objects are always saved directly in path without the funny .mlem/<object_type> business.
Since there is no more index, you cannot list objects in project.
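
A minimal sketch of project detection by marker file, assuming the lookup still walks up the directory tree as it did for the .mlem directory. The function name find_project_root is hypothetical and this is not the code from the PR; it only handles local paths, not arbitrary fsspec locations.

```python
import tempfile
from pathlib import Path
from typing import Optional

PROJECT_MARKER = ".mlem.yaml"  # file name taken from the description above


def find_project_root(start: Path, marker: str = PROJECT_MARKER) -> Optional[Path]:
    """Return the closest directory at or above `start` that contains `marker`."""
    current = start.resolve()
    for candidate in (current, *current.parents):
        if (candidate / marker).is_file():
            return candidate
    return None


# Demo: a/b is the project root, a/b/c/d is a nested working directory.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp, "a", "b")
    nested = project / "c" / "d"
    nested.mkdir(parents=True)
    (project / PROJECT_MARKER).touch()
    print(find_project_root(nested) == project.resolve())
```

Since the marker is an ordinary file, an empty .mlem.yaml is enough to mark a project, which matches the note that every config option has a default.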

Pros:

  • logic is a lot simpler, less code
  • no implicit path swapping

Cons:

  • no mlem ls
    • for users, there is no easy way to see what's in a project. This is really helpful when you explore other people's projects, or even to remind yourself what's in yours
    • for studio, we will have to implement repo scanning
  • no "enforced" project structure, which leads to a less clean repo if users do not structure it themselves. I should remind you that not only models and data are MLEM Objects, but also deployment declarations, for example (and some other stuff)

Alternative:
Don't get rid of .mlem dir, just make another abstraction that can augment path while saving or loading (eg adding .mlem/<object_type> prefix). Provide default implementation that does nothing. Kinda best of two worlds, but will take time to implement
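
One possible shape for such an abstraction, as a rough sketch. All class and function names here are hypothetical and nothing like this is confirmed to exist in MLEM; the point is only that the default implementation leaves paths untouched, while an opt-in one restores the .mlem/<object_type> layout.

```python
from pathlib import PurePosixPath
from typing import Optional


class PathAugmenter:
    """Default implementation: does nothing, objects go exactly where asked."""

    def augment(self, relpath: str, object_type: str) -> PurePosixPath:
        return PurePosixPath(relpath)


class MlemDirAugmenter(PathAugmenter):
    """Opt-in implementation that adds the .mlem/<object_type> prefix."""

    def augment(self, relpath: str, object_type: str) -> PurePosixPath:
        return PurePosixPath(".mlem") / object_type / relpath


def save_location(
    relpath: str, object_type: str, augmenter: Optional[PathAugmenter] = None
) -> PurePosixPath:
    """Compute the final location; save and load would share this hook."""
    return (augmenter or PathAugmenter()).augment(relpath, object_type)


print(save_location("aaa/bbb", "model"))                      # aaa/bbb
print(save_location("aaa/bbb", "model", MlemDirAugmenter()))  # .mlem/model/aaa/bbb
```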

@mike0sv
Contributor

mike0sv commented Sep 12, 2022

@shcheklein @dmpetrov @omesser @aguschin

@shcheklein
Member

Thanks @mike0sv! From reading this, it makes sense to simplify and drop it.

To clarify a few things:

no mlem ls

it's just slower, right? (and there are ways to speed it up even if we decide not to keep an explicit directory like .mlem).

that has .mlem directory.

when using Git this whole directory is saved into Git, right?

for studio, we will have to implement repo scanning

we can rely on GTO (or DVC if we move index there) and show only registered objects? (I agree though that the full list would be good to have). Full scan is probably not a big issue (we do this already in DVC). At least for now.

but also deployment declarations for example (and some other stuff)

that's a bit of a separate discussion, but maybe we can move deployments / envs into a single file (mlem.yaml)?


Pros/cons that come to my mind (not necessarily major, just everything off the top of my head):

  • having objects under the .mlem location will make workflows with GH and DVC a bit weird? We will have dependencies/outputs in a hidden directory. On GH we'll see and merge changes in a hidden directory.
  • hidden directories in general hold files that you don't interact with directly, or do so rarely. Model files (esp. metadata) are probably better off always being external.

will think more about this ⌛

@aguschin
Contributor Author

@shcheklein thanks for the feedback! As far as I know:

it's just slower, right? (and there are ways to speed it up even if we decide not to keep an explicit directory like .mlem).

AFAIK, yes, it's slower. And rn we release 0.3.0 without mlem ls (I doubt Mike can implement it before the release), but add it a few weeks later.

when using Git this whole directory is saved into Git, right?

Yes.

we can rely on GTO (or DVC if we move index there) and show only registered objects?

Yes, as an option.

that's a bit of a separate discussion, but maybe we can move deployments / envs into a single file (mlem.yaml)?

Do you mean moving ALL envs and ALL deployments to a single file (e.g. <root>/.mlem.yaml)? IMO it could be useful to have them in separate files, e.g. if you want to make changes to dev env/deployments, and don't want to mess up prod env/deployments. Or you want to update model-a deployment, but don't want to mess up with model-b.

@mike0sv
Contributor

mike0sv commented Sep 14, 2022

no mlem ls

I'd say if we don't have an index, we shouldn't act like we have one. We can implement repo scanning as an internal API used only by studio and not expose it to the CLI. lsing remote repos without an index can lead to breaking request limits in the case of GitHub. Alternatively, we can keep mlem ls and the index, using the FileIndex implementation, and still get rid of the .mlem dir. But that may contradict our plans regarding artifacts.yaml and dvc.yaml, whatever they are.

when using Git this whole directory is saved into Git, right?

yes, unless you store artifacts with dvc for example

for studio, we will have to implement repo scanning

This will mean that mlem works with studio only if you use it with dvc/gto, which is very undesirable IMO. Again, for studio we can easily implement an internal ls API.

move deployments / envs into a single file (mlem.yaml)

From a UX perspective, I guess it's just different preferences. We can squash a.yaml and b.yaml into a single file of the form a: <a.yaml contents>, b: <b.yaml contents>. Any statement like "separate files have more visibility" can be either a pro or a con, depending on what you are advocating for :)
From an implementation perspective, changing this will be VERY complex, so we should go with it only if we have very strong arguments for it, which I don't see yet.

@shcheklein
Member

Thanks, @mike0sv @aguschin for clarifications.

lsing remote repos without index can lead to breaking request limits in case of github

that's a bit sad, but I still feel fine releasing w/o ls, then adding it back and adding an index if needed. Justifying this specific index-like structure by ls seems premature to me tbh.

From implementation perspective, changing this will be VERY complex, so we should go with it only if we have very strong arguments for it, which I don't see yet

thanks, Mike, that's helpful to know. Let's def not touch it now.

The thought process I had in mind, though, is that for certain objects (deployments, envs) config files can be simpler and a bit more opinionated. It's not about making everything a single file. It goes more in the direction of making things similar to DVC remotes (you don't create a separate file for each one) or Git remotes. In our case envs could be defined like this, maybe deployments also.

@aguschin aguschin added the design MLEM design question and decisions label Sep 16, 2022
@aguschin
Contributor Author

closed by #395
