[WIP] Experiment tracking proposal #195
Conversation
This PR is supposed to start a proper design discussion regarding experiment tracking. Please feel free to review, comment, or commit new changes.
Thanks for doing this. Can you provide more information about the alternative solutions? What does Katib need?
I'll slowly do research, but I could use help if anyone has experience with any of the alternatives. Also, let me know if you know of any alternatives that are missing. I'll try to dig more into each architecture, but at the end of the day experience is priceless. /cc @holdenk - Holden, maybe you could help us with MLFlow? Or point us to someone who knows more about this project? As for Katib, I'm not sure what Katib really needs aside from experiment tracking. @YujiOshima could you help us figure out what requirements Katib has for it and whether ModelDB is enough or we need something more?
@inc0 @jlewi In Katib, the long-term DB for experiment tracking is completely separate from the Katib DB.
The above requests need to be supported by both the GUI and the API. There are many choices (MLFlow, StudioML, ...) and the best choice depends on the user.
@inc0 In this proposal, do you plan to build a new tool for model management / experiment tracking?
@YujiOshima Why do we need to build a new tool? Instead, isn't it better to improve existing tools like ModelDB or others?
@johnugeorge If an existing tool is enough, I agree.
@YujiOshima I agree. I feel that we first have to list the missing features/requirements in the current tools and then decide whether to support an existing one or implement a new tool.
That's what I tried to do in this PR. I have issues with ModelDB being based on Mongo, which isn't the easiest thing to maintain. Also, needing the same information (the list of experiments) in 2 separate databases (Katib and ModelDB) is very problematic. We should create something with an API and let Katib use it as the source of truth.
* Tensorboard - Wasn't meant for a large number of models. It's better for very detailed examination of a smaller number of models. Uses tf.Event files
* MLFlow - One of the big cons is its use of files as storage for models. That would require something like Dask or Spark to query them efficiently. Can store files in multiple backends (S3 and GCS among other things)
* ModelDB - Requires MongoDB, which is problematic
Not only Mongo but also SQLite. And it resets the SQLite database at the beginning of each process.
We can't persist data without modification.
Right, so it's no good for persistent experiment tracking, which we're after
How about first designing a better ModelDB equivalent and then using that for tracking experiments? I would recommend we keep each of these very independent for now, so that Kubeflow components/apps can integrate with a wide variety of tools, e.g. TFX, Katib, AutoML.
The reason I think our current model is flawed is that we have 2 sources of truth. Katib uses SQL, ModelDB uses MongoDB or SQLite. Every time you want ModelDB, it syncs data from Katib's DB. That means if you do a sync with tens of thousands of models, it's going to lock the whole system. I think we should build a single source of truth for where models are and how they performed, and Katib should use it. This would negate the need for Katib's database altogether and, therefore, make it much easier to handle. In another issue we've discussed Katib as a model management tool, but we decided that Katib's scope is hyperparameter tuning, and model management is something different (however required).
Hi team, I'm the PM on the MLflow team at Databricks. Some of the engineers will chime in here too. Adding a database-backed tracking store to the tracking server is on our roadmap, and there is already a pluggable API!
## Alternatives
* Tensorboard - Wasn't meant for a large number of models. It's better for very detailed examination of a smaller number of models. Uses tf.Event files
* MLFlow - One of the big cons is its use of files as storage for models. That would require something like Dask or Spark to query them efficiently. Can store files in multiple backends (S3 and GCS among other things)
Does it really need Dask or Spark? It has a REST API.
Well, if you try to query 50,000 records from one file (and by query I mean "highest value of X"), it's going to require something more...
Although MLFlow does store files on disk, it would save some time if folks looked at forking it and then integrating a database to store the tracking information.
TensorBoard is very useful, but it is not suitable for general experiment tracking.
That's the idea @YujiOshima :) I was thinking of something like a button: "spawn Tensorboard from these 3 models".
Also, Tensorboard will have support for PyTorch, which is super cool. We would still need something for scikit-learn, but it's getting better!
I think we should have clear requirements for independent model tracking (a ModelDB equivalent) and experiment tracking that can then be leveraged by Katib, AutoML, PyTorch, and TFX integration, and then define the API. We are also very interested in contributing to model management, but would like to take it slowly - get a straw man working (like @jlewi mentioned) and validate the requirements to ensure it works well with different tools. Otherwise we will have to do a lot more work down the road. Could we please form a small sub-team to do this?
@jlewi MLflow does provide a Python API that can be plugged into a Jupyter notebook; later, when a DS wants to track a parameter or a metric, they can view it on a dashboard. Another thing in terms of design is that you can compare multiple runs side by side. It also has a REST API. When it comes to model deployments, it integrates with SageMaker, Azure ML, and regular model serving.
@zmhassan with MLFlow, let's hypothetically assume we have 50,000 models for detecting cats. Is there an easy way to select the model with the highest accuracy and spawn Seldon (or tf-serving) from it? A quick look at the API suggests MLFlow doesn't have any form of querying. I also don't see a whole lot of model provenance there, but that could probably be implemented. Also, how easy would it be to integrate it with TFJob? As in: start a TFJob from this run, retrain model X, etc. Another thing is integration with Tensorboard. MLFlow seems to be an alternative to Tensorboard, and Tensorboard, for what it is (a UI for examining model performance), is excellent (imho). Any chance we could keep using it?
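To make the querying concern concrete, here is a rough sketch (hypothetical helper code, not an official MLflow API) of what "find the best run by a metric" looks like when run metadata lives in files. It assumes MLflow's default file layout of `mlruns/<experiment>/<run>/metrics/<name>`, where each line holds at least a timestamp and a value; every metric file has to be opened and compared in Python, which is exactly the scalability worry raised above.

```python
from pathlib import Path


def best_run_by_metric(mlruns_dir: str, experiment_id: str, metric: str):
    """Scan every run's metric file and return (run_id, best_value)."""
    best_run, best_value = None, float("-inf")
    # e.g. mlruns/0/<run_id>/metrics/accuracy -- layout assumed, see lead-in
    for metric_file in Path(mlruns_dir, experiment_id).glob(f"*/metrics/{metric}"):
        values = [
            float(line.split()[1])  # assumes "<timestamp> <value>" per line
            for line in metric_file.read_text().splitlines()
            if line.strip()
        ]
        if values and max(values) > best_value:
            best_value = max(values)
            best_run = metric_file.parent.parent.name
    return best_run, best_value


# Hypothetical usage: best_run_by_metric("mlruns", "0", "accuracy")
```

With 50,000 runs this scan is linear in the number of files, which is the argument for either a database-backed store or an external query engine.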
Adding more detail around experiment tracking.
Proposal for experiment tracking feature
The Katib page has a screencast illustrating ModelDB. As a strawman this looks like it has most of the features we want
I realize there are concerns about MongoDB, but for a strawman, running it in a container with a PVC seems fine. Have folks checked out the demo? It's pretty slick and it looks like it provides most of what we want. I'd be thrilled if we managed to get that working and part of the 0.4 release. My conjecture is that if we get some more first-hand experience with ModelDB, we'll be in a better position to figure out where to go from here.
Are we going to track all experiments? Last time I talked to data scientists, it's not very useful for them to track all experiments - much like software debugging, where we change code and experiment without using git commit; we check in code only when we feel comfortable about the change. ModelDB wraps existing libraries (sklearn, sparkml) to sync model data, e.g. users are required to use the sync version instead of stock methods, which I think can be fragile to library changes and requires extra work to support more frameworks. Also, IIRC, syncing model data takes considerable time.
This document is a design proposal for a new service within Kubeflow - experiment tracking. The need for a tool like this was expressed in multiple issues and discussions.

## What is experiment tracking
I think we should focus on experiment tracking. This is different from monitoring your production models, like gathering metrics about model drift or accuracy in the production env.
* I'm a data scientist working on a problem. I'm looking for an easy way to compare multiple training jobs with multiple sets of hyperparameters. I would like to be able to select the top 5 jobs measured by P1 score and examine which model architecture, hyperparameters, dataset, and initial state contributed to this score. I would want to compare these 5 together in a highly detailed way (for example via Tensorboard). I would like a rich UI to navigate models without needing to interact with infrastructure.
* I'm part of a big ML team in a company. Our whole team works on a single problem (for example search) and every person builds their own models. I'd like to be able to compare my models with others'. I want to be sure that nobody will accidentally delete the model I'm working on.
* I'm a cloud operator in an ML team. I would like to take the current production model (architecture + hyperparams + training state) and retrain it with new data as it becomes available. I would want to run a suite of tests and determine if the new model performs better. If it does, I'd like to spawn a tf-serving (or Seldon) cluster and perform a rolling upgrade to the new model.
* I'm part of a highly sophisticated ML team. I'd like to automate retraining -> testing -> rollout for models so they can be upgraded nightly without supervision.
This e.g. is not part of experiment tracking imho. It's about model management and model monitoring.
Is there a good/common term to describe this operational side of models? Model management, model operations?
Model management is an alternative term for experiment tracking, I think - at least I've understood it as such. As for functionality, because we'll make it k8s-native, the cost of adding this feature will be so low that I think we should do it just for the users' benefit. Ongoing monitoring of models isn't in scope, but as long as the monitoring agent saves observed metrics (say, avg accuracy over the last X days) back to this service, you can still benefit from it.
I think the notions of "model management" and "experiment tracking" are slightly different. "Management" has a production connotation and "experiment" has a devel connotation. Did @jlewi in this comment thread get to a common definition? This mlflow issue also has a discussion around the use cases of the various tools out there. And a Google search for "experiment tracking" ai ml vs "model management" ai ml gives 500 vs 75k results.
Please don't get me wrong. I'm all for having a solution for this, because I too think this is a missing component of Kubeflow.
I'd just limit the scope to the devel side of the house and let Pachyderm and Seldon focus on the production side.
One clarification - when I say, for example, model rollout, what I mean is a single call to k8s to spawn a Seldon cluster. Actual serving, monitoring, etc. is beyond scope, I agree, but I think it'd be a nice touch to allow a one-click mechanism. For Pachyderm integration, look lower; I actually wanted to keep the pipeline uuid in the database. If someone uses Pachyderm, we'll integrate with it and allow quick navigation, for example a one-click link to the relevant Pachyderm UI.
* Feature engineering pipeline used
* Katib study id
* Model architecture (code used)
* Hyperparameters
If the code to create the model is in a VCS, e.g. git, it should also track the version of the code used to create the model
Agreed, that's what I meant by "model architecture". But a good idea would be to make it point to:
- code (including commit id)
- docker image
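Purely for illustration, a tracked-model record combining the fields discussed in this thread might look like the sketch below. The field names, IDs, and URLs are invented and do not represent an agreed schema.

```python
# Hypothetical example of one experiment-tracking record (not a proposed schema).
experiment_record = {
    "model_id": "cat-detector-0042",
    "katib_study_id": "study-abc123",           # hyperparameter study reference
    "feature_pipeline_uuid": "pachyderm-7f3a",  # e.g. a Pachyderm pipeline id
    "code": {
        "repo": "https://github.com/example/cat-detector",  # hypothetical repo
        "commit": "9f1c2ab",                                 # version used for training
        "docker_image": "registry.example.com/cat-detector:9f1c2ab",
    },
    "hyperparameters": {"learning_rate": 0.001, "batch_size": 64},
    "metrics": {"accuracy": 0.93, "p1_score": 0.88},
    "artifacts": {"saved_model": "s3://example-models/cat-detector/0042/"},
}
```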
For selected models we should be able to set up model introspection tools, like Tensorboard. Tensorboard provides good utility and allows comparison of a few models, and it was recently announced that it will integrate with PyTorch. I think it's reasonable to use Tensorboard for this problem and allow easy spawning of a Tensorboard instance for selected models. We might need to find an alternative for scikit-learn. Perhaps we can try MLflow for scikit-learn.
There is also http://tensorboardx.readthedocs.io which can create a Tensorboard from any Python code.
We've started working with MLflow because it has a nice web UI and is easy to use with its Python framework.
I don't know if Tensorboard with tensorboardX has some benefits, though.
I think it does! You could use it to add Tensorboard to scikit-learn (just log accuracy after every batch of training).
MLflow, Tensorboard, and tensorboardX are complementary. I have worked with some MLflow users who use them together. Check out this example in the MLflow repository of using MLflow with PyTorch and tensorboardX: https://github.com/mlflow/mlflow/blob/master/examples/pytorch/mnist_tensorboard_artifact.py
From the doc at the top of that code example:
Trains an MNIST digit recognizer using PyTorch, and uses tensorboardX to log training metrics and weights in TensorBoard event format to the MLflow run's artifact directory. This stores the TensorBoard events in MLflow for later access using the TensorBoard command line tool.
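A heavily condensed sketch of that pattern (not the linked example itself) is shown below: scalar metrics go to MLflow, the same values are written as TensorBoard event files via tensorboardX, and the event directory is attached to the run as artifacts so it can be reopened later with `tensorboard --logdir`. The training loop and loss values here are stand-ins.

```python
import tempfile

import mlflow
from tensorboardX import SummaryWriter

with mlflow.start_run():
    event_dir = tempfile.mkdtemp()             # scratch dir for TensorBoard event files
    writer = SummaryWriter(event_dir)
    for step in range(100):
        loss = 1.0 / (step + 1)                # placeholder for a real training loss
        mlflow.log_metric("loss", loss)        # tracked by MLflow
        writer.add_scalar("loss", loss, step)  # tracked by TensorBoard
    writer.close()
    # Store the TensorBoard events alongside the run for later inspection.
    mlflow.log_artifacts(event_dir, artifact_path="events")
```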
* I'm part of the ML team in a company. I would like to be able to track the parameters used to train an ML job and track the metrics produced. I would like to have isolation so I can find the experiments I worked on. I would like the ability to compare multiple training runs side by side and pick the best model. Once I select the best model, I would like to deploy that model to production. I may also want to test my model with historical data to see how well it performs, and maybe roll out my experiment to a subset of users before fully rolling out to all users.
## Scale considerations |
The straightforward approach would be to use Kubernetes jobs for this. Let Kubernetes handle the orchestration and GC. Each job would be configured with env variables.
We need more than just a number of replicas. It's an important thing to consider when selecting the underlying database.
Oh, I'm thinking an experiment would be a Kubernetes primitive, like a job - no replicas involved. The job will be scheduled by Kubernetes. So if you run 100 or 1000 experiments, you just create them and let Kubernetes handle the scaling, i.e. the scheduling.
I'm not sure if that's what you mean, but we've discussed using CRDs as experiments and decided against it. The sheer number of experiments involved and the lack of querying is a problem; we still need a database somewhere. As for running the actual experiments, yeah, they will be TFJobs, so regular pods.
## Alternatives
* Tensorboard - Wasn't meant for a large number of models. It's better for very detailed examination of a smaller number of models. Uses tf.Event files
* MLFlow - One of the cons is that experiment metadata is stored on disk, which in Kubernetes may require persistent volumes. A better approach would be to store metadata in a database. Pros: can store models in multiple backends (S3, Azure Cloud Storage and GCS among other things)
I would not care whether MLflow stores it in a DB or in a file. If we use MLflow, we should use their REST API as the interface and let them handle persistence. And for a DB you'd also need a PV, so 🤷♂️
We have started a repo to make MLflow run on OpenShift: https://github.com/AICoE/experiment-tracking
A managed database (like Bigtable) doesn't need a PV (or at least you don't care about it). The biggest issue with the MLFlow API (which is directly tied to file storage) is the lack of querying. Currently (unless I'm mistaken) there is no good way in MLFlow to ask for the model with the highest accuracy. It could be implemented, but then comparisons would be done in Python, so not super scalable.
> Managed database (like big table)

Wouldn't this introduce a dependency that Kubeflow wants to avoid?

> Biggest issue for MLFlow API (which is directly tied to file storage) is lack of querying

Actually there is a REST API for that. But I haven't used it and I'm not sure how well it scales.
I've noted it briefly in #188, with some consequences and how to make it manageable for operators (imho).
As for the search string, it's really not much. I still can't see an option for "get me the best model" without using Spark/Dask.
I think we can track all experiments; that's why we've been putting emphasis on scale. I expect the great majority of models to never see production. I'd say: commit to branch -> train -> save to exp_track -> look it up in Tensorboard, compare with the current best model, etc. -> decide what to do, move to prod or ignore forever.
ML Hyperparameter Tuning: I think we should make the proposal more focused on the use case here. The purpose of tracking experiments is that we want to perform hyperparameter tuning. Unrelated items in the proposal:
Conclusion: I think if we keep it simple we can get something working (an MVP). The main motivation why I suggested MLFlow was that it supports multiple ML frameworks, not just TensorFlow. To have wider adoption, I suggest we cast our net wider and allow for more ML frameworks to be utilized.
Some thoughts. At Sentient (StudioML), our focus has been on enabling our evolutionary NN experiments. As a previous contributor has mentioned, workflows concentrating on single experiments (or individuals) have scale friction. That, however, should not pose an issue if all metadata placed by experiments on traditional storage is machine readable. In some instances we are specifically interested in looking at individual experiments, for example during development or when beginning work in a specific domain. In other situations we can ingest experiment data into our own tooling and investigate at much higher levels of granularity.

Couple this with having idempotent data and a small set of rules about incremental results, and the storage can become the system of record, rather than a DB. It also helps with the who-owns-what issue by deferring that to the storage platform. Defining an entity model, relationships, and a forced architecture might reduce deployment freedom. One alternative we have chosen, for example, is to concentrate on a portable data format as the primary means of defining our interfaces. Each technology we then use for UI, reporting, project authoring, etc. aligns around the formats.

For our own purposes we are using JSON and S3 artifacts to act as our system of record. Queries on the S3 platform, for example, can be done using DSLs from cloud providers, or tooling can be used to migrate metadata to other DBs, etc. This defers technology compromises to the consumer. For us that means we can avoid the Mongo loose-schema and lost-data problems, or going the Spark big-data route. It also avoids the issue of who is the responsible party for the cost of non-artifact storage and queries: the user pays rather than the experimenter. Anyway, that's my 2 cents.
Thanks @karlmutch
I don't think we want to limit ourselves to hyperparameter tuning. I think a very common use case is just training a model in a notebook, saving that model somewhere, and wanting a simple way to track/browse all such models.
@jlewi Good point. Definitely good to capture all use cases. We experimented with MLflow; you can import the Python library into a Jupyter notebook and track parameters/metrics. @karlmutch Definitely interested in reading up on StudioML.
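For reference, a minimal sketch of that notebook workflow might look like the following; the parameter names and values are invented for illustration, and the calls are just the standard MLflow logging functions.

```python
import mlflow

# One tracked run per training attempt in the notebook.
with mlflow.start_run(run_name="notebook-experiment"):
    mlflow.log_param("learning_rate", 0.01)  # hypothetical hyperparameters
    mlflow.log_param("n_estimators", 200)
    # ... train the model here ...
    mlflow.log_metric("accuracy", 0.93)      # hypothetical results
    mlflow.log_metric("p1_score", 0.88)
```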
One issue with this approach (unless it's coupled with a database, which I propose) is that it would be extremely hard to find a model based on performance. I'm talking about a query like "show me the top 5 models with the highest P1 score for experiment cat-or-dog". Can you deal with this kind of query too?
In cases where experimenters are using cloud-based storage, the cloud vendors typically offer query engines that support this use case. AWS Athena and, from memory, Google Cloud Datastore will both do this; I'm not sure from memory about Azure, as we use our own tech on their stack. In cases where a tool like Minio is being used, or our customers/experimenters wish to use their own query engine, the JSON data is ingested into a DB. Things remain coherent between the store and the DB because we follow idempotency and simple rules around experiments that are in progress when the ingest occurs. From the Studio perspective, however, the choices in this area are not mandated by the experimentation framework.
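To illustrate the "let a cloud query engine answer it" approach, here is a hypothetical sketch using AWS Athena over JSON experiment metadata stored in S3. It assumes a table named `experiments` has already been defined over the bucket; the database name, bucket, and column names are all invented, and error handling is omitted.

```python
import time

import boto3

athena = boto3.client("athena")

# "Top 5 models by p1_score" expressed as plain SQL against the metadata table.
query = """
    SELECT run_id, metrics.p1_score
    FROM experiments
    WHERE experiment = 'cat_or_dog'
    ORDER BY metrics.p1_score DESC
    LIMIT 5
"""

submitted = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "ml_tracking"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
qid = submitted["QueryExecutionId"]

# Poll until the query leaves the queued/running states.
while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(1)

rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
```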
This is an awesome discussion. I'm a PM at Databricks, I work full time on MLflow, and I'm a huge fan of KubeFlow; it's been on our roadmap to engage about exactly this topic. In addition to what @zmhassan has been sharing, if there is any way we can be helpful, I know @mateiz @mparkhe @pogil @aarondav and others in the MLflow community would be excited to help answer any questions, discuss architecture, etc. In particular, I think @mparkhe -- an MLflow core dev -- has been thinking about this and has some more details.
As @andyk mentioned above, the MLflow team has been thinking about various engagement efforts and how we can support those within the MLflow architecture. We'd be excited to have MLflow be a component of the Kubeflow ecosystem. To that effect, we would love to understand what would be required to make this happen. Here are some details about specific projects on our roadmap that would help answer questions about MLflow.

Scalable backend store for Tracking Server data: The current storage layer for experiments, runs, and other data is files. However, we are planning to add support for other scalable storage. The query layer will be built using a generic layer like SQLAlchemy that would support most relational databases. There have been some requests for support of KV stores; however, the search query pattern supported by the MLflow APIs is most suited for relational databases out of the box, without the need for additional modeling of tables for each different type of key-value store. As a side note, MLflow supports the ability to store large file artifacts like models, training and test data, images, plots, etc. in several implementations of artifact stores: AWS S3, Azure Blob Store, GCP, etc. The query pattern for these APIs is to access the actual artifact and is purely dependent on these object store implementations.

APIs and Interfaces: One of the design principles for all 3 MLflow components was API first. We designed these interfaces first and then built various implementations to plug into them. The current open source has implementations for FileStore and RestStore, which implement these interfaces over the storage layer. Even with FileStore, most of the APIs are efficient, since they index into a specific experiment/run folder. With many experiments and runs, the search API can get slow, since FileStore will access the underlying data and the search functionality is then implemented in Python. With the above-mentioned change to have a SQLAlchemy layer, these queries would be rewritten to be pushed down to the appropriate backend query engine. This would enable plugging in any database backend store that can be queried through SQLAlchemy, and we expect this solution to scale. For production use cases, using a relational DB like MySQL would work. We have stress tested realistic production workloads (1000s of experiments, ~100K runs, millions of metrics, etc.) with the MLflow API pointed at a MySQL backend (local machine and RDS) and found it to be performant. For instance, an indexed MySQL table for metrics returned desired results in single-digit milliseconds. We believe that such an implementation would be suitable for most production workloads.

MLflow UI: The current UI implementation supports almost all APIs, including search and viewing artifacts. We are working on releasing the next version with more feature coverage, like CRUD operations on MLflow entities (e.g. deleting and renaming experiments) and easy visibility into multi-step workflows. Many UI components, like the graphical view for run metrics, have been contributed by non-Databricks community members. The MLflow team at Databricks will continue to support and add more functionality here. We look forward to contributions to MLflow and also to collaborating with the Kubeflow ecosystem towards this common goal.
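To illustrate why pushing search down to a relational backend scales, here is a hedged sketch of a SQLAlchemy-backed metrics table. This is not MLflow's actual schema (a real table would also carry timestamps and steps), but it shows how "top N runs by metric" becomes a single indexed query that the database executes instead of a Python-side scan.

```python
from sqlalchemy import Column, Float, String, create_engine, desc
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class Metric(Base):
    """Simplified metrics table: one latest value per (run, metric key)."""
    __tablename__ = "metrics"
    run_id = Column(String, primary_key=True)
    key = Column(String, primary_key=True)
    value = Column(Float)


engine = create_engine("sqlite:///tracking.db")  # stand-in for MySQL/RDS in production
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# "Top 5 runs by accuracy" pushed down to the database as ORDER BY ... LIMIT.
top_runs = (
    session.query(Metric.run_id, Metric.value)
    .filter(Metric.key == "accuracy")
    .order_by(desc(Metric.value))
    .limit(5)
    .all()
)
```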
+1 I've been experimenting with porting MLflow over to run on Kubernetes with CRDs and operators. Once that is done and set up to work with ksonnet, it will be a simple ksonnet installation.
Hey all, this is Manasi from ModelDB. It's pretty clear that multiple groups are working towards the same goal of a general-purpose model management system. Can the KubeFlow community think about defining a generic interface for model management? (or requirements for a model management system to be compatible with KubeFlow?) That way there can be multiple implementations of KubeFlow-compatible model management systems and users can pick the one that works best for them. I imagine this would be similar to having multiple model serving implementations for KubeFlow.
Closing this because it's stale.
What is the current status of this proposal/of experiment tracking in Kubeflow? Is there another proposal which supersedes this one, or is progress tracked somewhere else?
Relevant conversations:
kubeflow/kubeflow#264
kubeflow/kubeflow#136