cases: list of ideas #2544

jorgeorpinel · 2021-06-08T18:10:11Z

1. Data Management

Data Versioning - https://dvc.org/doc/use-cases/versioning-data-and-model-files
"Organizing Team Datasets" per [spike] cases: address Sharing Data #3186
Data Registries - https://dvc.org/doc/use-cases/data-registries

2. Data Pipeline development

From #2544 (comment) below

model development may include data validation and preprocessing followed by model training and evaluation...
compose this as a dag where I can easily and efficiently run only the necessary stages
iteratively update data, add features, tune models, etc (overlaps with 2.)
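
To make the "run only the necessary stages" idea concrete, here is a minimal sketch in plain Python. It is not how DVC implements pipelines. The stage names, the `STAGES` graph, and the `fingerprint` and `plan` helpers are all hypothetical and only illustrate the concept: a stage is re-run when its own inputs or anything upstream of it has changed.

```python
import hashlib
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stage graph: each stage maps to the stages it depends on.
STAGES = {
    "validate": [],
    "preprocess": ["validate"],
    "train": ["preprocess"],
    "evaluate": ["train"],
}


def fingerprint(stage, inputs, upstream):
    """Hash a stage's own inputs together with its dependencies' fingerprints."""
    h = hashlib.sha256(inputs[stage].encode())
    for dep in STAGES[stage]:
        h.update(upstream[dep].encode())
    return h.hexdigest()


def plan(inputs, cache):
    """Return the stale stages in dependency order and refresh the cache."""
    current, to_run = {}, []
    for stage in TopologicalSorter(STAGES).static_order():
        current[stage] = fingerprint(stage, inputs, current)
        if cache.get(stage) != current[stage]:
            to_run.append(stage)
    cache.update(current)
    return to_run


# Illustrative inputs: strings standing in for data, code, or params.
inputs = {
    "validate": "schema-v1",
    "preprocess": "clean-v1",
    "train": "lr=0.1",
    "evaluate": "metrics-v1",
}
cache = {}
first = plan(inputs, cache)    # nothing cached yet, so every stage runs
second = plan(inputs, cache)   # nothing changed, so nothing runs
inputs["train"] = "lr=0.01"    # tune a hyperparameter
third = plan(inputs, cache)    # only train and its downstream stage run
print(first, second, third)
```

Changing only the `train` input leaves `validate` and `preprocess` untouched, which is the efficiency the use case would want to highlight.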

3. Experiment Management

From #2270 (comment)

Preliminary ideas:

~~Hyperspace exploration [Tuning/Optimization] ? May be too low level~~ There's a blog about this now.
Experiment ~~Bookkeeping~~ Tracking (with Git): Rapid iterations. UPDATE: cases: Data Science Experiment Tracking #2782
Visualizing Data Science (experiments + params, metrics/plots) + Viewer
Experiment execution and orchestration (exp+machine+CML?)

From cases: Data Science Experiment Tracking #2782 (review)

Here we should sell W&B, MLflow, etc - rapid iterations, live metrics + other metrics + navigation

4. Production environments/ MLOps

From #2490 (comment)

4.1 DVC in Production
Training remotely
Deploying models (CLI or API)
Keep pipelines, artifacts in sync between environments
Batch scoring a.k.a. "DVC for ETL" - see #2512 (comment)
+ Distributed/parallel computing

Good example of user perspective: https://discord.com/channels/485586884165107732/485596304961962003/872860674529845299

4.2 ML Model Registry
Model lifecycle (training, shadow, active, inactive)
Automated/Continuous training (remotely)
Discovery and reusability
Deploying models
Batch scoring example
+ Real-time inference
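
As a rough illustration of the model lifecycle item above, the four stages could be modeled as a small state machine. The stage names come from the list. The allowed transitions are an assumption made for this sketch and are not defined anywhere in this thread, and `Stage`, `ALLOWED`, and `transition` are hypothetical names.

```python
from enum import Enum


class Stage(Enum):
    TRAINING = "training"
    SHADOW = "shadow"
    ACTIVE = "active"
    INACTIVE = "inactive"


# Assumed transitions (not from the thread): a model is trained, shadows
# production traffic, becomes active, and can be retired from any stage.
ALLOWED = {
    Stage.TRAINING: {Stage.SHADOW, Stage.INACTIVE},
    Stage.SHADOW: {Stage.ACTIVE, Stage.INACTIVE},
    Stage.ACTIVE: {Stage.INACTIVE},
    Stage.INACTIVE: set(),
}


def transition(current, target):
    """Return the new stage, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


stage = Stage.TRAINING
stage = transition(stage, Stage.SHADOW)
stage = transition(stage, Stage.ACTIVE)
print(stage.value)
```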

4.3 Production Integrations
Databases (e.g. SQL dump versioning/preprocessing)
Spark (e.g. remote training)
Airflow (e.g. batch scoring)
Kafka (e.g. real-time predictions)

4.4 End-to-end scenario with a combination from above, e.g.:
Importing data from Spark
Training remotely
Model Registry Ops
Batch scoring (Airflow integration)


jorgeorpinel · 2021-08-06T04:17:02Z

Airflow (e.g. batch scoring) ...
End-to-end scenario

Cc @mnrozhkov I know you've worked quite a bit on this topic, so just pinging you here for visibility.

p.s. our docs use cases are not enterprise-level so far, rather high-level and short. If you'd be interested in drafting one around these topics using your existing material please lmk!

jorgeorpinel · 2021-08-06T04:32:27Z

Guys, I'm giving this priority again per our current roadmap (now that #2587 is basically finished). I think Experiment Management is the most needed topic now, along the lines of what @iesahin and I are working on (rel. #2548). But if anyone thinks another direction should have higher priority, please comment.

And if we agree on Exp Mgmt, what should be the spin? i.e. the user-perspective problem/solution and key concepts. I discussed briefly with @shcheklein and we think it could be centered around running and managing rapid iterations in DS projects (without Git overhead), with concepts like bookkeeping, hyperparameters, metrics, and visualization.

What do you think? Cc @dberenbaum @flippedcoder @jendefig @casperdcl @tapadipti @dmpetrov @pmrowla

iesahin · 2021-08-07T05:47:42Z

Bookkeeping + visualization seems the most relevant path to follow. Something along the lines of "push experiments to a central repository and see their comparative plots."

jorgeorpinel · 2021-12-07T20:42:28Z

Some ideas for 4 (re: production environments / MLOps)

path from development to production could be better... as a mode of operation I would favor a model where runs (e.g. artifacts, metrics, params, etc.) are pushed to production from a development environment. I am arguing for a model like git with remotes... where runs are captured locally first and then if confirmed a run can be pushed to a remote server. A model like this just keeps things more tidy... authentication could also be directly supported to make it easier to deploy for production...
For more production-oriented organizations ... for example production model monitoring

From https://megagon.ai/blog/whatmlflowsolvesanddoesntforus/
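
The "git with remotes" model quoted above could be sketched roughly like this. This is only an illustration of the idea, not an existing DVC or MLflow API: `Run`, `RunStore`, and `push` are hypothetical names, and the in-memory stores stand in for a local workspace and a remote server.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    run_id: str
    params: dict
    metrics: dict
    confirmed: bool = False  # set once someone decides the run is worth sharing


@dataclass
class RunStore:
    """Hypothetical in-memory store for a local workspace or a remote server."""
    runs: dict = field(default_factory=dict)

    def capture(self, run):
        self.runs[run.run_id] = run


def push(local, remote, run_id):
    """Copy a run to the remote only if it has been confirmed locally."""
    run = local.runs[run_id]
    if not run.confirmed:
        raise ValueError(f"run {run_id!r} is not confirmed; refusing to push")
    remote.capture(run)


local, remote = RunStore(), RunStore()

# Every run is captured locally first.
local.capture(Run("exp-1", {"lr": 0.1}, {"acc": 0.81}))
local.capture(Run("exp-2", {"lr": 0.01}, {"acc": 0.86}))

# Only the confirmed run reaches the remote, which keeps it tidy.
local.runs["exp-2"].confirmed = True
push(local, remote, "exp-2")
print(sorted(remote.runs))
```

Authentication, which the quote also mentions, is left out of this sketch.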

jorgeorpinel · 2022-01-04T01:55:30Z

Interesting diagram inspiration for 1.3 or 1.4

From https://medium.com/google-cloud/migrate-kedro-pipeline-on-vertex-ai-fa3f2c6f7aad

dberenbaum · 2022-04-28T15:02:11Z

4.3 Production Integrations
Databases (e.g. SQL dump versioning/preprocessing)
Spark (e.g. remote training)
Airflow (e.g. batch scoring)
Kafka (e.g. real-time predictions)

Feast/feature stores: https://discord.com/channels/485586884165107732/563406153334128681/969249645073145896

jorgeorpinel added the "A: docs" (Area: user documentation, gatsby-theme-iterative) and "✨ epic" (Placeholder ticket for multi-sprint direction, use story, improvement) labels on Jun 8, 2021


jorgeorpinel mentioned this issue Sep 1, 2021

cases: Data Science Experiment Tracking #2782

Merged

4 tasks

shcheklein added the "p1-important" (Active priorities to deal within next sprints) label on Sep 28, 2021

iesahin added the "C: cases" (Content of /doc/use-cases) label on Oct 21, 2021

jorgeorpinel added the "type: enhancement" (Something is not clear, small updates, improvement suggestions) label and removed the "p1-important" label on Jan 14, 2022

jorgeorpinel mentioned this issue Feb 11, 2022

[spike] cases: address Sharing Data #3186

Closed

jorgeorpinel mentioned this issue Mar 3, 2022

cases: Model Registries #3333

Merged

dberenbaum closed this as completed Oct 16, 2023
