cases: list of ideas #2544

Closed
3 of 5 tasks
jorgeorpinel opened this issue Jun 8, 2021 · 8 comments
Labels
A: docs Area: user documentation (gatsby-theme-iterative) C: cases Content of /doc/use-cases ✨ epic Placeholder ticket for multi-sprint direction, use story, improvement type: enhancement Something is not clear, small updates, improvement suggestions

Comments

@jorgeorpinel
Contributor

jorgeorpinel commented Jun 8, 2021

1. Data Management

2. Data Pipeline development

From #2544 (comment) below

model development may include data validation and preprocessing followed by model training and evaluation...
compose this as a DAG where I can easily and efficiently run only the necessary stages
iteratively update data, add features, tune models, etc. (overlaps with 2.)
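
The "run only the necessary stages" idea quoted above can be illustrated with a small sketch. This is not how DVC implements it: the `Stage` class, the `stale_stages` function, and the stage names are hypothetical, and the set of changed stages is assumed to be known already (for example from comparing input checksums).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Stage:
    name: str
    deps: list[str] = field(default_factory=list)  # names of upstream stages


def stale_stages(stages, changed):
    """Return the stages that need to rerun, in execution order.

    A stage is stale if it is listed in `changed` or if it depends,
    directly or transitively, on a stale stage.
    """
    graph = {s.name: set(s.deps) for s in stages}
    stale = set()
    order = []
    # static_order() yields each stage only after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        if name in changed or graph.get(name, set()) & stale:
            stale.add(name)
            order.append(name)
    return order


# Hypothetical pipeline mirroring the stages mentioned in the quote.
pipeline = [
    Stage("validate"),
    Stage("preprocess", deps=["validate"]),
    Stage("featurize", deps=["preprocess"]),
    Stage("train", deps=["featurize"]),
    Stage("evaluate", deps=["train"]),
]
```

With this sketch, `stale_stages(pipeline, {"featurize"})` skips `validate` and `preprocess` and reruns only the stages from `featurize` onward.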

3. Experiment Management

From #2270 (comment)

Preliminary ideas:

Here we should sell W&B, MLflow, etc. - rapid iterations, live metrics + other metrics + navigation

4. Production environments / MLOps

From #2490 (comment)

4.1 DVC in Production
Training remotely
Deploying models (CLI or API)
Keep pipelines, artifacts in sync between environments
Batch scoring a.k.a. "DVC for ETL" - see #2512 (comment)
+ Distributed/parallel computing

Good example of user perspective: https://discord.com/channels/485586884165107732/485596304961962003/872860674529845299

4.2 ML Model Registry
Model lifecycle (training, shadow, active, inactive)
Automated/Continuous training (remotely)
Discovery and reusability
Deploying models
Batch scoring example
+ Real-time inference
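
As a rough illustration of the model lifecycle item above, the four states could be modeled as a small state machine. Only the state names come from the list; the `ModelState` enum, the `ALLOWED` transition table, and the `transition` helper are hypothetical, and a real registry may well allow other transitions.

```python
from enum import Enum


class ModelState(Enum):
    TRAINING = "training"
    SHADOW = "shadow"
    ACTIVE = "active"
    INACTIVE = "inactive"


# Assumed transition rules, for illustration only.
ALLOWED = {
    ModelState.TRAINING: {ModelState.SHADOW, ModelState.INACTIVE},
    ModelState.SHADOW: {ModelState.ACTIVE, ModelState.INACTIVE},
    ModelState.ACTIVE: {ModelState.INACTIVE},
    ModelState.INACTIVE: set(),
}


def transition(current, target):
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Keeping the rules in one table would make it easy to document and review which promotions a registry permits.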

4.3 Production Integrations
Databases (e.g. SQL dump versioning/preprocessing)
Spark (e.g. remote training)
Airflow (e.g. batch scoring)
Kafka (e.g. real-time predictions)

4.4 End-to-end scenario with a combination from above, e.g.:
Importing data from Spark
Training remotely
Model Registry Ops
Batch scoring (Airflow integration)

@jorgeorpinel jorgeorpinel added A: docs Area: user documentation (gatsby-theme-iterative) ✨ epic Placeholder ticket for multi-sprint direction, use story, improvement labels Jun 8, 2021
@jorgeorpinel
Contributor Author

jorgeorpinel commented Aug 6, 2021

Airflow (e.g. batch scoring) ...
End-to-end scenario

Cc @mnrozhkov I know you've worked quite a bit on this topic, so just pinging you here for visibility.

P.S. Our docs use cases are not enterprise-level so far; they are rather high-level and short. If you'd be interested in drafting one around these topics using your existing material, please lmk!

@jorgeorpinel
Contributor Author

jorgeorpinel commented Aug 6, 2021

Guys, I'm giving this priority again per our current roadmap (now that #2587 is basically finished). I think Experiment Management is the most needed topic now, along the lines of what @iesahin and I are working on (rel. #2548). But if anyone thinks another direction should have higher priority, please comment.

And if we agree on Exp Mgmt, what should be the spin, i.e. the user-perspective problem/solution and key concepts? I discussed briefly with @shcheklein and we think it could be centered around running and managing rapid iterations in DS projects (without Git overhead), with the concepts of bookkeeping, hyperparameters, metrics, and visualization.

What do you think? Cc @dberenbaum @flippedcoder @jendefig @casperdcl @tapadipti @dmpetrov @pmrowla

@iesahin
Contributor

iesahin commented Aug 7, 2021

Bookkeeping + visualization seems the most relevant path to follow. Something along the lines of "push experiments to a central repository and see their comparative plots."

@shcheklein shcheklein added the p1-important Active priorities to deal within next sprints label Sep 28, 2021
@iesahin iesahin added the C: cases Content of /doc/use-cases label Oct 21, 2021
@jorgeorpinel
Contributor Author

jorgeorpinel commented Dec 7, 2021

Some ideas for 4 (re: production environments / MLOps)

path from development to production could be better... as a mode of operation I would favor a model where runs (e.g. artifacts, metrics, params, etc.) are pushed to production from a development environment. I am arguing for a model like git with remotes... where runs are captured locally first and then if confirmed a run can be pushed to a remote server. A model like this just keeps things more tidy... authentication could also be directly supported to make it easier to deploy for production...
For more production-oriented organizations ... for example production model monitoring

From https://megagon.ai/blog/whatmlflowsolvesanddoesntforus/
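
The "git with remotes" model described in that quote, where runs are captured locally and only confirmed ones are pushed, might look roughly like the sketch below. Everything here is hypothetical (`Run`, `RunStore`, `push`, and the `confirmed` flag); it does not reflect an existing DVC or MLflow API, and authentication is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    run_id: str
    params: dict
    metrics: dict
    confirmed: bool = False  # set to True once the run is approved for production


@dataclass
class RunStore:
    """In-memory stand-in for a local or remote run store."""
    runs: dict = field(default_factory=dict)

    def add(self, run):
        self.runs[run.run_id] = run


def push(local, remote):
    """Copy confirmed runs that the remote lacks; return the pushed run IDs."""
    pushed = []
    for run_id, run in local.runs.items():
        if run.confirmed and run_id not in remote.runs:
            remote.add(run)
            pushed.append(run_id)
    return pushed
```

Unconfirmed runs stay in the local store, which is the "tidy" property the quote argues for: the remote only ever sees runs someone has signed off on.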

@jorgeorpinel
Contributor Author

Interesting diagram inspiration for 1.3 or 1.4

image

From https://medium.com/google-cloud/migrate-kedro-pipeline-on-vertex-ai-fa3f2c6f7aad

@jorgeorpinel jorgeorpinel added type: enhancement Something is not clear, small updates, improvement suggestions and removed p1-important Active priorities to deal within next sprints labels Jan 14, 2022
@dberenbaum
Contributor

4.3 Production Integrations
Databases (e.g. SQL dump versioning/preprocessing)
Spark (e.g. remote training)
Airflow (e.g. batch scoring)
Kafka (e.g. real-time predictions)


4 participants