Long-term goal: rogue workshop #15
This sounds like a great idea. Particularly focusing on the data pipeline up to the ML model. One small comment: I think also catering to less high-performance audiences would probably be helpful to a lot of people. My workflow, and I imagine that of many other people, is currently all on a local server, far away from the pipeline @nbren12 is building. I personally would find it helpful to have sessions on deciding whether it makes sense to port my workflow to the cloud and how to get started (for cloud-noobs like me). It might also be interesting to talk about the workflow from model/observation netCDF to keras/pytorch dataloader. Here is a rough outline of my current workflow:
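A minimal sketch of the netCDF-to-PyTorch-dataloader step mentioned above (illustrative only, not the commenter's workflow; it assumes xarray and torch are installed, and the file path, variable names, and "time" dimension are hypothetical):

```python
# Minimal sketch: serve samples from a netCDF file to a PyTorch DataLoader.
# Assumptions: xarray and torch are installed; "training.nc" and the
# variable/dimension names ("x", "y", "time") are hypothetical.
import xarray as xr
import torch
from torch.utils.data import Dataset, DataLoader

class NetCDFDataset(Dataset):
    """Serve (input, target) pairs sliced along the time dimension."""

    def __init__(self, path, input_var="x", target_var="y"):
        self.ds = xr.open_dataset(path)
        self.input_var = input_var
        self.target_var = target_var

    def __len__(self):
        return self.ds.sizes["time"]

    def __getitem__(self, i):
        sample = self.ds.isel(time=i)
        x = torch.as_tensor(sample[self.input_var].values, dtype=torch.float32)
        y = torch.as_tensor(sample[self.target_var].values, dtype=torch.float32)
        return x, y

loader = DataLoader(NetCDFDataset("training.nc"), batch_size=32, shuffle=True)
```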
Thanks for sharing! Indeed, I think this model is probably optimal for a single researcher and can easily be replicated with a single instance on the cloud. This is what I did at UW and why I said this
However, I would disagree that automated infrastructure is only for "high-performance audiences". The single server model scales poorly for groups larger than 1, and a more automated approach to infrastructure results in more reproducible research IMO. Here is a slide I recently made on the subject: This becomes even more important for complicated ML pipelines, so I think it's important to communicate these trade-offs.
Well, that's great. The workshop could be a perfect opportunity then to teach plebs like myself, who are scared of the cloud, how to scale up!
pfssh...not sure how many plebs can use tfrecords...
That looks good, though I'd split the first stage "Local laptop/VM" into two varieties: (a) "unreproducible environment" and (b) "locally reproducible environment". Those two cases differ depending on whether results were run on whatever conda or pip packages happened to be installed, which itself is the result of some complex and unknown history of installations over time (a), or whether the environment has been captured in a pinned, reproducible, and hopefully minimal way (b). I think most results come from an environment of type (a), and I think the biggest increase in reproducibility comes from going from (a) to (b), because the conda or pip dependencies are generally the most specific to data science, the most quickly changing, and the most likely to affect the results, compared to all the other libraries on the system. After going from (a) to (b), then going to Docker or Kubernetes/CI achieves further reproducibility, but it's not as big a jump as simply pinning to make an environment reproducible locally...
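For concreteness, the jump from (a) to (b) mostly comes down to recording the exact package versions that produced a result. Here is a minimal sketch of one way to do that from Python, assuming a pip-managed environment and Python 3.8+; `pip freeze` or `conda env export` accomplish the same thing from the shell, and the lock-file name below is arbitrary.

```python
# Minimal sketch: write a lock file of the exact package versions installed in
# the current environment, so a result can later be re-run against the same
# versions (e.g. with `pip install -r requirements-lock.txt`).
# Assumes Python 3.8+ (importlib.metadata) and a pip-managed environment.
from importlib import metadata

dists = sorted(metadata.distributions(), key=lambda d: d.metadata["Name"].lower())
with open("requirements-lock.txt", "w") as f:
    for dist in dists:
        f.write(f"{dist.metadata['Name']}=={dist.version}\n")
```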
@jbednar I agree. (a) to (b) is a quantum leap. Unfortunately, it's hard to verify whether someone else's software project is of type (a) or (b) without CI. I'm not sure if CI is something vital to ML pipeline development though. Maybe 50-50 reproducibility is close enough...
Using CI to force an escape from Schrödinger's reproducibility! :-) In practice we too use CI to ensure that we're in case (b) and not case (a) (see examples.pyviz.org), but it's at least possible to do the same by just passing the project to another colleague...
Sometimes it's easier to be friends with Travis, haha!
I like the idea of this workshop and I think it would be useful - especially since so much of our time is spent on the data prep steps compared to the actual ML modeling. Here are my answers to your questions:
Thanks for sharing, Jeff. This is unrelated, but I've been thinking a lot lately about the concept of MLOps. IMO, the devops world is pretty far ahead of the scientific community when it comes to reproducibility, since a lack of reproducibility has much higher consequences in the commercial world. I wonder if any of it translates to the academic context? Edit: corrected name
I want to follow up on a short discussion we had at today's Pangeo ML group meeting. There is still interest in the workshop, but no one has volunteered to take the lead on organizing the event, likely because of the time commitment involved. We also discussed the scope of the workshop, which could be very wide-ranging but would ideally focus on some essentials. Two questions for the group:
I hope the answers to these questions help us focus priorities for the workshop. December is not going to be a realistic date at this point, but spring may be a promising time, especially if it is 1/2 days and virtual.
Hi @djgagne, although I don't have sufficient experience in organizing workshops, I am interested in assisting with this one! Please feel free to let me know if there is anything I can contribute to.
@jhamman had the great idea in today's meeting of organizing an independent workshop on pangeo + ML to occur towards the end of the year.
I think this is a great opportunity to focus our thoughts into a coherent story, and recommend some potent infrastructure/know-how combinations for ML research in the geosciences.
Other workshops have mostly focused on clean ML datasets, but this workshop could focus on producing them. We said something like "Constructing ML Pipelines" would be a natural title.
My $0.02 is that the best practice will depend on the organizational/team context. For example, I sometimes dream at night about using a filesystem like GLADE, but my team doesn't have access to that kind of machine. While it would be great to emphasize a common toolkit, I think we should point out divergence points and make strong suggestions.
As a start, it would be great to gather some brief impressions about the ML pipelines this group is building. I'm including my own answers as a guide below:
Google Cloud Platform.
Google Cloud Storage.
netCDF, zarr, and pickle files (see the zarr example after these answers).
Google Cloud Dataflow (Apache Beam), K8s jobs.
Fewer than 10, orchestrated using a mix of methods including Argo and custom scripting systems; both launch/manage pods on a K8s cluster.
Docker containers with Anaconda inside them.
10s of TB of input, 10s of GB of final processed data, and TBs of intermediate data.
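To make the storage and format answers above concrete, here is a minimal sketch of reading one of those zarr stores directly from Google Cloud Storage; it assumes xarray, zarr, and gcsfs/fsspec are installed, and the bucket path is hypothetical.

```python
# Minimal sketch: open a zarr store straight from Google Cloud Storage with
# xarray. Assumes xarray, zarr, and gcsfs/fsspec are installed; the bucket
# path is hypothetical.
import fsspec
import xarray as xr

store = fsspec.get_mapper("gs://my-bucket/training-data.zarr")
ds = xr.open_zarr(store, consolidated=True)  # assumes consolidated metadata was written
print(ds)
```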