
Split this project into two #27

Open · 3 tasks
yuvipanda opened this issue Sep 15, 2022 · 1 comment
Comments

yuvipanda (Collaborator) commented Sep 15, 2022

Based on pangeo-forge/pangeo-forge-orchestrator#115 (comment) and pangeo-forge/pangeo-forge-orchestrator#115 (comment), I think we need to split this project into two.

Part 1

This should be responsible for:

  1. Fetching the appropriate feedstock from wherever it lives (GitHub, Zenodo, etc.) onto the local filesystem
  2. Creating an appropriate environment for the recipe to be parsed and run. This can be via conda or via docker, and must be pluggable (see the sketch below)
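A minimal sketch of what such a pluggable interface might look like. All class and method names here are hypothetical (not an existing API), and the conda invocation is just one way to realize it:

```python
import subprocess
from abc import ABC, abstractmethod
from pathlib import Path


class EnvironmentProvider(ABC):
    """Hypothetical pluggable interface for building the recipe's runtime env."""

    @abstractmethod
    def create(self, spec: Path) -> str:
        """Build an environment from a spec file and return an identifier for it."""


class CondaEnvironmentProvider(EnvironmentProvider):
    def create(self, spec: Path) -> str:
        env_name = "feedstock-env"  # illustrative; derive from the feedstock in practice
        # `conda env create -n <name> -f <environment.yml>` builds a local env
        subprocess.run(
            ["conda", "env", "create", "-n", env_name, "-f", str(spec)],
            check=True,
        )
        return env_name


class DockerEnvironmentProvider(EnvironmentProvider):
    def create(self, spec: Path) -> str:
        # Would build an image from the spec, push it to a registry, and
        # return the image tag (see the registry caveat in the open questions).
        raise NotImplementedError
```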

Most importantly, there should be no arbitrary code execution here. So it can read meta.yaml (carefully hehe) but not exec any .py files. This is what the orchestrator will call.
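For example, the "careful" read of meta.yaml could use PyYAML's safe loader, which only constructs plain Python objects (dicts, lists, strings, numbers) and so can't trigger arbitrary code in the file:

```python
import yaml


def read_meta(path: str) -> dict:
    # yaml.safe_load never instantiates arbitrary Python objects,
    # unlike yaml.unsafe_load or importing/exec-ing a .py file.
    with open(path) as f:
        return yaml.safe_load(f)
```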

It will also not be tied to the version of pangeo-forge-recipes required by the feedstock in question.

Part 2

This should be responsible for actually executing arbitrary user code (in the recipe.py file). This will run in the environment created by part 1, and can be tied to a specific version of pangeo-forge-recipes. This part will be a separate Python package, and should be installed into the environment that part 1 creates for it.
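A rough sketch of what part 2's entry point might look like; the names and output shape are illustrative only, since the actual interface is one of the open questions below:

```python
import json
import runpy
import sys


def main(recipe_path: str) -> None:
    # This runs *inside* the environment built by part 1, so the
    # feedstock-pinned version of pangeo-forge-recipes is importable here.
    ns = runpy.run_path(recipe_path)  # executes the arbitrary user code
    # Report what was found back to part 1 as JSON on stdout.
    names = sorted(k for k in ns if not k.startswith("_"))
    json.dump({"names": names}, sys.stdout)


if __name__ == "__main__":
    main(sys.argv[1])
```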

Open questions

  • The orchestrator (and the end user) talks to part 1 directly, but how will part 1 talk to part 2? My current suggestion is JSON over stdout from part 2 -> part 1, and traitlets config from part 1 -> part 2 (see the sketch after this list).
  • Creating environments will be a messy and difficult task to do right, particularly because we want dataflow / flink to run in the same custom environment that we use for parsing. This is doable with docker (but requires pushing to a registry), but how do we do that for mamba / conda? This is going to end up being pretty complicated IMO.
  • Part 2 becomes responsible for actually submitting the job to beam, so it will need access to credentials for both storage and the bakery.
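To make the first question concrete, here is one possible shape for the part 1 -> part 2 handshake. The `pangeo-forge-runner` CLI name is a placeholder for whatever part 2's entry point ends up being, and the JSON config file stands in for however the traitlets config gets delivered (traitlets can load configuration from a JSON file):

```python
import json
import subprocess
import tempfile


def call_part2(env_name: str, recipe_path: str, config: dict) -> dict:
    """Hypothetical handshake: traitlets-style config in, JSON over stdout out."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        # Part 1 hands part 2 its settings via a config file, so the two
        # packages share no Python state (and can pin different deps).
        json.dump(config, f)
        config_path = f.name
    proc = subprocess.run(
        # `conda run -n <env>` executes the command inside the env from part 1.
        ["conda", "run", "-n", env_name, "pangeo-forge-runner",
         "--config", config_path, recipe_path],
        capture_output=True,
        text=True,
        check=True,
    )
    # Part 2's only contract with part 1 is JSON on stdout.
    return json.loads(proc.stdout)
```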
@sharkinsspatial

@cisaacstern As requested, here is a pointer to our recent experience incorporating arbitrary third-party libs (https://github.com/nsidc/earthdata) while creating recipes for NASA datasets. These datasets require Earth Data Login (EDL) authentication with sessions rather than simple basic authentication, due to the endpoint HTTP redirects that occur when running the recipe in us-west-2.

This is a good example of some of the use cases discussed in pangeo-forge/pangeo-forge-orchestrator#115 (comment).
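For readers unfamiliar with the EDL quirk: a plain basic-auth request breaks because the data endpoint redirects to urs.earthdata.nasa.gov and back, and requests drops the Authorization header when the host changes. The session-subclass pattern NASA documents looks roughly like this (sketched from memory; verify against the EDL docs):

```python
import requests


class SessionWithHeaderRedirection(requests.Session):
    """Keep credentials across redirects to/from the EDL host, strip them otherwise."""

    AUTH_HOST = "urs.earthdata.nasa.gov"

    def __init__(self, username: str, password: str):
        super().__init__()
        self.auth = (username, password)

    def rebuild_auth(self, prepared_request, response):
        # Called by requests on every redirect hop.
        headers = prepared_request.headers
        if "Authorization" in headers:
            original = requests.utils.urlparse(response.request.url).hostname
            redirect = requests.utils.urlparse(prepared_request.url).hostname
            if original != redirect and self.AUTH_HOST not in (original, redirect):
                del headers["Authorization"]
```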
