
Split this project into two #27

Open · 3 tasks
yuvipanda opened this issue Sep 15, 2022 · 1 comment
Comments

yuvipanda (Collaborator) commented Sep 15, 2022

Based on pangeo-forge/pangeo-forge-orchestrator#115 (comment) and pangeo-forge/pangeo-forge-orchestrator#115 (comment), I think we need to split this project into two.

Part 1

This should be responsible for:

  1. Fetching the appropriate feedstock from wherever it lives (GitHub, Zenodo, etc.) onto the local filesystem
  2. Creating an appropriate environment for the recipe to be parsed and run. This can be via conda or via docker, and must be pluggable (see the sketch below)
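A minimal sketch of what such a pluggable interface might look like. All class and method names here are hypothetical (not an existing API), and the conda invocation is just one way to realize it:

```python
import subprocess
from abc import ABC, abstractmethod
from pathlib import Path


class EnvironmentProvider(ABC):
    """Hypothetical pluggable interface for building the recipe's runtime env."""

    @abstractmethod
    def create(self, spec: Path) -> str:
        """Build an environment from a spec file and return an identifier for it."""


class CondaEnvironmentProvider(EnvironmentProvider):
    def create(self, spec: Path) -> str:
        env_name = "feedstock-env"  # illustrative; derive from the feedstock in practice
        # `conda env create -n <name> -f <environment.yml>` builds a local env
        subprocess.run(
            ["conda", "env", "create", "-n", env_name, "-f", str(spec)],
            check=True,
        )
        return env_name


class DockerEnvironmentProvider(EnvironmentProvider):
    def create(self, spec: Path) -> str:
        # Would build an image from the spec, push it to a registry, and
        # return the image tag (see the registry caveat in the open questions).
        raise NotImplementedError
```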

Most importantly, there should be no arbitrary code execution here. So it can read meta.yaml (carefully hehe) but not exec any .py files. This is what the orchestrator will call.
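For example, the "careful" read of meta.yaml could use PyYAML's safe loader, which only constructs plain Python objects (dicts, lists, strings, numbers) and so can't trigger arbitrary code in the file:

```python
import yaml


def read_meta(path: str) -> dict:
    # yaml.safe_load never instantiates arbitrary Python objects,
    # unlike yaml.unsafe_load or importing/exec-ing a .py file.
    with open(path) as f:
        return yaml.safe_load(f)
```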

It will also not be tied to the version of pangeo-forge-recipes required by the feedstock in question.

Part 2

This should be responsible for actually executing arbitrary user code (in the recipe.py file). This will run in the environment created by part 1, and can be tied to a specific version of pangeo-forge-recipes. This part will be a separate Python package, and should be installed into the environment that part 1 creates for it.
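A rough sketch of what part 2's entry point might look like; the names and output shape are illustrative only, since the actual interface is one of the open questions below:

```python
import json
import runpy
import sys


def main(recipe_path: str) -> None:
    # This runs *inside* the environment built by part 1, so the
    # feedstock-pinned version of pangeo-forge-recipes is importable here.
    ns = runpy.run_path(recipe_path)  # executes the arbitrary user code
    # Report what was found back to part 1 as JSON on stdout.
    names = sorted(k for k in ns if not k.startswith("_"))
    json.dump({"names": names}, sys.stdout)


if __name__ == "__main__":
    main(sys.argv[1])
```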

Open questions

  • The orchestrator (and the end user) talks to part 1 directly, but how will part 1 talk to part 2? My current suggestion is JSON over stdout from part 2 -> part 1, and traitlets config from part 1 -> part 2 (see the sketch after this list).
  • Creating environments will be a messy and difficult task to do right, particularly because we want dataflow / flink to run in the same custom environment that we use for parsing. This is doable with docker (but requires pushing to a registry), but how do we do that for mamba / conda? This is going to end up being pretty complicated IMO.
  • Part 2 becomes responsible for actually submitting the job to beam, so it will need access to credentials for both storage and the bakery.
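To make the first question concrete, here is one possible shape for the part 1 -> part 2 handshake. The `pangeo-forge-runner` CLI name is a placeholder for whatever part 2's entry point ends up being, and the JSON config file stands in for however the traitlets config gets delivered (traitlets can load configuration from a JSON file):

```python
import json
import subprocess
import tempfile


def call_part2(env_name: str, recipe_path: str, config: dict) -> dict:
    """Hypothetical handshake: traitlets-style config in, JSON over stdout out."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        # Part 1 hands part 2 its settings via a config file, so the two
        # packages share no Python state (and can pin different deps).
        json.dump(config, f)
        config_path = f.name
    proc = subprocess.run(
        # `conda run -n <env>` executes the command inside the env from part 1.
        ["conda", "run", "-n", env_name, "pangeo-forge-runner",
         "--config", config_path, recipe_path],
        capture_output=True,
        text=True,
        check=True,
    )
    # Part 2's only contract with part 1 is JSON on stdout.
    return json.loads(proc.stdout)
```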
@sharkinsspatial

@cisaacstern As requested, here is a pointer to our recent experience incorporating arbitrary third-party libs (https://github.com/nsidc/earthdata) while creating recipes for NASA datasets. These datasets require Earth Data Login (EDL) authentication with sessions rather than simple basic authentication, due to the endpoint HTTP redirects that occur when running the recipe in us-west-2.

This is a good example of some of the use cases discussed in pangeo-forge/pangeo-forge-orchestrator#115 (comment).
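For readers unfamiliar with the EDL quirk: a plain basic-auth request breaks because the data endpoint redirects to urs.earthdata.nasa.gov and back, and requests drops the Authorization header when the host changes. The session-subclass pattern NASA documents looks roughly like this (sketched from memory; verify against the EDL docs):

```python
import requests


class SessionWithHeaderRedirection(requests.Session):
    """Keep credentials across redirects to/from the EDL host, strip them otherwise."""

    AUTH_HOST = "urs.earthdata.nasa.gov"

    def __init__(self, username: str, password: str):
        super().__init__()
        self.auth = (username, password)

    def rebuild_auth(self, prepared_request, response):
        # Called by requests on every redirect hop.
        headers = prepared_request.headers
        if "Authorization" in headers:
            original = requests.utils.urlparse(response.request.url).hostname
            redirect = requests.utils.urlparse(prepared_request.url).hostname
            if original != redirect and self.AUTH_HOST not in (original, redirect):
                del headers["Authorization"]
```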
