This repository has been archived by the owner on Dec 7, 2023. It is now read-only.

Use venvs to support multiple pangeo-forge-recipes versions #115

Closed
cisaacstern opened this issue Sep 13, 2022 · 11 comments
Assignees
Labels
enhancement New feature or request

Comments

@cisaacstern
Member

Per conversation at yesterday's Coordination meeting, we should use venvs to support multiple versions of pangeo-forge-recipes.

To do so, we'll need to:

  1. Set up venvs in the Dockerfile
  2. Use the pangeo_forge_version passed in meta.yaml as the venv for subprocess calls to pangeo-forge-runner.

A circularity challenge: we need to call pangeo-forge-runner expand-meta to determine the pangeo_forge_version. That happens in two places: here (for test runs) and here (for prod runs). This is problematic because, in the case of dict_objects, pangeo-forge-runner expand-meta imports the recipe, in order to determine the names of the recipes.

This means that we probably need a PR to pangeo-forge-runner which adds a --dont-expand-dict flag to expand-meta, or alternatively simply a different command such as get-meta, which does the same thing as expand-meta, without expanding dict objects. In either case, what we need is a way to ascertain the pangeo_forge_version from meta.yaml without any risk of triggering a recipe import in pangeo-forge-runner.
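As a rough illustration of reading pangeo_forge_version without touching recipe code, here is a minimal sketch. It is not part of pangeo-forge-runner: the function names are hypothetical, the feedstock/meta.yaml location is an assumption, and it uses a simple line match on a top-level key rather than a real YAML parser, which a production version would likely want.

```python
import re
from pathlib import Path
from typing import Optional

# Matches a top-level line such as `pangeo_forge_version: "0.9.0"`.
# Quotes are optional and a trailing comment is tolerated.
_VERSION_LINE = re.compile(
    r"""^pangeo_forge_version:[ \t]*["']?([^"'#\s]+)["']?[ \t]*(?:#.*)?$""",
    re.MULTILINE,
)


def read_pangeo_forge_version(meta_text: str) -> Optional[str]:
    """Return the declared version string, or None if the key is absent or empty."""
    match = _VERSION_LINE.search(meta_text)
    return match.group(1) if match else None


def read_from_feedstock(checkout_dir: str) -> Optional[str]:
    """Read the version from a local checkout; no recipe module is imported."""
    # Assumed layout: <checkout>/feedstock/meta.yaml
    meta_path = Path(checkout_dir) / "feedstock" / "meta.yaml"
    return read_pangeo_forge_version(meta_path.read_text())
```

Because this only reads text, it carries no risk of triggering a recipe import, whichever command (a new flag or a separate get-meta) ends up exposing it.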

An alternate path would be to lift this pangeo_forge_version concern into the orchestrator itself (via a GitHub API call to get the contents of meta.yaml). My sense is that this feature (using a dynamically determined venv as the running environment for pangeo-forge-runner) is useful for all users of pangeo-forge-runner (not just the orchestrator), so it is probably best kept at that layer?

One further thought on venvs: one possibility would be to pre-build venvs for all supported pangeo-forge-recipes releases in the Dockerfile. Another option would be to have pangeo-forge-runner dynamically provision a venv (perhaps even in a tempdir, if that's possible?) based on the meta.yaml. The latter option seems more elegant to me, and more broadly useful for other users of pangeo-forge-runner, though I'm not totally clear how challenging it may be.
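To make the tempdir idea a bit more concrete, here is a sketch using only the standard library's venv and tempfile modules. The helper names are hypothetical, and whether this is fast or robust enough for real runs is untested.

```python
import os
import subprocess
import tempfile
import venv
from pathlib import Path
from typing import List


def venv_python(env_dir: str) -> Path:
    """Path to the interpreter inside a venv (the layout differs on Windows)."""
    subdir = "Scripts" if os.name == "nt" else "bin"
    exe = "python.exe" if os.name == "nt" else "python"
    return Path(env_dir) / subdir / exe


def run_in_temp_venv(requirement: str, python_args: List[str]) -> int:
    """Create a throwaway venv, install `requirement`, then run Python with `python_args`.

    The venv is deleted when the function returns, so nothing persists between runs.
    """
    with tempfile.TemporaryDirectory() as env_dir:
        venv.create(env_dir, with_pip=True)
        python = str(venv_python(env_dir))
        subprocess.check_call([python, "-m", "pip", "install", "-q", requirement])
        return subprocess.call([python, *python_args])
```

A call might look like `run_in_temp_venv("pangeo-forge-recipes==0.9.0", ["-c", "import pangeo_forge_recipes"])` (the version number here is only an example). The cost is a full install on every invocation, which is the trade-off against pre-building venvs in the Dockerfile.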

cc @rabernat @yuvipanda @sharkinsspatial

@sharkinsspatial

sharkinsspatial commented Sep 13, 2022

@cisaacstern and I circled up today to discuss options for venv management. Based on this discussion we determined some short-, medium- and long-term goals and how we might achieve them.

Short term

The goal is an approach which allows @rabernat to easily run full integration tests with experimental tags of pangeo-forge-recipes. As a slight variation on what @cisaacstern described above, we propose:

  1. Modify pangeo-forge-orchestrator to create a venv in a subprocess prior to executing pangeo-forge-recipes commands.
  2. Refactor pangeo-forge-runner to remove install time dependencies on pangeo-forge-recipes.
  3. Refactor pangeo-forge-runner to check whether the version or git tag of pangeo-forge-recipes specified in the recipe's meta.yaml is present, and install it at runtime if it is not.

I have not yet experimented with how runtime usage of pip in a subprocess will interact with the parent venv. If this proves unworkable we may need to rethink this approach and create a small separate tool which parses the pangeo_forge_version from the meta.yaml and prints it to stdout, and then use the CLI to install the appropriate version of pangeo-forge-recipes prior to installing pangeo-forge-runner in the venv. I'm interested to hear thoughts on both approaches. Note that this approach assumes a recipe with simple dependencies (e.g. xarray) that are already satisfied by the dependencies of pangeo-forge-recipes; it will fail for recipes with additional dependencies.

Medium term

To address this dependency issue, minimize issues around base Docker image management, and allow backward compatibility across versions, we propose:

  1. Extend the recipe structure specification to include a requirements.txt or environment.yaml file with additional dependencies not covered by the specified pangeo_forge_version.
  2. Refactor pangeo-forge-runner to install these dependencies into the venv prior to recipe "baking" (import, parsing and submission).
  3. Extend pangeo-forge-runner to create a Docker image using the pangeo_forge_version specified in the meta.yaml and the requirements.txt and push that image to a common repository with a specified tag that can then be used by bakery clusters when the recipe is run.

Long term

@cisaacstern and @yuvipanda have expressed some valid concerns about the security implications of installing packages into a venv at runtime on a production application server. One option they've discussed is running all pangeo-forge-runner commands in an isolated Docker container rather than via a subprocess, but this is not possible with the current Heroku-based deployment model.

  1. Explore alternative hosting and deployment options for pangeo-forge-orchestrator that would allow direct docker run calls.

@sharkinsspatial sharkinsspatial self-assigned this Sep 13, 2022
@yuvipanda

IMO, medium term we should move everything completely to Docker and not use venv at all! I've an alternate suggestion, just getting back into things after taking the day off yesterday - will respond by end of day.

@cisaacstern cisaacstern added the enhancement New feature or request label Sep 14, 2022
@yuvipanda

The core of the problem is this:

  1. Refactor pangeo-forge-runner to remove install time dependencies on pangeo-forge-recipes.

That's because pangeo-forge-runner executes the recipe.py file with an exec here: https://github.com/yuvipanda/pangeo-forge-runner/blob/80b3ee49c4aecffac0e32963edc42b6988813701/pangeo_forge_runner/feedstock.py#L40. It's the core of the code, and I think moving that to an out-of-process executor will be extremely complicated. This exec call or something like it will still need to be there somewhere, as recipe.py is arbitrary Python code. So pangeo-forge-runner will always be tied to the version of pangeo_forge_recipes installed in the same environment, because of this exec call. And because of this arbitrary code execution, I think pangeo-forge-runner should always be counted as untrusted and should run in a fully isolated environment (like Docker) rather than something like a venv.

Here's what I think is a short-term solution:

  1. We modify the dockerfile that deploys the orchestrator to support multiple pangeo-forge-recipes versions via a venv per version
  2. We modify pangeo-forge-runner to error out if the feedstock YAML has a version that doesn't match the version in its current environment, and provide a useful response with the version required. This is stolen from the Python "Easier to ask for forgiveness than permission" (https://devblogs.microsoft.com/python/idiomatic-python-eafp-versus-lbyl/) principle, along with the fact that you need to actually clone the repo first to know what the meta.yaml says. I don't want us to go back to fetching the file from GitHub directly, as then we'd lose the support for sources like Zenodo that we have now.
  3. Orchestrator will by default call pangeo-forge-runner within the venv with the version of pangeo-forge-recipes we most likely think will work, so it works on the first try. If there is a version mismatch, pangeo-forge-runner will tell orchestrator, and we can wrap this in a try / catch and call the correct version (which the pangeo-forge-runner error message will tell us). This should work even after we move to docker.
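The try-then-retry flow in steps 2 and 3 could be sketched roughly as follows. Everything here is hypothetical: VersionMismatch, check_version and run_with_retry are illustrative names, and in practice the runner is a subprocess, so the required version would reach the orchestrator through its error output rather than through a Python exception.

```python
from typing import Callable


class VersionMismatch(Exception):
    """Hypothetical error raised when meta.yaml asks for a different version."""

    def __init__(self, required: str, installed: str):
        super().__init__(
            f"feedstock requires pangeo-forge-recipes {required}, "
            f"but {installed} is installed"
        )
        # Carry the required version so the caller can pick the right venv.
        self.required = required


def check_version(required: str, installed: str) -> None:
    """Runner side: fail fast and say which version is needed."""
    if required != installed:
        raise VersionMismatch(required, installed)


def run_with_retry(run_in_venv: Callable[[str], str], default_version: str) -> str:
    """Orchestrator side: try the most likely venv first, retry once on mismatch."""
    try:
        return run_in_venv(default_version)
    except VersionMismatch as err:
        # The error names the version to use, so a single retry should suffice.
        return run_in_venv(err.required)
```

Here run_in_venv stands in for whatever launches pangeo-forge-runner inside the venv for a given version. A second mismatch on the retry is deliberately left to propagate, since it would indicate a missing venv rather than a wrong guess.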

I think this should be reasonably low hanging fruit to implement from both orchestrator as well as in the runner, and gives us what we want now (venv based support for multiple pangeo-forge-recipes) without blocking longer term more container based approaches.

@yuvipanda

I also think this can all be done soon enough to help @rabernat with the beam refactor, and we can then work on ways to figure out how to containerize this properly. That would involve definitely moving away from heroku to something else, and probably more first-class support for a requirements.txt style workflow.

@cisaacstern
Member Author

cisaacstern commented Sep 15, 2022

@yuvipanda thanks for the thoughtful response.

an exec ... It's the core of the code, and I think moving that to an out of process executor will be extremely complicated. This exec call or something like that will still need to be there somewhere, as recipe.py is arbitrary python code.

To explore the pathway proposed by @sharkinsspatial a bit further, I just wanted to highlight that the proposal was to remove install time dependency on pangeo-forge-recipes, not runtime dependency. As I'd imagined this working, we would delay install of pangeo-forge-recipes until after pangeo-forge-runner had already fetched the feedstock repo, and introspected the required pangeo_forge_version from meta.yaml. At that point, pangeo-forge-runner would dynamically install pangeo-forge-recipes in the current env with a subprocess call, with something like this:

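import logging
import subprocess
import sys

logger = logging.getLogger(__name__)


def install_pangeo_forge_recipes(version):
    # A leading "@" marks a git ref (e.g. "@main"); anything else is a PyPI release.
    pkg = (
        f"pangeo-forge-recipes=={version}"
        if not version.startswith("@")
        else f"git+https://github.com/pangeo-forge/pangeo-forge-recipes.git{version}"
    )
    # Use the running interpreter's pip so the install lands in the current env.
    cmd = [sys.executable, "-m", "pip", "install", "-q", pkg]
    logger.info(f"\nInstalling {pkg} with pip in '--quiet' mode. This may take a few moments...\n")
    subprocess.check_call(cmd)
    logger.info(f"Installed {pkg}!")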

So this would not mean exec is moved out of process, rather just that pangeo-forge-runner modifies its environment at runtime according to the requirements of the particular feedstock.

@cisaacstern
Member Author

cisaacstern commented Sep 15, 2022

pangeo-forge-runner modifies its environment at runtime according to the requirements of the particular feedstock.

An advantage I see here, entirely apart from the possible usefulness for orchestrator, is that if I am just a "regular" pangeo-forge-runner user, trying out recipes from the command line, then I don't need to manually modify my environment each time I call pangeo-forge-runner on a different feedstock.

@yuvipanda

I had misunderstood what dependency you wanted removed, thanks for clarifying that @cisaacstern!

IMO, I want us to do the least amount of work possible right now to unblock @rabernat, and then move towards containerization asap. An additional complication with installing pangeo-forge-recipes at runtime would be that you still need one venv per invocation of pangeo-forge-runner (when called from orchestrator), as otherwise you might have two parallel runs that are both trying to install different versions of pangeo-forge-recipes at the same time! And IMO, that pathway isn't something we should go down, as that'll make us have venv-specific code that we'll have to rip out.

An advantage I see here, entirely apart from the possible usefulness for orchestrator, is that if I am just a "regular" pangeo-forge-runner user, trying out recipes from the command line, then I don't need to manually modify my environment each time I call pangeo-forge-runner on a different feedstock.

IMO, this can also be very conflicting - as a user, I don't want me calling a tool from a package to modify the environment it is in (unless it is explicitly built for that purpose, like pip). This is especially going to be problematic if my local environment has different versions of dependencies (like xarray) than what the pangeo-forge-recipes version depends on.

@yuvipanda

This also makes me have opinions on pangeo-forge/pangeo-forge-runner#24, and I think the appropriate name for pangeo-forge-runner is probably pangeo-forge-recipes-executor, and we can have it work only on local checkouts, and split out the fetching part into another tool that can eventually also have the duties of container building.

@cisaacstern
Member Author

Based on @yuvipanda and my offline chat just now:

  • Complexity will exist either in orchestrator Dockerfile or in dynamic provisioning of envs
  • Given size of our current orchestrator maintainer pool (very small) we prefer shifting risk into failure of dynamic env builds (and therefore perhaps the recipe contributor needing to retry) and away from frequent orchestrator rebuilds (which will bring down all of Pangeo Forge Cloud if they fail, and we don't have a lot of people who are confident in fixing this).
  • To do this, Yuvi proposes splitting pangeo-forge-runner into two tools:
    1. Something that dynamically creates an environment following checkout of a feedstock (think: pre-commit, which creates its own environment before running itself)
    2. Something that does the rest of what pangeo-forge-runner already does
  • For dynamically building these envs, Yuvi suggests we skip requirements.txt and go directly to micromamba + environment.yaml. Something built for requirements.txt would be harder to refactor into something that works for conda, and since we know we need conda dependencies, it is better to just start here.
  • The dynamic env creation tool would eventually also be where worker images could be dynamically built and pushed for each run.
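A minimal sketch of the commands such an environment-provisioning tool might build, assuming micromamba is on the PATH and that the feedstock checkout contains an environment.yaml at its top level (that location is an assumption, not an agreed spec). The helper names are hypothetical. The functions only construct argument lists, leaving execution to the caller, e.g. via subprocess.check_call.

```python
from pathlib import Path
from typing import List


def create_env_cmd(feedstock_dir: str, prefix: str) -> List[str]:
    """Command that creates a conda env at `prefix` from the feedstock's environment.yaml."""
    env_file = Path(feedstock_dir) / "environment.yaml"
    return [
        "micromamba", "create", "--yes",
        "--prefix", prefix,
        "--file", str(env_file),
    ]


def run_in_env_cmd(prefix: str, command: List[str]) -> List[str]:
    """Command that runs `command` inside the env at `prefix`."""
    return ["micromamba", "run", "--prefix", prefix, *command]
```

Giving each run its own prefix would also avoid the concern about two parallel runs installing different versions into the same environment.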

@yuvipanda

yuvipanda commented Sep 15, 2022

Thanks for the very productive conversation, @cisaacstern! <3 working with this group!

I opened pangeo-forge/pangeo-forge-runner#27 to discuss splitting the app. It's a decent refactor that'll take a while.

In the meantime, to unblock @rabernat, maybe we can work on providing instructions so he can run pangeo-forge-runner locally and test things on Dataflow in the Columbia GCP account?

@cisaacstern
Member Author

In the meantime, to unblock @rabernat maybe we can work on providing instructions so he can run pangeo-forge-runner locally so he can run things on the columbia GCP account dataflow to test?

Yes, IMO this is the best path forward. Ryan can use pangeo-forge-runner from the command line to test arbitrary refs of pangeo-forge-recipes on Dataflow. Meanwhile, orchestrator will only support 1 version of pangeo-forge-recipes at a time, until the "environment provisioner" tool described in pangeo-forge/pangeo-forge-runner#27 is available.


3 participants