RFC: Improve OCI Image Python Tooling #388

Open
d4l3k opened this issue Feb 11, 2022 · 1 comment
Labels
enhancement New feature or request kubernetes kubernetes and volcano schedulers RFC Request for Feedback & Roadmaps slurm slurm scheduler

Comments

d4l3k (Member) commented Feb 11, 2022

Description

Quite a few of the cloud services and cluster tools for running ML jobs use OCI/Docker containers, so I've been looking into how to make working with them easier.

Container-based services:

TorchX currently supports applying patches on top of existing images to make it fast to iterate and then launch a training job. These patches simply overlay files from the local directory on top of a base image. The current patching implementation relies on a local docker daemon to build the patch layer and push it: https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L437-L493

Ideally we could build a patch layer and push it in pure Python without requiring a local docker instance, since that's an extra burden on ML researchers/users. Building a patch should be fairly straightforward since it's just appending a layer to the image; pushing will require some ability to talk to the registry to download/upload containers.
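To make the "building a patch is just appending a layer" point concrete, here's a minimal sketch of the layer-building half using only the standard library. `build_patch_layer` is a hypothetical helper for illustration, not existing TorchX code:

```python
import gzip
import hashlib
import io
import os
import tarfile
from typing import Tuple


def build_patch_layer(workspace_dir: str) -> Tuple[bytes, str, str]:
    """Pack a local directory into a gzipped tar layer (hypothetical helper).

    Returns the compressed layer blob plus the two digests the OCI image
    spec needs: the digest of the compressed blob (referenced from the
    manifest) and the digest of the uncompressed tar (the "diff_id"
    recorded in the image config).
    """
    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w") as tar:
        for root, _dirs, files in os.walk(workspace_dir):
            for name in files:
                path = os.path.join(root, name)
                tar.add(path, arcname=os.path.relpath(path, workspace_dir))
    uncompressed = tar_buf.getvalue()
    diff_id = "sha256:" + hashlib.sha256(uncompressed).hexdigest()

    blob = gzip.compress(uncompressed)
    blob_digest = "sha256:" + hashlib.sha256(blob).hexdigest()
    return blob, blob_digest, diff_id
```

Attaching the layer would then amount to appending the blob digest/size to the manifest's `layers` list and the diff_id to `rootfs.diff_ids` in the image config, and re-uploading both to the registry.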

OCI containers seem like a logical choice for packaging ML training jobs/apps, but the current Python tooling is fairly lacking as far as I can see. Making it easier to work with OCI images will likely help with the cloud story.

Detailed Proposal

Create a Python library for manipulating OCI images with the following subset of features:

  • download/upload images to OCI repos
  • append layers to OCI images
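For the download/upload side, the OCI distribution spec is plain HTTP(S), so no docker daemon is needed. A hedged sketch of fetching a manifest (the function name is hypothetical; auth token negotiation and error handling are omitted):

```python
import json
import urllib.request

# Media types accepted for both OCI and Docker v2 manifests.
ACCEPT = ", ".join([
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.docker.distribution.manifest.v2+json",
])


def fetch_manifest(registry: str, repo: str, ref: str, token: str = "") -> dict:
    # GET /v2/<repo>/manifests/<ref> per the OCI distribution spec.
    req = urllib.request.Request(
        f"https://{registry}/v2/{repo}/manifests/{ref}",
        headers={"Accept": ACCEPT},
    )
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Uploading a new layer and updated manifest would use the blob-upload (`/v2/<repo>/blobs/uploads/`) and manifest PUT endpoints from the same spec.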

Non-goals:

  • Execute containers
  • Dockerfiles

Alternatives

Additional context/links

There is an existing oci-python library, but it's fairly early-stage. We may be able to build upon it to enable this.

I opened an issue there as well: vsoch/oci-python#15

@d4l3k d4l3k added enhancement New feature or request kubernetes kubernetes and volcano schedulers slurm slurm scheduler RFC Request for Feedback & Roadmaps labels Feb 11, 2022
@d4l3k d4l3k changed the title Improve OCI Image Tooling RFC: Improve OCI Image Python Tooling Feb 11, 2022
Migsi commented Jan 23, 2023

I think I stumbled across this limitation just now. I was trying to get torchx running with a fresh k8s cluster using CRI-O instead of docker/containerd as the runtime, and it always fails when trying to pull the image (which I imagine is only the first of a few "problematic" steps).

~$ torchx run -s kubernetes dist.ddp --script compute_world_size/main.py -j 1x1
torchx 2023-01-23 14:47:40 INFO     loaded configs from /home/user/playground/torchx_examples/torchx/examples/apps/.torchxconfig
torchx 2023-01-23 14:47:40 INFO     Checking for changes in workspace `file:///home/user/playground/torchx_examples/torchx/examples/apps`...
torchx 2023-01-23 14:47:40 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2023-01-23 14:47:40 INFO     Workspace `file:///home/user/playground/torchx_examples/torchx/examples/apps` resolved to filesystem path `/home/user/playground/torchx_examples/torchx/examples/apps`
torchx 2023-01-23 14:47:40 WARNING  failed to pull image ghcr.io/pytorch/torchx:0.4.0, falling back to local: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
torchx 2023-01-23 14:47:40 INFO     Building workspace docker image (this may take a while)...

... [trace left out, can attach it if required]

Could you please confirm this is actually related to the issue you are describing? If it is, would it be enough to install the docker runtime in parallel, just to get the toolchain back up and running? Also, are there any other steps required to get such a setup working?

Best regards
