RFC: Improve OCI Image Python Tooling #388
Labels
enhancement
New feature or request
kubernetes
kubernetes and volcano schedulers
RFC
Request for Feedback & Roadmaps
slurm
slurm scheduler
Description
Quite a few cloud services and cluster tools for running ML jobs use OCI/Docker containers, so I've been looking into how to make working with them easier.
Container-based services:
TorchX currently supports patching existing images so that users can iterate quickly and then launch a training job. A patch simply overlays files from the local directory on top of a base image. Our current patching implementation relies on a local Docker daemon to build the patch layer and push it: https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L437-L493
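As a rough illustration of what "overlaying files" means at the image level, here is a minimal sketch of packing a local directory into a layer blob using only the standard library. It assumes the OCI image layout conventions (a gzipped tar, a `digest` over the compressed bytes, a `diff_id` over the uncompressed tar). The function name `build_patch_layer` and the `dest_prefix` parameter are hypothetical and are not part of TorchX.

```python
import gzip
import hashlib
import io
import os
import tarfile
from typing import Tuple


def build_patch_layer(src_dir: str, dest_prefix: str = "") -> Tuple[bytes, str, str]:
    """Pack src_dir into a gzipped tar layer.

    Hypothetical helper: returns (blob, digest, diff_id), where digest is
    the sha256 of the compressed blob and diff_id is the sha256 of the
    uncompressed tar stream.
    """
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for root, dirs, files in os.walk(src_dir):
            dirs.sort()  # walk subdirectories in a stable order
            for name in sorted(files):
                path = os.path.join(root, name)
                rel = os.path.relpath(path, src_dir)
                # Paths inside the layer are relative and use forward slashes.
                arcname = os.path.join(dest_prefix, rel).replace(os.sep, "/")
                tar.add(path, arcname=arcname, recursive=False)
    tar_bytes = raw.getvalue()
    # mtime=0 keeps the gzip header from embedding the current time.
    blob = gzip.compress(tar_bytes, mtime=0)
    digest = "sha256:" + hashlib.sha256(blob).hexdigest()
    diff_id = "sha256:" + hashlib.sha256(tar_bytes).hexdigest()
    return blob, digest, diff_id
```

Files in the resulting layer shadow files at the same paths in the base image once the layer is stacked on top. The sketch only handles regular files and keeps their on-disk metadata (owner, mode, mtime); a real implementation would probably want to normalize those.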
Ideally we could build a patch layer and push it in pure Python, without requiring a local Docker instance, since that's an extra burden on ML researchers and users. Building a patch should be fairly straightforward since it's just appending a layer to the image. Pushing will require some ability to talk to the registry to download and upload containers.
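To make the "appending a layer" step concrete, here is a minimal sketch that operates on already-downloaded image metadata. It assumes an OCI image manifest and image config parsed into plain dicts, and a layer blob whose digest, size, and diff_id were computed elsewhere. The function name `append_layer` is hypothetical, and talking to the registry (fetching the manifest, uploading blobs) is deliberately left out.

```python
import copy
import hashlib
import json
from typing import Tuple

# Media type for a gzipped tar layer in the OCI image spec.
LAYER_MEDIA_TYPE = "application/vnd.oci.image.layer.v1.tar+gzip"


def append_layer(
    manifest: dict,
    config: dict,
    layer_digest: str,
    layer_size: int,
    diff_id: str,
) -> Tuple[dict, bytes]:
    """Return (new_manifest, new_config_bytes) with one extra layer on top.

    Hypothetical helper: the inputs are left unmodified.
    """
    new_config = copy.deepcopy(config)
    # The config lists the uncompressed digest of every layer, in order.
    new_config["rootfs"]["diff_ids"].append(diff_id)
    # Only extend history if the base image already records one.
    if "history" in new_config:
        new_config["history"].append({"created_by": "patch layer"})
    config_bytes = json.dumps(new_config, sort_keys=True).encode("utf-8")

    new_manifest = copy.deepcopy(manifest)
    new_manifest["layers"].append(
        {"mediaType": LAYER_MEDIA_TYPE, "digest": layer_digest, "size": layer_size}
    )
    # The manifest references the config blob by digest and size,
    # so both must be updated after the config changes.
    new_manifest["config"]["digest"] = "sha256:" + hashlib.sha256(config_bytes).hexdigest()
    new_manifest["config"]["size"] = len(config_bytes)
    return new_manifest, config_bytes
```

After this, a push would upload three things: the new layer blob, the new config blob, and the new manifest. The existing base layers are already in the registry and can be referenced as-is. Images stored with Docker's own manifest media types would need the matching Docker layer media type instead of the OCI one assumed here.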
OCI containers seem like a logical choice for packaging ML training jobs and apps, but as far as I can see the current Python tooling is fairly lacking. Making these images easier to work with will likely help with the cloud story.
Detailed Proposal
Create a library for Python to manipulate OCI images with the following subset of features:
Non-goals:
Alternatives
Additional context/links
There is an existing oci-python library, but it's at a fairly early stage. We may be able to build on it to enable this.
I opened an issue there as well: vsoch/oci-python#15