usecase: building patches on top of an OCI container for ML #15
@d4l3k the oci-python here isn't intended as an implementation of a container runtime or distribution, but rather a way to interact with the standards (e.g., image spec, digests, etc.)
okay so building an image without docker - and in Python! Push and pull from a registry are very easy / doable - that's how I first implemented being able to pull from Docker Hub down to Singularity without needing docker. But actually building the image usually requires some dependencies - e.g., here is an example that uses runc / skopeo / umoci on the backend. One library that I know of implementing the spec to some degree, and in pure Python, is Charliecloud - perhaps that would be a first shot to explore? I also started to work on oras-python, but (on my own) I couldn't quite figure out the design, and started to mimic the Go code, which was a mistake.
I do agree that this particular need is fairly straightforward!
And strangely this does appear to be the case, although I haven't searched fully. So how about this for a proposal: give Charliecloud a try to see if it can work to import and use for different functionality (and do some more searching for others - I'm surprised I couldn't find any in my quick search just now), and if that doesn't work, we can put together a little client. It probably doesn't even need to use oci-python here, because the oci libs are really intended for Go, where you need to define data structures and whatnot in advance. But if we are redundantly validating hashes and whatnot, it wouldn't hurt! I think it would be a good opportunity for me to refactor a bit here. So TL;DR: yes, I'm definitely interested, but make sure that what you need isn't already out there, and probably we can create a new repository for this library.
Thanks for the quick reply! Good to know that push and pull are easy :) When I say build I really mean tarball some files and add them as a layer to an existing manifest. I'm not looking to support anything as complex as a Dockerfile. The current ML packaging solutions tend to either rely on full Docker (which is a bit of a steep onboarding for ML researchers) or a more bespoke solution such as https://cloud.google.com/ai-platform/training/docs/packaging-trainer, which is built around Python packages and is also fairly clunky. For a lot of ML jobs, all you really need to do is take a pre-existing container (such as https://hub.docker.com/r/pytorch/pytorch), slap your model code on top of it, and launch it to a cluster. Just supporting tarballs as a new layer is the minimum needed for that use case. For more advanced stuff a user would likely use a more full-fledged tool like docker to build a new base image. Thanks for the pointer to Charliecloud, I hadn't seen that before.
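(To make the layer-append idea concrete, here is a minimal sketch of what it changes in the two JSON documents; all values are placeholders, and the media type shown is the Docker v2 one - OCI images use application/vnd.oci.image.layer.v1.tar+gzip instead:)

```python
# Illustrative only: appending a patch layer touches the manifest's layers
# array and the config's rootfs.diff_ids; sizes/digests are placeholders.
manifest = {"layers": [], "config": {}}   # existing image manifest
config = {"rootfs": {"diff_ids": []}}     # existing image config

manifest["layers"].append({
    "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
    "size": 1234,                                  # compressed size in bytes
    "digest": "sha256:<digest of the gzipped tar>",
})
config["rootfs"]["diff_ids"].append("sha256:<digest of the uncompressed tar>")
```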
Sure thing! Let me know if you want to work on something, definitely sounds fun :)
Playing around with it right now, definitely will share if I get something working!
Got this working with a mishmash of interfaces. It's not pretty but it works. It'll need some cleanup, and I might see what makes sense to merge back into this repo.

```python
from opencontainers.distribution import reggie
import requests
import io
import tarfile
import hashlib
import os
import os.path
import json
import gzip

dst_name = "torchx"
dst_ref = "tristanr_patched"
src_endpoint = "https://ghcr.io"
dst_endpoint = "https://<id>.dkr.ecr.us-west-2.amazonaws.com"

# Source registry client (pulls the base image from ghcr.io).
src = reggie.NewClient(
    src_endpoint,
    reggie.WithDefaultName("pytorch/torchx"),
)

with open("ecr.passwd", "rt") as f:
    password = f.read()
dst_auth = ("AWS", password)

dst = reggie.NewClient(
    dst_endpoint,
    reggie.WithUsernamePassword("AWS", password),
)

# Fetch the manifest of the base image we want to patch.
req = src.NewRequest(
    "GET",
    "/v2/<name>/manifests/<reference>",
    reggie.WithReference("0.1.2dev0"),
)
resp = src.Do(req)
manifest = resp.json()
print(manifest)

layers = manifest["layers"]
config_digest = manifest["config"]["digest"]


def get_blob_raw(digest):
    req = src.NewRequest("GET", "/v2/<name>/blobs/<digest>", reggie.WithDigest(digest))
    req.stream = True
    return src.Do(req)


def get_blob(digest):
    return get_blob_raw(digest).json()


config = get_blob(config_digest)
print(config)
wd = config["container_config"]["WorkingDir"]

# Build the patch layer: a gzipped tarball with the files to overlay
# onto the image's working directory.
PATCH_FILE = "patch.tar.gz"
with tarfile.open(PATCH_FILE, mode="w:gz") as tf:
    content = b"blah blah"
    info = tarfile.TarInfo(os.path.join(wd, "test.txt"))
    info.size = len(content)
    tf.addfile(info, io.BytesIO(content))


def digest_str(s):
    m = hashlib.sha256()
    m.update(s)
    return "sha256:" + m.hexdigest()


def compute_digest(reader):
    m = hashlib.sha256()
    size = 0
    while True:
        data = reader.read(64000)
        if not data:
            break
        m.update(data)
        size += len(data)
    return "sha256:" + m.hexdigest(), size


# The manifest references the digest of the compressed layer; the config's
# rootfs.diff_ids references the digest of the uncompressed tar.
with open(PATCH_FILE, "rb") as f:
    patch_digest, patch_size = compute_digest(f)
with gzip.open(PATCH_FILE, "rb") as f:
    diff_digest, _ = compute_digest(f)

manifest["layers"].append(
    {
        "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
        "size": patch_size,
        "digest": patch_digest,
    }
)


def blob_exists(digest):
    resp = requests.head(
        dst_endpoint + f"/v2/{dst_name}/blobs/{digest}",
        auth=dst_auth,
    )
    return resp.status_code == requests.codes.ok


def upload(digest, blob):
    if hasattr(blob, "__len__"):
        size = len(blob)
    else:
        size = os.fstat(blob.fileno()).st_size
    print(f"uploading {digest}, len {size}")
    # Monolithic single-POST upload with the digest as a query parameter.
    resp = requests.post(
        dst_endpoint + f"/v2/{dst_name}/blobs/uploads/?digest={digest}",
        data=blob,
        headers={
            "Content-Length": str(size),
        },
        auth=dst_auth,
    )
    resp.raise_for_status()


def upload_manifest(manifest):
    resp = requests.put(
        dst_endpoint + f"/v2/{dst_name}/manifests/{dst_ref}",
        headers={
            "Content-Type": manifest["mediaType"],
        },
        data=json.dumps(manifest),
        auth=dst_auth,
    )
    if not resp.ok:
        print(resp.content)
    resp.raise_for_status()
    print(resp, resp.headers)


# Upload the new patch layer first.
with open(PATCH_FILE, "rb") as f:
    upload(patch_digest, f)


class ResponseReader:
    """File-like wrapper so a streamed download can be re-uploaded."""

    def __init__(self, resp):
        self.resp = resp
        self.mode = "rb"

    def read(self, n):
        return self.resp.raw.read(n)

    def __len__(self):
        return int(self.resp.headers["Content-Length"])


to_upload = [layer["digest"] for layer in manifest["layers"]]

# Record the uncompressed diff id, then upload the updated config blob
# and point the manifest at its new digest.
config["rootfs"]["diff_ids"].append(diff_digest)
config_json = json.dumps(config).encode("utf-8")
config_digest = digest_str(config_json)
upload(config_digest, config_json)
manifest["config"]["digest"] = config_digest

# Stream every layer the destination registry doesn't already have
# from the source registry to the destination.
for digest in to_upload:
    if blob_exists(digest):
        print(f"blob exists {digest}")
        continue
    resp = get_blob_raw(digest)
    reader = ResponseReader(resp)
    upload(digest, reader)

upload_manifest(manifest)
```
Thanks! I should be able to make some time this weekend.
hey @d4l3k! So I've started us a PR where we can hopefully address some of the challenges you faced. I'm not able to test the auth issues, so I'll need your insight / contribution on the PR to fix any bugs that you might have found. For the example that you have above, I think there are two approaches we can take: either we provide more examples in the docs (which I started to do, since some of the interactions above are just fairly straightforward uses of the client), or we provide (if not examples) some kind of helper functions to do these standard interactions. So basically I could imagine some combination of the two.
Let me know your thoughts! Feel free to grab the branch and work on it, or have more discussion here.
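(For a sense of what such helpers could cover, here is a hypothetical surface; the names and signatures are illustrative only, and none of this exists in oci-python today:)

```python
# Hypothetical helper API; everything here is a stub for discussion.

def pull_manifest(client, reference):
    """GET the manifest for a reference and return the parsed JSON."""
    ...

def pull_config(client, manifest):
    """GET the image config blob referenced by the manifest."""
    ...

def push_blob(client, digest, blob):
    """Upload a blob, skipping the upload if the registry already has it."""
    ...

def append_layer(manifest, config, tarball_path):
    """Append a local .tar.gz as a new layer: update the manifest's layers,
    the config's rootfs.diff_ids, and the manifest's config digest."""
    ...
```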
While docs/examples are nice, there are two main improvements that would be nice to have from the library:

1. Fix Basic Auth

Honestly, to simplify you could just rip out the reggie retry logic and use requests' username/password support. Not sure if there's anything special in there; the docker v2 spec doesn't say much about that.

2. Better Credential Handling

i.e. load usernames/passwords for the remote registries from the environment. I'm currently using […]. If they have a credential store configured, that's not possible though.
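(For the credential-handling idea, a sketch of what env-based lookup might look like; the variable-naming scheme and helper are hypothetical, not an existing API:)

```python
import os

def env_auth(registry):
    """Look up basic-auth credentials for a registry host from the
    environment, e.g. OCI_AUTH_GHCR_IO_USERNAME / OCI_AUTH_GHCR_IO_PASSWORD.
    Returns None if either variable is unset."""
    key = registry.upper().replace(".", "_").replace("-", "_")
    user = os.environ.get(f"OCI_AUTH_{key}_USERNAME")
    password = os.environ.get(f"OCI_AUTH_{key}_PASSWORD")
    if user is None or password is None:
        return None
    return (user, password)

# Usage with requests' built-in basic auth support:
#   requests.get(url, auth=env_auth("ghcr.io"))
```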
I would be down for both of those (and I was hoping you'd be interested in contributing via a PR?). I think we could probably maintain the style of functions (e.g., With.X) but use requests on the backend for a simpler approach. And WithEnvAuth is a great idea - if we add it here, we can suggest it to the upstream!
Yup, if I end up going this route for TorchX I'd be happy to submit a PR to fix up the auth stuff. Not sure on the exact timing for that - working on a bunch of stuff in parallel :) Appreciate all the help on this! This project was a big help in getting the proof-of-concept implementation done.
@d4l3k @vsoch hi all! I have an alternate solution: add files directly to an upstream OCI image, using AppendLayer: https://pypi.org/project/appendlayer/

Their use case is Apache Airflow. Mine is more general: I want developers to be able to redeploy a Docker/Kubernetes image with the same nonchalance as saving a file. Hope this helps!
Another alternate solution: Docker (BuildKit) already supports using […]
Both of those sound amenable to me! Thanks @johntellsall
I do think I need a demo - I'm not sure how we would be adding files to the image without downloading it, haha. As in, we are creating a single local layer that would then be pushed with an updated manifest (and the assumption is that the previous layers already exist?). Is there a special flag for that?
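(One relevant registry feature here: if the base layers already exist on the destination registry, the new manifest can simply reference their digests, and the distribution spec's cross-repository blob mount lets you reuse blobs from another repository on the same registry without downloading or re-uploading them. A sketch - the helper and variables are illustrative, but the endpoint is the standard one from the registry API:)

```python
import requests

def mount_blob(endpoint, repo, source_repo, digest, auth):
    """Ask the registry to mount an existing blob into `repo` rather than
    uploading it: POST /v2/<repo>/blobs/uploads/?mount=<digest>&from=<source>.
    A 201 Created response means the mount succeeded."""
    resp = requests.post(
        f"{endpoint}/v2/{repo}/blobs/uploads/",
        params={"mount": digest, "from": source_repo},
        auth=auth,
    )
    return resp.status_code == 201
```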
I'm currently working on https://github.com/pytorch/torchx which is a project trying to make it easier to train and deploy ML models.
Quite a few of the cloud services / cluster tools for running ML jobs use OCI/Docker containers so I've been looking into how to make dealing with these easier.
Container-based services: […]
TorchX currently supports patches on top of existing images to make it fast to iterate and then launch a training job. These patches are just overlaying files from the local directory on top of a base image. Our current patching implementation relies on having a local docker daemon to build a patch layer and push it: https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L437-L493
Ideally we could build a patch layer and push it in pure Python, without requiring any local docker instance, since that's an extra burden on ML researchers/users. Building a patch should be fairly straightforward since it's just appending a layer; pushing will require some ability to talk to the registry to download/upload containers.
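(For reference, the patch-building half really is small. A sketch of tarring a local directory so it overlays a target path inside the image; the paths are illustrative, and a real version would want exclusion rules for things like .git:)

```python
import tarfile

def build_patch_layer(src_dir, out_path="patch.tar.gz", dest_dir="app"):
    """Tar src_dir so its contents land at dest_dir inside the container."""
    with tarfile.open(out_path, mode="w:gz") as tf:
        tf.add(src_dir, arcname=dest_dir)
    return out_path
```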
It seems like OCI containers are a logical choice to use for packaging ML training jobs/apps but the current Python tooling is fairly lacking as far as I can see.
@vsoch curious what your thoughts are and if that's something you'd be interested in having merged into this repo