From 2deca50d970debd63fb366d0b33900f02482af05 Mon Sep 17 00:00:00 2001
From: Brad Fitzpatrick
Date: Thu, 14 Mar 2019 04:44:48 +0000
Subject: [PATCH] crfs: start of a README / design doc of sorts

Updates golang/go#30829

Change-Id: I8790dfcd30e3fb4d68b6e4cb9f8baf44c45d2cd6
Reviewed-on: https://go-review.googlesource.com/c/build/+/167392
Reviewed-by: Brad Fitzpatrick
---
 crfs/README.md | 143 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 143 insertions(+)
 create mode 100644 crfs/README.md

diff --git a/crfs/README.md b/crfs/README.md
new file mode 100644
index 0000000000..d99def85fd
--- /dev/null
+++ b/crfs/README.md
@@ -0,0 +1,143 @@
# CRFS: Container Registry Filesystem

Discussion: https://github.com/golang/go/issues/30829

## Overview

**CRFS** is a read-only FUSE filesystem that lets you mount a
container image, served directly from a container registry (such as
[gcr.io](https://gcr.io/)), without pulling it all locally first.

## Background

Starting a container should be fast. Currently, however, starting a
container in many environments requires doing a `pull` operation from
a container registry to read the entire container image from the
registry and write the entire container image to the local machine's
disk. It's pretty silly (and wasteful) that a read operation becomes a
write operation. For small containers, this problem is rarely noticed.
For larger containers, though, the pull operation quickly becomes the
slowest part of launching a container, especially on a cold node.
Contrast this with launching a VM on major cloud providers: even with
a VM image that's hundreds of gigabytes, the VM boots in seconds.
That's because the hypervisors' block devices are reading from the
network on demand. The cloud providers all have great internal
networks. Why aren't we using those great internal networks to read
our container images on demand?

## Why does Go want this?

Go's continuous build system tests Go on [many operating systems and
architectures](https://build.golang.org/), using a mix of containers
(mostly for Linux) and VMs (for other operating systems). We
prioritize fast builds, targeting a 5-minute turnaround for pre-submit
tests when testing new changes. For isolation and other reasons, we
run all our containers in single-use fresh VMs. Generally our
containers do start quickly, but some of our containers are very large
and take a long time to start. To work around that, we've automated
the creation of VM images where our heavy containers are pre-pulled.
This is all a silly workaround. It'd be much better if we could just
read the bytes over the network from the right place, without all
the hoops.

## Tar files

One reason that reading the bytes directly from the source on demand
is somewhat non-trivial is that container images are, somewhat
regrettably, represented by *tar.gz* files: tar files are unindexed,
and gzip streams are not seekable. This means that trying to read 1KB
out of a file named `/var/lib/foo/data` still involves pulling and
decompressing potentially hundreds of gigabytes, decoding the entire
tar stream until you find the entry you're looking for. You can't
look it up by its path name.
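To make that cost concrete, here's a minimal sketch (illustrative
only, not part of CRFS; the layer filename and entry path are made up)
of what reading one small file out of a plain *tar.gz* looks like with
Go's standard library today: the only option is to decompress and scan
from the beginning.

```go
package main

import (
	"archive/tar"
	"compress/gzip"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	// Hypothetical layer file and target path, for illustration only.
	f, err := os.Open("layer.tar.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	zr, err := gzip.NewReader(f)
	if err != nil {
		log.Fatal(err)
	}
	defer zr.Close()

	// No index and no seeking: walk every entry, decompressing as we
	// go, until the one we want happens to show up.
	tr := tar.NewReader(zr)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			log.Fatal("entry not found")
		}
		if err != nil {
			log.Fatal(err)
		}
		if hdr.Name != "var/lib/foo/data" {
			continue
		}
		// Even though we only want 1KB, everything stored before this
		// entry has already been decompressed and thrown away.
		buf := make([]byte, 1024)
		n, err := io.ReadFull(tr, buf)
		if err != nil && err != io.ErrUnexpectedEOF {
			log.Fatal(err)
		}
		fmt.Printf("read %d bytes of %s\n", n, hdr.Name)
		return
	}
}
```

The work done is proportional to everything stored before the entry,
not to the 1KB we actually wanted.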
## Introducing Stargz

Fortunately, we can fix the fact that *tar.gz* files are unindexed and
unseekable, while still keeping the file a valid *tar.gz* file, by
taking advantage of a property of both tar files and gzip compression:
you can concatenate tar files together and get a valid tar file, and
you can concatenate multiple gzip streams together and get a valid
gzip stream.

We introduce a format, **Stargz**, a **S**eekable **tar.gz** format
that's still a valid tar.gz file for everything else that's unaware
of these details.

In summary:

* The traditional `*.tar.gz` format is: `GZIP(TAR(file1 + file2 + file3))`
* Stargz's format is: `GZIP(TAR(file1)) + GZIP(TAR(file2)) + GZIP(TAR(file3_chunk1)) + GZIP(TAR(file3_chunk2)) + GZIP(TAR(index of earlier files in magic file))`, where the trailing ZIP-like index contains offsets for each file/chunk's GZIP header in the overall **stargz** file.

This makes images a few percent larger (due to more gzip headers and
loss of compression context between files), but that's plenty
acceptable.

## Converting images

If you're using `docker push` to push to a registry, you can't use
CRFS to mount the image. Maybe one day `docker push` will push
*stargz* files (or something with similar properties) by default, but
not yet. So for now we need to convert the image's stored layers from
*tar.gz* into *stargz*. There is a tool that does that. **TODO: examples**

## Operation

When mounting an image, the FUSE filesystem makes a couple of Docker
Registry HTTP API requests to the container registry to get the
metadata for the container and all its layers.

It then does HTTP Range requests to read just the **stargz** index out
of the end of each of the layers. The index is stored similarly to the
ZIP format's TOC: a pointer to the index sits at the very end of the
file. Generally it takes 1 HTTP request to read the index, but no more
than 2. In any case, we're assuming a fast network (GCE VMs to gcr.io,
or similar) with low latency to the container registry. Each layer
needs these 1 or 2 HTTP requests, but they can all be done in
parallel.

From that point on, we keep the index in memory, so `readdir`, `stat`,
and friends are all served from memory. For reading data, the index
contains the offset of each file's `GZIP(TAR(file data))` range within
the overall *stargz* file. To make it possible to efficiently read a
small amount of data from large files, there can actually be multiple
**stargz** index entries for large files (e.g. a new gzip stream every
16MB of a large file).
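For a rough idea of what that index read looks like, here's an
illustrative sketch, not the actual CRFS code: the blob URL is a
made-up placeholder, the 64KB tail size is an arbitrary guess, and
registry authentication is skipped entirely.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Hypothetical layer blob URL; a real one comes from the Docker
	// Registry API metadata requests described above, and normally
	// requires an auth token, omitted here for brevity.
	const blobURL = "https://registry.example/v2/your-proj/container/blobs/sha256:0000"

	req, err := http.NewRequest("GET", blobURL, nil)
	if err != nil {
		log.Fatal(err)
	}
	// Ask for only the final 64KB of the blob (an arbitrary guess at
	// how much tail is needed to cover the index pointer and index).
	req.Header.Set("Range", "bytes=-65536")

	res, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusPartialContent {
		log.Fatalf("expected 206 Partial Content, got %v", res.Status)
	}

	tail, err := io.ReadAll(res.Body)
	if err != nil {
		log.Fatal(err)
	}
	// The stargz index for this layer would be parsed out of `tail`.
	fmt.Printf("fetched %d tail bytes of the layer\n", len(tail))
}
```

If the index doesn't fit in the guessed tail, one more Range request
covers the rest, which is where the "generally 1 request, no more than
2" figure above comes from.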
## Union/overlay filesystems

CRFS can do the aufs/overlay2-ish unification of multiple read-only
*stargz* layers, but it stops short of trying to unify a writable
filesystem layer on top. For that, you can just use the traditional
Linux filesystems.

## Using with Docker, without modifying Docker

Ideally container runtimes would support something like this whole
scheme natively, but in the meantime a workaround is that when
converting an image into *stargz* format, the converter tool can also
produce an image variant that only has metadata (environment,
entrypoints, etc.) and no file contents. Then you can bind mount in
the contents from the CRFS FUSE filesystem.

That is, the convert tool can do:

**Input**: `gcr.io/your-proj/container:v2`

**Output**: `gcr.io/your-proj/container:v2meta` + `gcr.io/your-proj/container:v2stargz`

What you actually run on Docker or Kubernetes is then the `v2meta`
version, so your container host's `docker pull` or equivalent only
pulls a few KB. The gigabytes of remaining data are read lazily via
CRFS from the `v2stargz` layer, directly from the container registry.

## Status

WIP. Enough parts are implemented & tested for me to realize this
isn't crazy. I'm publishing this document first for discussion while I
finish things up. Maybe somebody will point me to an existing
implementation, which would be great.

## Discussion

See https://github.com/golang/go/issues/30829