crfs: start of a README / design doc of sorts
Updates golang/go#30829

Change-Id: I8790dfcd30e3fb4d68b6e4cb9f8baf44c45d2cd6
Reviewed-on: https://go-review.googlesource.com/c/build/+/167392
Reviewed-by: Brad Fitzpatrick <[email protected]>
bradfitz committed Mar 14, 2019
1 parent 3bfcc9b commit 2deca50
Showing 1 changed file with 143 additions and 0 deletions.
143 changes: 143 additions & 0 deletions crfs/README.md
@@ -0,0 +1,143 @@
# CRFS: Container Registry Filesystem

Discussion: https://github.com/golang/go/issues/30829

## Overview

**CRFS** is a read-only FUSE filesystem that lets you mount a
container image, served directly from a container registry (such as
[gcr.io](https://gcr.io/)), without pulling it all locally first.

## Background

Starting a container should be fast. Currently, however, starting a
container in many environments requires doing a `pull` operation from
a container registry to read the entire container image from the
registry and write the entire container image to the local machine's
disk. It's pretty silly (and wasteful) that a read operation becomes a
write operation. For small containers, this problem is rarely noticed.
For larger containers, though, the pull operation quickly becomes the
slowest part of launching a container, especially on a cold node.
Contrast this with launching a VM on major cloud providers: even with
a VM image that's hundreds of gigabytes, the VM boots in seconds.
That's because the hypervisors' block devices are reading from the
network on demand. The cloud providers all have great internal
networks. Why aren't we using those great internal networks to read
our container images on demand?

## Why does Go want this?

Go's continuous build system tests Go on [many operating systems and
architectures](https://build.golang.org/), using a mix of containers
(mostly for Linux) and VMs (for other operating systems). We
prioritize fast builds, targeting 5 minute turnaround for pre-submit
tests when testing new changes. For isolation and other reasons, we
run all our containers in single-use fresh VMs. Generally our
containers do start quickly, but some of our containers are very large
and take a long time to start. To work around that, we've automated
the creation of VM images where our heavy containers are pre-pulled.
This is all a silly workaround. It'd be much better if we could just
read the bytes over the network from the right place, without all
the hoops.

## Tar files

One reason that reading the bytes directly from the source on demand
is non-trivial is that container images are, somewhat regrettably,
represented by *tar.gz* files: tar files are unindexed, and gzip
streams are not seekable. This means that trying to read 1KB out of a
file named `/var/lib/foo/data` can still involve pulling hundreds of
gigabytes, because you have to uncompress the stream and decode the
tar file entry by entry until you find the one you're looking for.
You can't look it up by its path name.
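
To make the cost concrete, here is a minimal Go sketch of the only lookup strategy a plain *tar.gz* allows: decompress from byte 0 and walk the entries in order. The `file` type and the `buildTarGz` and `findEntry` helpers are hypothetical names made up for this illustration, not part of CRFS.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"errors"
	"fmt"
	"io"
)

// file is a hypothetical (name, contents) pair used to build a test archive.
type file struct{ name, data string }

// buildTarGz returns an in-memory tar.gz holding files in the given order.
func buildTarGz(files []file) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	tw := tar.NewWriter(zw)
	for _, f := range files {
		hdr := &tar.Header{Name: f.name, Typeflag: tar.TypeReg, Mode: 0644, Size: int64(len(f.data))}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write([]byte(f.data)); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// findEntry decompresses the archive from the start and walks the tar
// entries in order. It returns the contents of name and the number of
// entries that had to be decoded to reach it.
func findEntry(targz []byte, name string) ([]byte, int, error) {
	zr, err := gzip.NewReader(bytes.NewReader(targz))
	if err != nil {
		return nil, 0, err
	}
	defer zr.Close()
	tr := tar.NewReader(zr)
	scanned := 0
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return nil, scanned, errors.New("not found: " + name)
		}
		if err != nil {
			return nil, scanned, err
		}
		scanned++
		if hdr.Name == name {
			data, err := io.ReadAll(tr)
			return data, scanned, err
		}
	}
}

func main() {
	archive, err := buildTarGz([]file{
		{"etc/hosts", "127.0.0.1 localhost\n"},
		{"usr/bin/tool", "pretend this is large"},
		{"var/lib/foo/data", "wanted"},
	})
	if err != nil {
		panic(err)
	}
	data, scanned, err := findEntry(archive, "var/lib/foo/data")
	if err != nil {
		panic(err)
	}
	fmt.Printf("read %q after decoding %d entries\n", data, scanned)
}
```

Every byte before the wanted entry has to be fetched and decompressed, even though none of it is used.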

## Introducing Stargz

Fortunately, we can fix the fact that *tar.gz* files are unindexed and
unseekable while still producing a valid *tar.gz* file. This takes
advantage of two properties: you can concatenate tar files together
to make a valid tar file, and you can concatenate multiple gzip
streams together and still have a valid gzip stream.

We introduce a format, **Stargz**, a **S**eekable
**tar.gz** format that's still a valid tar.gz file for everything else
that's unaware of these details.

In summary:

* The traditional `*.tar.gz` format is: `GZIP(TAR(file1 + file2 + file3))`
* Stargz's format is: `GZIP(TAR(file1)) + GZIP(TAR(file2)) + GZIP(TAR(file3_chunk1)) + GZIP(TAR(file3_chunk2)) + GZIP(TAR(index of earlier files in magic file))`, where the trailing ZIP-like index contains offsets for each file/chunk's GZIP header in the overall **stargz** file.

This makes images a few percent larger (due to more gzip headers and
loss of compression context between files), but it's plenty
acceptable.
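
The gzip half of this can be sketched in Go. The sketch compresses each part as its own gzip stream, concatenates them, and shows that the result both decodes as one ordinary gzip stream and lets a single member be decoded on its own once its byte range is known. The tar wrapping and the index entry are omitted, and `gzipMember`, `concatMembers`, and `gunzip` are hypothetical helper names, not the CRFS API.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipMember compresses data as one complete, independent gzip stream.
func gzipMember(data string) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(data)); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// concatMembers compresses each part separately and concatenates the results.
// offsets[i] is where member i starts; the final element is the total length.
func concatMembers(parts []string) ([]byte, []int, error) {
	var blob []byte
	offsets := []int{0}
	for _, p := range parts {
		m, err := gzipMember(p)
		if err != nil {
			return nil, nil, err
		}
		blob = append(blob, m...)
		offsets = append(offsets, len(blob))
	}
	return blob, offsets, nil
}

// gunzip decompresses every gzip member found in b. Go's gzip.Reader
// treats concatenated members as a single stream by default.
func gunzip(b []byte) (string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(b))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	out, err := io.ReadAll(zr)
	return string(out), err
}

func main() {
	blob, offsets, err := concatMembers([]string{"file1 ", "file2 ", "file3"})
	if err != nil {
		panic(err)
	}
	whole, err := gunzip(blob)
	if err != nil {
		panic(err)
	}
	fmt.Printf("whole stream: %q\n", whole)

	// With the member's byte range in hand, nothing before it is read.
	one, err := gunzip(blob[offsets[1]:offsets[2]])
	if err != nil {
		panic(err)
	}
	fmt.Printf("second member only: %q\n", one)
}
```

The `offsets` slice plays the role that the trailing index plays in the format described above: it is what turns the concatenated stream into something seekable.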

## Converting images

If you're using `docker push` to push to a registry, you can't use
CRFS to mount the image. Maybe one day `docker push` will push
*stargz* files (or something with similar properties) by default, but
not yet. So for now we need to convert the storage image layers from
*tar.gz* into *stargz*. There is a tool that does that. **TODO: examples**

## Operation

When mounting an image, the FUSE filesystem makes a couple of Docker
Registry HTTP API requests to the container registry to get the
metadata for the container and all its layers.

It then does HTTP Range requests to read just the **stargz** index out
of the end of each of the layers. The index is stored similarly to how
the ZIP format's TOC is stored, with a pointer to the index at the
very end of the file. Generally it takes 1 HTTP request to read the
index, and never more than 2. In any case, we're assuming a fast network
(GCE VMs to gcr.io, or similar) with low latency to the container
registry. Each layer needs these 1 or 2 HTTP requests, but they can
all be done in parallel.
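
As a rough sketch of the first of those requests, the Go program below asks a server for only the last few bytes of a blob using a suffix `Range` header. It runs against a local `httptest` server standing in for a registry. The `fetchTail` and `newBlobServer` names, the 6-byte tail, and the `FOOTER` payload are all assumptions for illustration; the real footer layout and registry authentication are not covered here.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchTail issues one HTTP Range request for the final n bytes of url.
func fetchTail(client *http.Client, url string, n int64) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// "bytes=-N" is a suffix range: the last N bytes of the resource.
	req.Header.Set("Range", fmt.Sprintf("bytes=-%d", n))
	res, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("server did not honor the Range header: %s", res.Status)
	}
	return io.ReadAll(res.Body)
}

// newBlobServer serves blob with Range support. It is a local stand-in
// for a registry's blob endpoint, not a registry implementation.
func newBlobServer(blob []byte) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.ServeContent(w, r, "layer", time.Time{}, bytes.NewReader(blob))
	}))
}

func main() {
	// A 1 MiB pretend layer with a short trailer at the very end.
	blob := append(bytes.Repeat([]byte("x"), 1<<20), []byte("FOOTER")...)
	srv := newBlobServer(blob)
	defer srv.Close()

	tail, err := fetchTail(srv.Client(), srv.URL, 6)
	if err != nil {
		panic(err)
	}
	fmt.Printf("fetched %d of %d bytes: %q\n", len(tail), len(blob), tail)
}
```

If the tail that comes back already contains the whole index, one request is enough; otherwise the pointer in the tail says which range to ask for in the second request.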

From that, we keep the index in memory, so `readdir`, `stat`, and
friends are all served from memory. For reading data, the index
contains the offset of each file's `GZIP(TAR(file data))` range of the
overall *stargz* file. To make it possible to efficiently read a small
amount of data from large files, there can actually be multiple
**stargz** index entries for large files (e.g. a new gzip stream
every 16MB of a large file).
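
One way to picture the chunked lookup is the sketch below: given a read at some offset in a large file, find the index entry whose gzip stream covers that offset. The `chunk` type, its field names, and the offsets are hypothetical and do not reflect the actual index schema.

```go
package main

import (
	"fmt"
	"sort"
)

// chunk is a hypothetical index entry: one gzip stream covering part of a file.
type chunk struct {
	fileOffset   int64 // offset of the chunk's first byte within the uncompressed file
	stargzOffset int64 // offset of the chunk's gzip header within the stargz blob
}

// chunkFor returns the chunk that holds byte off of the file. chunks must be
// sorted by fileOffset with the first chunk starting at 0. The caller is
// expected to have checked off against the file size already.
func chunkFor(chunks []chunk, off int64) (chunk, bool) {
	if len(chunks) == 0 || off < 0 {
		return chunk{}, false
	}
	// Find the first chunk that starts after off; the one before it covers off.
	i := sort.Search(len(chunks), func(i int) bool { return chunks[i].fileOffset > off })
	if i == 0 {
		return chunk{}, false
	}
	return chunks[i-1], true
}

func main() {
	// Made-up entries for one large file, with a new gzip stream every 16MB.
	chunks := []chunk{
		{fileOffset: 0, stargzOffset: 1000},
		{fileOffset: 16 << 20, stargzOffset: 9000000},
		{fileOffset: 32 << 20, stargzOffset: 17500000},
	}
	c, ok := chunkFor(chunks, 20<<20)
	if !ok {
		panic("no chunk found")
	}
	fmt.Printf("read at 20MB: start the Range request at stargz offset %d, skip %d bytes after decompressing\n",
		c.stargzOffset, int64(20<<20)-c.fileOffset)
}
```

A read in the middle of a large file then costs at most one chunk of download and decompression rather than the whole file.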

## Union/overlay filesystems

CRFS can do the aufs/overlay2-ish unification of multiple read-only
*stargz* layers, but it will stop short of trying to unify a writable
filesystem layer atop. For that, you can just use the traditional
Linux filesystems.
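
The read-only unification can be pictured as a lookup that consults layers from the top down, so a file in a later layer shadows the same path in an earlier one. The sketch below assumes a hypothetical `layer` type mapping paths to contents and deliberately ignores whiteouts (deleted files) and directory merging, which a real overlay has to handle.

```go
package main

import "fmt"

// layer is a hypothetical in-memory view of one read-only layer: path -> contents.
type layer map[string]string

// lookup searches layers from topmost to bottommost and returns the first hit.
// layers[0] is the base layer; the last element is the topmost layer.
func lookup(layers []layer, path string) (string, bool) {
	for i := len(layers) - 1; i >= 0; i-- {
		if v, ok := layers[i][path]; ok {
			return v, true
		}
	}
	return "", false
}

func main() {
	layers := []layer{
		{"etc/motd": "base", "bin/sh": "shell"},
		{"etc/motd": "overridden"},
	}
	for _, p := range []string{"etc/motd", "bin/sh", "no/such/file"} {
		v, ok := lookup(layers, p)
		fmt.Printf("%s: %q (found=%v)\n", p, v, ok)
	}
}
```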

## Using with Docker, without modifying Docker

Ideally container runtimes would support something like this whole
scheme natively, but in the meantime a workaround is that when
converting an image into *stargz* format, the converter tool can also
produce an image variant that only has metadata (environment,
entrypoints, etc) and no file contents. Then you can bind mount in the
contents from the CRFS FUSE filesystem.

That is, the convert tool can do:

**Input**: `gcr.io/your-proj/container:v2`

**Output**: `gcr.io/your-proj/container:v2meta` + `gcr.io/your-proj/container:v2stargz`

What you actually run on Docker or Kubernetes then is the `v2meta`
version, so your container host's `docker pull` or equivalent only
pulls a few KB. The gigabytes of remaining data are read lazily via
CRFS from the `v2stargz` layer directly from the container registry.

## Status

WIP. Enough parts are implemented & tested for me to realize this
isn't crazy. I'm publishing this document first for discussion while I
finish things up. Maybe somebody will point me to an existing
implementation, which would be great.

## Discussion

See https://github.com/golang/go/issues/30829
