x/build: speed up large container start-up times without pre-pulling containers into VMs (CRFS) #30829

bradfitz · 2019-03-14T04:40:53Z

Tracking bug for improving how we maintain & deploy our larger builder environment containers easily and quickly while also having them start up quickly.

Our current situation (building a container, pushing to gcr.io, then automating the creation of COS-like VM images that have the image pre-pulled) is pretty gross and tedious.

I propose CRFS: a Container-Registry Filesystem. See design doc at https://github.com/golang/build/tree/master/crfs#crfs-container-registry-filesystem

The gist of it is that we can read bytes from gcr.io directly with a FUSE filesystem, rather than doing huge docker pulls. It's not very hard once you tweak the tarballs into a more amenable format.
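As a rough illustration of the idea (not CRFS's actual code), reading part of a blob over HTTP comes down to a Range request. The sketch below uses an in-process test server in place of a registry, and ignores registry authentication and redirects; `fetchRange`, `demo`, and the blob contents are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// fetchRange reads bytes [off, off+n) of the resource at url using an
// HTTP Range request. It is a hypothetical helper, not part of CRFS.
func fetchRange(url string, off, n int64) ([]byte, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	// The end offset in a Range header is inclusive.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", off, off+n-1))
	res, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("unexpected status %s", res.Status)
	}
	return io.ReadAll(res.Body)
}

// demo serves a fake blob from a local test server and reads 5 bytes
// starting at offset 10.
func demo() (string, error) {
	blob := "0123456789abcdefghij"
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// http.ServeContent handles the Range header for us.
		http.ServeContent(w, r, "blob", time.Time{}, strings.NewReader(blob))
	}))
	defer srv.Close()
	b, err := fetchRange(srv.URL, 10, 5)
	return string(b), err
}

func main() {
	s, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // abcde
}
```

A FUSE read handler could call something like `fetchRange` with the offsets it gets from an index, which is why the tarball format matters.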

gopherbot · 2019-03-14T04:45:59Z

Change https://golang.org/cl/167392 mentions this issue: crfs: start of a README / design doc of sorts

Updates golang/go#30829
Change-Id: I8790dfcd30e3fb4d68b6e4cb9f8baf44c45d2cd6
Reviewed-on: https://go-review.googlesource.com/c/build/+/167392
Reviewed-by: Brad Fitzpatrick <[email protected]>

ktock · 2019-03-14T07:17:31Z

Interesting idea.

As you may know, there are related projects in the container world that aim to make images lightweight and to boot containers faster using lazy pulling and de-duplication.

FILEgrain : https://github.com/AkihiroSuda/filegrain
Slacker : https://www.usenix.org/node/194431

Do you also aim to minimize image size by making each chunk much smaller? For example:

GZIP(TAR(file1_small_chunk1)) + GZIP(TAR(file1_small_chunk2)) + GZIP(TAR(file1_small_chunk3)) + GZIP(TAR(file2_small_chunk1)) + ... + GZIP(TAR(index of earlier files in magic file))
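A minimal sketch of why this layout allows partial reads, assuming each entry is its own gzip member and the member offsets are recorded somewhere (tar framing and the index file are left out for brevity; `buildArchive`, `readMember`, and `readAll` are hypothetical names):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// buildArchive gzips each payload as its own gzip member, appends the
// members back to back, and returns the archive plus each member's
// starting offset.
func buildArchive(payloads []string) ([]byte, []int64, error) {
	var buf bytes.Buffer
	offsets := make([]int64, 0, len(payloads))
	for _, p := range payloads {
		offsets = append(offsets, int64(buf.Len()))
		zw := gzip.NewWriter(&buf)
		if _, err := io.WriteString(zw, p); err != nil {
			return nil, nil, err
		}
		// Close ends this member so the next payload starts a fresh stream.
		if err := zw.Close(); err != nil {
			return nil, nil, err
		}
	}
	return buf.Bytes(), offsets, nil
}

// readMember decompresses only the member that starts at off.
func readMember(archive []byte, off int64) (string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(archive[off:]))
	if err != nil {
		return "", err
	}
	zr.Multistream(false) // stop at the end of this member
	b, err := io.ReadAll(zr)
	return string(b), err
}

// readAll decompresses the whole archive the way an ordinary gzip reader
// would, treating the concatenated members as one stream.
func readAll(archive []byte) (string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(archive))
	if err != nil {
		return "", err
	}
	b, err := io.ReadAll(zr)
	return string(b), err
}

func main() {
	archive, offsets, err := buildArchive([]string{"alpha", "bravo", "charlie"})
	if err != nil {
		panic(err)
	}
	one, _ := readMember(archive, offsets[1])
	all, _ := readAll(archive)
	fmt.Println(one) // bravo
	fmt.Println(all) // alphabravocharlie
}
```

The same bytes serve both readers: a plain gzip reader sees one stream, while a reader that knows the offsets can start in the middle.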

If you make the chunks smaller, you can achieve inter-image de-duplication at the chunk level, as casync and desync do (not only partial pulling).
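To make the chunk-level de-duplication idea concrete, here is a toy sketch of a content-addressed chunk store. It uses fixed-size chunks and SHA-256 purely for illustration; this is an assumption of the example, not how casync or desync actually pick chunk boundaries, and `store` and `add` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// store is a toy content-addressed chunk store keyed by SHA-256 digest.
type store map[[sha256.Size]byte][]byte

// add splits data into chunks of at most chunkSize bytes (chunkSize must
// be positive), stores each distinct chunk once, and returns the digests
// needed to reassemble data in order.
func (s store) add(data []byte, chunkSize int) [][sha256.Size]byte {
	var refs [][sha256.Size]byte
	for len(data) > 0 {
		n := chunkSize
		if n > len(data) {
			n = len(data)
		}
		sum := sha256.Sum256(data[:n])
		if _, ok := s[sum]; !ok {
			// Copy the chunk so the store does not alias the caller's slice.
			s[sum] = append([]byte(nil), data[:n]...)
		}
		refs = append(refs, sum)
		data = data[n:]
	}
	return refs
}

func main() {
	s := store{}
	a := s.add([]byte("AAAABBBBCCCC"), 4) // stand-in for image A
	b := s.add([]byte("AAAABBBBDDDD"), 4) // stand-in for image B
	fmt.Println(len(a), len(b), len(s)) // 3 3 4
}
```

Two inputs of three chunks each end up as four stored chunks, because the first two chunks are shared.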

I have recently been implementing a rough PoC that tackles a similar problem (booting containers faster and minimizing image size).
Additionally, I aim to achieve it without any modification to the runtime or registry, using an init-like program inside the container and a FUSE-in-container-like technique.

container-bootfs : https://github.com/ktock/container-bootfs

Thanks.

dprotaso · 2019-03-14T15:41:51Z

Heyo, don't know if you've seen this: containerd/containerd#2968

Once that settles it should enable creating a crfs 'snapshotter' that skips pulling images and would just perform a FUSE mount.

bradfitz · 2019-03-14T15:59:13Z

@dprotaso, I hadn't seen that. Excellent. Thanks for the link!

bradfitz · 2019-03-14T16:01:04Z

@ktock, while I'm a big fan of content-addressable storage & deduplication (my https://perkeep.org/ project is all about it), it's not my goal with this project to address that. I just want fast boot times here. Storage as far as I'm concerned is free.

dprotaso · 2019-03-14T16:34:44Z

Also you might not need to reinvent the wheel with stargz

https://github.com/samtools/htslib/blob/develop/bgzf.c
https://github.com/biogo/hts/tree/master/bgzf

Another interesting thing from: http://samtools.github.io/hts-specs/SAMv1.pdf

> It is worth noting that there is a known bug in the Java GZIPInputStream class that concatenated gzip archives cannot be successfully decompressed by this class. BGZF files can be created and manipulated using the built-in Java util.zip package, but naive use of GZIPInputStream on a BGZF file will not work due to this bug.

glyn · 2019-03-14T16:58:36Z

I just wanted to check that, if this feature goes ahead, it won't be bundled into the standard library as that seems inappropriate to me.

bradfitz · 2019-03-14T17:04:15Z

@glyn, no, that won't happen. That would be entirely bizarre. The Go team writes a lot of code but very little of it goes into the standard library. I even added the FAQ entry that says we don't want most code in the standard library: https://golang.org/doc/faq#x_in_std

lukasheinrich · 2019-03-14T17:17:45Z

Hi -- just commenting here to link this to an issue within containerd which seems to tackle a similar problem as described here containerd/containerd#2943 (comment)

stevvooe · 2019-03-14T22:52:45Z

@bradfitz This is a very cool hack.

It might be worth just turning off layer compression (easier said than done, but it works with standard docker once you push that way), then just using transport compression when fetching the individual file chunks. That might complicate backend storage a bit, which might have to use a different compression technique, but the images would be runnable by an unmodified docker daemon.

It's at least worth a look. ;)

bradfitz · 2019-03-15T01:09:48Z

@stevvooe, you'd still need an index somewhere. If you already need to push modified or additional layers to hold the index, might as well also compress it all?

gopherbot · 2019-03-15T19:45:47Z

Change https://golang.org/cl/167769 mentions this issue: crfs/stargz: add start of package

Basic API, format, tests. Good enough checkpoint.

Updates golang/go#30829
Change-Id: Iaec5b205314d64fca5056f6b19a7bae52e5cef94
Reviewed-on: https://go-review.googlesource.com/c/build/+/167769
Reviewed-by: Brad Fitzpatrick <[email protected]>

gopherbot · 2019-03-16T05:52:00Z

Change https://golang.org/cl/167920 mentions this issue: crfs/stargz: add basic file reading, chunking big files, more tests, docs

bradfitz · 2019-03-16T20:05:47Z

@stevvooe, my index comment was slightly unrelated in retrospect. You're probably more concerned about runtime CPU usage for decoding gzip for reads, eh? Turning off layer compression should indeed solve that, but would increase the $$$ cost for image storage. And I'm unsure both a) whether gcr.io supports transport compression (probably), and b) whether it's even worth it inside a very fast network.

dmitshur · 2019-03-16T20:41:23Z

@bradfitz I've read the original issue and the linked design doc in full, which helped me understand this better, but I still have an unanswered question about this part:

> The gist of it is that we can read bytes from gcr.io directly with a FUSE filesystem, rather than doing huge docker pulls.

I understand one of the benefits is the ability to stream the container image, so parts of it can start being accessed sooner, instead of waiting for the entire container image to be downloaded before the first byte can be read.

But is there also an advantage that a typical workload would read fewer bytes than the entire container image contains? I.e., only a small subset is typically needed, so the savings are also that fewer bytes need to be downloaded in total?

Updates golang/go#30829
Change-Id: I1ce8c1cbfa580c372341af63ed161e421103fad4
Reviewed-on: https://go-review.googlesource.com/c/build/+/167920
Reviewed-by: Brad Fitzpatrick <[email protected]>

gopherbot · 2019-03-22T02:46:36Z

Change https://golang.org/cl/168737 mentions this issue: crfs/stargz/stargzify: add tool to convert a tar.gz to stargz

gopherbot · 2019-03-22T02:46:37Z

Change https://golang.org/cl/168799 mentions this issue: crfs, stargz: basics of read-only FUSE filesystem, directory support

And in testing converting the Debian base layer I found a hard link, so add enough hardlink support (mostly in TODO form) for the tool to run for now. Proper hardlink support later.

Size stats:

-rw-r--r-- 1 bradfitz bradfitz 51354364 Mar 3 03:32 debian.tar.gz
-rw-r--r-- 1 bradfitz bradfitz 55061714 Mar 21 20:37 debian.stargz

About 7.6% bigger. (Acceptable)

Updates golang/go#30829
Change-Id: I4d76850be68d32ea6e8c2bd81c4233df1b5fc7af
Reviewed-on: https://go-review.googlesource.com/c/build/+/168737
Reviewed-by: Jon Johnson <[email protected]>

No network support yet. But this implements the basic FUSE support reading from a local stargz file.

Updates golang/go#30829
Change-Id: I342e957b3b36cded5aec8b1cdca65c3f5e788db3
Reviewed-on: https://go-review.googlesource.com/c/build/+/168799
Reviewed-by: Maisem Ali <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>

dw · 2019-03-22T21:14:38Z

Hi Brad,

I came via HN :) Cool project, just a few thoughts:

It's possible to do 'solid' compression while retaining the same level of compatibility as done here; the benefit is not resetting the compressor for small files. Looks like regular chunk size also makes it possible to drop at least one TOCEntry field.

Regarding TOCEntry, some kind of sorted array that does not require full decoding, rather than a recursive structure, would make the format far more appealing for reuse, and would also reduce the runtime requirements for any parser.

One place to look for design inspiration might be squashfs; it's solving a similar problem, although its constraints are a little looser. For example, squashfs does not store a single large index; subdirectories have their own separate representation.
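To illustrate the sorted-array suggestion, here is a small sketch of a flat index searched by binary search. The `entry` type and its fields are hypothetical and are not the stargz TOCEntry; an on-disk version would also need fixed-width or offset-addressed records so that a parser can probe entries without decoding the whole index.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a hypothetical flat index record: a full path plus the offset
// of that file's data in the archive.
type entry struct {
	Name   string
	Offset int64
}

// lookup finds name in entries, which must already be sorted by Name.
func lookup(entries []entry, name string) (entry, bool) {
	i := sort.Search(len(entries), func(i int) bool {
		return entries[i].Name >= name
	})
	if i < len(entries) && entries[i].Name == name {
		return entries[i], true
	}
	return entry{}, false
}

func main() {
	entries := []entry{
		{"usr/bin/go", 9000},
		{"bin/ls", 512},
		{"etc/passwd", 4096},
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].Name < entries[j].Name })

	e, ok := lookup(entries, "etc/passwd")
	fmt.Println(e.Offset, ok) // 4096 true
}
```

Each lookup touches O(log n) entries, which is the property that would let a reader avoid loading the full index.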

bradfitz · 2019-03-22T22:36:33Z

@dw, thanks. I was meaning to explore grouping small files together into one gzip stream but first I want to get all the pieces working before I optimize too much. For now a 7% bloat is acceptable.

> Looks like regular chunk size also makes it possible to drop at least one TOCEntry field

Yeah, there's a lot of redundant info in there (including the name, which stores its full path), but I liked the flexibility to perhaps do file-specific chunk sizes in the future based on known access patterns for different types of files.
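As a small worked example of what a chunk size buys a reader: given a chunk size for a file, the chunks that cover a read follow from integer division. The 4 MiB value and the `chunkSpan` name are arbitrary choices for this sketch, not values taken from CRFS.

```go
package main

import "fmt"

// chunkSpan reports the first and last chunk index touched by a read of
// n bytes at offset off, for a file split into chunkSize-byte chunks.
// It assumes n > 0 and chunkSize > 0.
func chunkSpan(off, n, chunkSize int64) (first, last int64) {
	return off / chunkSize, (off + n - 1) / chunkSize
}

func main() {
	const chunkSize = 4 << 20 // 4 MiB, chosen only for illustration
	first, last := chunkSpan(5<<20, 4<<20, chunkSize)
	fmt.Println(first, last) // 1 2
}
```

A per-file chunk size would only change the `chunkSize` argument passed for that file.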

> Regarding TOCEntry, some kind of sorted array that does not require full decoding rather than a recursive structure would make the format far more appealing for reuse, and also reduces the runtime requirements for any parser

Yeah, the JSON is slightly inefficient, but I figured it's okay to just slurp the whole thing in at start-up (for all layers) and keep it all in memory. It's not big (at least for the layers I've seen or work with), so I didn't want to prematurely optimize. But people with millions of files in their layers might not find it as acceptable.
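A minimal sketch of the "slurp it all in at start-up" approach, assuming a JSON index with an `entries` array. The field names below are assumptions made for this example and should not be read as the stargz TOC schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tocEntry is an illustrative subset of what an index entry might carry.
// The field names are assumptions, not the stargz definition.
type tocEntry struct {
	Name   string `json:"name"`
	Size   int64  `json:"size"`
	Offset int64  `json:"offset"`
}

// loadTOC decodes the whole JSON index at once and builds an in-memory
// map from full path to entry.
func loadTOC(data []byte) (map[string]tocEntry, error) {
	var toc struct {
		Entries []tocEntry `json:"entries"`
	}
	if err := json.Unmarshal(data, &toc); err != nil {
		return nil, err
	}
	byName := make(map[string]tocEntry, len(toc.Entries))
	for _, e := range toc.Entries {
		byName[e.Name] = e
	}
	return byName, nil
}

func main() {
	data := []byte(`{"entries":[
		{"name":"bin/ls","size":1024,"offset":512},
		{"name":"etc/passwd","size":90,"offset":4096}
	]}`)
	toc, err := loadTOC(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(toc["etc/passwd"].Offset) // 4096
}
```

Memory use grows with the number of entries, which is the trade-off noted above for layers with millions of files.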

bradfitz · 2019-03-22T22:36:38Z

CRFS is now at https://github.com/google/crfs

cben · 2019-03-26T15:11:39Z

> It might be worth just turning off layer compression (...), then just use transport compression when fetching the individual file chunks

I'm not sure this would be workable.

At large scale, registries might be unhappy to waste storage and CPU compressing on the fly :)
If you mean Transfer-Encoding: gzip, that's rare on servers and is only now being added to the Go HTTP client (net/http: support "gzip" as a Transfer Encoding #29162), but in principle it would be clean: Range requests allow seeking by offsets in the original uncompressed tar file.
Alas, it is gone in HTTP/2 (Transfer-codings httpwg/http2-spec#445), so in the long term it is a dead end :-(
Static Content-Encoding: gzip is what we already have: the server stores the blob pre-compressed, but we can't seek using tar metadata because we don't know how it maps to compressed offsets.
If you mean on-the-fly compression via Content-Encoding: gzip, which is the widely deployed HTTP compression, it's not what you want :-(. The way HTTP defines Content-Encoding essentially matches a static pre-compressed resource, which means Range requests index by compressed offsets, which brings us back to the problem stargz solves.

giuseppe · 2019-05-28T12:38:33Z

I've attempted a hacky integration of CRFS with fuse-overlayfs. I am still playing with it, but it already works as a PoC. It should solve the problem of having a working overlay implementation; more details here: containers/fuse-overlayfs#79

bradfitz · 2019-05-28T17:06:58Z

@giuseppe, nice! I'd been meaning to try fuse-overlayfs as I kept hitting ESTALE errors with the kernel overlayfs against crfs. I did a bunch of work (locally, not pushed) to make sure inode numbers are stable and even added name_to_handle_at/etc wrappers in golang/sys@9f0b1ff as part of debugging it, but I've never been able to make the kernel overlayfs happy for prolonged periods of time. (It works for a bit, but then starts returning ESTALE errors, and IIRC unrelated to dropping caches)

That's why CRFS kinda hit a pause, working through and getting stuck on that.

Your progress unblocks things. Thanks for the demo of using podman, too. That was my hand-wavy plan, to run the overlay-merged container directly with runc or something, but I hadn't started down that path yet.

ktock · 2019-09-25T16:26:17Z

I have recently been working on making CRFS work with overlayfs, which is indispensable for integrating CRFS with container runtimes. I finally found the cause of the ESTALE error and submitted two PRs to fix it. Could anyone help us out by reviewing them?

@giuseppe, are you still working on CRFS? Could you help us with reviewing them?

ktock · 2019-10-10T12:36:25Z

Stargz image now works with containerd! (still, patch needed)

https://github.com/ktock/remote-snapshotter

This is still under active discussion in the containerd community, so please join and help us out!

giuseppe · 2019-10-14T07:33:37Z

> @giuseppe, are you still working on CRFS? Could you help us with reviewing them?

@ktock sorry, I've missed your previous comment.

Yes, I am still interested in supporting CRFS in fuse-overlayfs. While playing with it, I've noticed that the gzip compression is a performance bottleneck with many small files, so I've proposed a change in the OCI specs to support zstd: containers/image#639. I think CRFS can benefit a lot from it.

I've added plugin support to fuse-overlayfs (containers/fuse-overlayfs#119), which can be used to retrieve data for lower layers while leaving management of the writeable upper layer to fuse-overlayfs. I'll have to write one for CRFS.

lukasheinrich · 2019-10-14T07:42:28Z

@giuseppe do you think there is a way to re-use some of the work @ktock has done within podman?

giuseppe · 2019-10-14T16:21:00Z

> @giuseppe do you think there is a way to re-use some of the work @ktock has done within podman?

I think it should be possible. I've not yet looked into the integration with Podman, as I am still playing with the lower-level bits.

DrDaveD · 2019-12-09T19:33:07Z

Also note that the mature and popular CernVM FileSystem is a general purpose caching read-only download-on-demand filesystem that could be helpful here. It includes a tool called DUCC for downloading and installing layers from a docker-type registry. I see @ktock has already heard about it in a remote-snapshotter pull request. A difference is that CVMFS uses a publishing step to prepare all the files, but we have found that doing the extra work up front is well worth it for applications that have orders of magnitude more readers than writers. We are working on a new feature to efficiently merge a new upper layer with previously published registry layers, so we will be able to scale up the publishing of containers while still avoiding layer-merging at run time with an overlay filesystem.

sequix · 2020-02-29T02:03:40Z

Exactly what I need now. I read through the code and the design doc, which mentions no changes to docker. But from what I see, CRFS seems to use FUSE to intercept read/write requests to /var/lib/docker/{image,overlay2} and convert them into HTTP range requests to achieve lazy pulling.

This requires at least a new StorageDriver. Are you planning to merge the new StorageDriver into the master branch of docker?

And how are you going to create the init and rw layers for the container? (Slacker flattens all the layers to create an NFS clone, so it does not face this problem.)

AkihiroSuda · 2020-02-29T02:12:56Z

https://github.com/ktock/stargz-snapshotter

Expected to be adopted by containerd/CRI before Docker

sequix · 2020-02-29T02:16:36Z

Thx for this quick response, I'll learn something more about containerd.

ktock · 2020-04-23T01:21:25Z

FYI: https://github.com/containerd/stargz-snapshotter
Stargz is now available under the containerd org and works on Kubernetes. I also posted a blog about it.

bradfitz self-assigned this Mar 14, 2019

gopherbot added this to the Unreleased milestone Mar 14, 2019

gopherbot added the Builders label Mar 14, 2019

bradfitz changed the title ~~x/build: speed up large container start-up times without pre-pulling containers into VMs~~ x/build: speed up large container start-up times without pre-pulling containers into VMs (CRFS) Mar 15, 2019

bradfitz added the Performance label May 30, 2019

ktock mentioned this issue Sep 25, 2019

Reconsider the project with standardization of OCI Distribution Spec AkihiroSuda/filegrain#21

Open

tazjin mentioned this issue Oct 30, 2019

Support CRFS' stargz format tazjin/nixery#72

Open

bradfitz removed their assignment May 3, 2020

dmitshur added the NeedsInvestigation label May 5, 2020

jonjohnsonjr mentioned this issue Jun 15, 2021

Proposal: Add content-encoding support to spec for posting blobs opencontainers/distribution-spec#235

Open
