-
Notifications
You must be signed in to change notification settings - Fork 247
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
composefs integration in the overlay driver #1646
Conversation
5bb91bc
to
863d808
Compare
Wait...why? What metadata? |
See e.g. https://www.redhat.com/sysadmin/faster-container-image-pulls |
cfea72a
to
8cec35c
Compare
OK, yes I see this JSON format was introduced in a0ff89e A few things here. I think a few choices in this JSON format are suboptimal:
So I would still propose that this JSON format should stay as a c/storage thing, and we have a different format in composefs upstream. (Or, c/storage could switch to a different format, because I believe this tooling in c/storage is still experimental) |
Some xattrs also contain a terminating zero byte. selinux spring to mind, although i may be misremembering. |
Another thing about the times: EROFS seems to (as you'd expect) make "ctime == mtime == atime", because for a read-only system these must all be the same; (and it further optimizes the common case of squashing all timestamps in an image to a common build time). Another way to say this is that it doesn't support any of these times having different values, and hence if we care about this use case (as is implicit in these being different values in the JSON here) we have a problem. I somehow doubt there are container images out in the world where something actually depends on having the atime being honored even if it's encoded in the tar stream. |
Well, I don't want to spend too much time going down this rabbit hole, but it sure looks to me like the default GNU tar option won't serialize atime/ctime, and GNU tar is commonly used to make "base images". And it looks to me like the same is true of the current c/storage code; I don't see it assigning to Anyways so...do we really need to care about atime/ctime? I am doubtful. But regardless again, it can't be in |
Another concern I have with all of this is that we're going to be passing this JSON to C code, which pretty quickly hands it off to libcomposefs (also in C) - and I don't think the current libcomposefs C library is written in a "tries hard to be resistant to malicious input" style. And this JSON may contain malicious input. (I see that But more generally, we should probably try to better define the security boundaries here. In general I think the chromium "Rule of 2" is a good idea. OK, in this case we are at least using the Go JSON parser first, then passing that pre-validated buffer to C, but still. |
the xattrs are encoded in base64 so we can use a string: https://github.com/containers/storage/blob/main/pkg/chunked/compressor/compressor.go#L378 and |
ad4c0d5
to
5fb8a30
Compare
Anyway, if there could be further use cases, some xattrs could be used to keep that instead but I think use cases of [1] https://github.com/google/crfs/blob/master/stargz/stargz.go#L127 |
174a478
to
3ca6fec
Compare
Strictly speaking, Go does not require |
3ca6fec
to
ea0149e
Compare
6de5c18
to
d065864
Compare
,
unix.O_WRONLY|unix.O_CREAT|unix.O_TRUNC|unix.O_EXCL, 0o644)
I need to pass the fd to the writer-json process, it cannot be O_CLOEXEC
Go will dup the fd when passing via the Command struct extra files
|
d065864
to
9d9b07a
Compare
I've changed it to be |
e96e7bd
to
dd1c18d
Compare
} | ||
|
||
func mountErofsBlob(blobFile, mountPoint string) error { | ||
loop, err := loopback.AttachLoopDevice(blobFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not setting the loopback device read-only. I think we should do that. (composefs does)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it should enable direct-io to avoid caching metadata twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And, ideally it should pass "noacl" if the composefs header flag that says there are no acls in the file is set (this will improve performance quite a lot).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh, noacl is in the mount options below, not the loop properties.
} | ||
defer loop.Close() | ||
|
||
return unix.Mount(loop.Name(), mountPoint, "erofs", unix.MS_RDONLY, "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I.e. here, we should have noacl
option if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not even sure if these are supported, I cannot find anywhere in c/storage the posix_acl_access
string. Until it is needed, I've hardcoded ro,noacl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if an OCI layer tar has the ACL xattr set, historically that would have been extracted to the lowerdir file, and when userspace looked at the file it would have seen the ACL, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, I have no idea if ACLs work in the wild at all though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've dropped it for now, let's be safe and handle the optimization later. I'd need to add the check in pkg/chunked and store somewhere that the image has no ACLs.
} | ||
if erofsMount != "" { | ||
if needsIDMapping { | ||
if err := idmap.CreateIDMappedMount(erofsMount, erofsMount, idmappedMountProcessPid); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In mount.composefs I do the idmap as part of the erofs mount, using the mount_setattr syscall. Maybe its hard to do that here though. It needs the new mount api.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I need some refactoring since we already have that code in pkg/idmap
, for now I've left a TODO in the code
|
||
_, _, e1 := syscall.Syscall(unix.SYS_IOCTL, uintptr(fd), uintptr(unix.FS_IOC_ENABLE_VERITY), uintptr(unsafe.Pointer(&enableArg))) | ||
if e1 != 0 { | ||
return fmt.Errorf("failed to enable verity for %q: %w", description, e1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should silently accept EEXIST (fs-verity already enabled for the file).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, couldn't this happen if a layer file is hard-linked from another layer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this could happen with hard link deduplication
I added a few comments, but overall I think this looks good. The main issue I see right now is that it depends on the composefs-from-json helper, which is currenly not installed from composefs. We have to chose how to do this. Alternatives:
I don't know how painful 3 is, calling C is always icky. However, we do already parse the json in the go code, so not having to do that twice is a win. |
dd1c18d
to
2b27981
Compare
The file descriptor was not closed before, thus leaking all the opened files. Signed-off-by: Giuseppe Scrivano <[email protected]>
Signed-off-by: Giuseppe Scrivano <[email protected]>
so that they can be stored by their digest Signed-off-by: Giuseppe Scrivano <[email protected]>
Signed-off-by: Giuseppe Scrivano <[email protected]>
Signed-off-by: Giuseppe Scrivano <[email protected]>
2b27981
to
28f68ab
Compare
I've looked into the comments that do not require touching other packages, I can take care of other stuff later with more targetted PR. The composefs integration though works only with single layered images at the moment, since multiple data layers conflict with each other, so this still needs the "data only" feature in overlay and we can avoid creating the whiteout files as part of them |
Data only overlayfs support has landed and will be in 6.5 btw. |
Given this is not enabled by default currently i think it makes a lot to merge this as is. |
This commit introduces support for ComposeFS using the EROFS filesystem to mount the file system metadata. The current implementation allows each layer to be mounted individually. Only images that are using the zstd:chunked and eStargz format can be used in this way since the metadata is stored in the image itself. In future support for arbitrary images can be added. Signed-off-by: Giuseppe Scrivano <[email protected]>
at the moment it is a best-effort implementation to enable fs-verity for the composefs blob as well as for the data files. Signed-off-by: Giuseppe Scrivano <[email protected]>
28f68ab
to
1c76934
Compare
I've a WIP branch with a CGo implementation: https://github.com/giuseppe/storage/tree/chunked-libcomposefs-Cgo so we could replace the current I'd need to do some sandboxing and ideally re-execute Podman. I've dropped the
I am trying to reduce the number of changes so we can finally merge this PoC and iterate again from here, and also don't want to get back from PTO straight to merge conflicts :-) |
LGTM |
opened a PR to handle the |
ComposeFS has been integrated into the overlay driver. The EROFS filesystem is now used to mount the file system metadata. Each layer is mounted individually, but we can change this later and merge multiple layers into a single file system.
Filesystem-level integrity verification is now enabled through the implementation of fs-verity. This enhancement ensures the integrity of both the composefs blob and data files, providing an additional layer of security. Note that this implementation is currently considered a best-effort attempt since there is no control yet on how to configure it as well as there is no support from the overlay driver in the kernel (it is being worked on upstream).
It requires the
composefs-from-json
tool installed and accessible in the path from the https://github.com/containers/composefs project.