Introduce new type PAYLOAD_LINK #1443

giuseppe · 2018-02-02T13:34:21Z

It is used to keep track of the payload checksum for files stored in the repository.

The goal is that files having the same payload but different xattrs can take advantage of reflinks where supported.

More details here: https://mail.gnome.org/archives/ostree-list/2018-January/msg00012.html

cgwalters · 2018-02-03T15:24:09Z

I wonder if it'd be simpler to add a special xattr that we know about and also filter out.

giuseppe · 2018-02-05T15:57:10Z

How could we use the xattr to look up the payload checksum?

cgwalters · 2018-02-05T18:07:48Z

> How could we use the xattr to look up the payload checksum?

Yeah, that's an issue. We could scan all objects, but that'd get slow unless amortized.

So...hmm. A big picture question here in my mind is the (still unsettled?) degree to which libostree provides low-level APIs for the OCI case, versus doing things at a higher level.

If we provide high level APIs, we're more free to change/optimize the implementation details later.

One other random thought here - how about only doing this for say files over 5MB (or some configurable threshold) instead? Trading off indexing overhead versus space savings.

Another issue here is I think it needs to be opt-in; otherwise we're taking up extra space for "pure libostree" users who aren't doing containers/rpms/whatever on top.

giuseppe · 2018-02-06T10:48:31Z

Yes, good point. I've added a new commit that implements a minimum threshold configuration: the payload link will be created only for files whose size is greater than or equal to the threshold.

Are you fine with a default value of 3 MiB?

If you prefer to disable this feature by default, we can set a much bigger default value, which has the same effect.
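To make the "greater than or equal" rule concrete, here is a minimal shell sketch. The helper name `wants_payload_link` is hypothetical and this is not how libostree implements the check in C; it only illustrates comparing a file size against a threshold in bytes (3 MiB is 3145728 bytes).

```shell
#!/bin/sh
# Hypothetical helper: succeed when FILE is at least THRESHOLD bytes,
# i.e. when a payload link would be created under the rule described above.
wants_payload_link() {
  # Arithmetic expansion strips any padding that wc may print.
  size=$(( $(wc -c < "$1") ))
  [ "$size" -ge "$2" ]
}
```

A caller would pass the threshold explicitly, for example `wants_payload_link some-object.file $((3 * 1024 * 1024))`. A very large threshold makes the check fail for every realistic file, which is the "disable by default" effect discussed here.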

About OCI, we should probably have a bigger discussion around it. If we deal with it directly in OSTree, we would end up duplicating a lot of functionality that is already in containers/image. For a third-party project it can be painful to interact with both Golang and C, but having the OCI bits in containers/image has its advantages. OSTree storage can use, and be used with, all the other tools that already use containers/image, such as Buildah.

One feature I particularly like is:

containers/image#392

In addition to copying images to the OSTree storage, we can also copy them back to other storages, such as the Docker engine or a registry.

The OSTree storage part of containers/image, even if used only for system containers, is a direct mapping from OCI to OSTree that can be used for any kind of OCI image. There are no system-container features in the storage part.

giuseppe · 2018-02-12T10:46:41Z

Any feedback on the design?

cgwalters · 2018-02-13T16:20:37Z

> Are you fine with a default value of 3 MiB?

But that's still imposing an unnecessary cost on everyone who is doing "pure ostree" embedded devices, etc. We could have it be a repo flag like

$ cat /ostree/repo/config 
[core]
repo_version=1
mode=bare
contentindex=true

or something. And for Project Atomic systems we'd enable that flag.

Or perhaps have an ostree_repo_enable_content_indexing().

But the configuration entrypoint aside, my main concern here is we're basically doubling the number of files we create; on this workstation:

$ find /ostree/repo/objects/ -name '*.file' | wc -l
150393
$ find /ostree/repo/objects/ |wc -l
169358

And like I said before I'm already not totally happy with how many small metadata files we have. In fact it'd be interesting to investigate not hardlinking for small objects at all. Is it really worth it to do hardlinks for e.g. < 1k files? Less pressure on the filesystem journal to do inode updates?

Another concern is that now pruning objects isn't atomic; it looks like you made the commit process ignore "dangling" payload links, but it should probably delete them so they can be recreated.

cgwalters · 2018-02-13T16:24:56Z

Also big picture, we know this is an issue for SELinux systems, but what about SMACK/AppArmor? IIRC SMACK uses xattrs but I don't know how aggressive it is with distinct labels like SELinux is.

TBH I don't think it's too worth your time digging into SMACK/AppArmor, but there are definitely ostree users who don't use SELinux, and in that case again we don't need the context indexing, right?

giuseppe · 2018-02-13T19:03:54Z

The number of files >3MiB is a small subset of all the files; this is what I have in my local repository:

$ find -name '*.file' | wc -l
9603
$ find -name '*.payload-link' | wc -l
10

But anyway, it still requires support from the file system to be useful, so I will change the default to infinite for now. Or do you prefer a different option?

Yes, I am not happy with the pruning algorithm either. Do you think we should recalculate everything? I didn't implement it this way as it looks quite expensive (prune would be as expensive as fsck). Maybe I can delete them all and recreate only those for files that are still present in the repository after the pruning.

cgwalters · 2018-02-13T19:10:43Z

Oh wow, you're right...that's actually pretty amazing; on my current workstation the ratio of "3MB+" objects is 211/150393 or 0.1%. At a quick glance there's statically linked golang binaries, but also fonts, qemu for some reason, and the kernel/initramfs.

cgwalters · 2018-02-13T19:12:33Z

And on a stock fedora-atomic:fedora/27/x86_64/atomic-host 27.61 (2018-01-17 15:52:47) with no packages layered, the ratio is 40/26030, also at 0.1%.
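The measurements above can be reproduced with a short shell sketch. The function name `large_object_ratio` is hypothetical; it assumes content objects end in `.file`, as in the `find` commands earlier in this thread, and counts those of at least a given size in bytes.

```shell
#!/bin/sh
# large_object_ratio OBJECTS_DIR THRESHOLD_BYTES
# Prints "LARGE/TOTAL" for *.file objects of at least THRESHOLD_BYTES.
large_object_ratio() {
  total=$(find "$1" -name '*.file' | wc -l)
  # "-size +Nc" matches strictly more than N bytes, hence the "- 1".
  large=$(find "$1" -name '*.file' -size "+$(( $2 - 1 ))c" | wc -l)
  echo "$(( large ))/$(( total ))"
}

# Percentages for the two counts quoted in this thread.
awk 'BEGIN { printf "%.2f%% %.2f%%\n", 211 * 100 / 150393, 40 * 100 / 26030 }'
```

The two quoted ratios work out to roughly 0.14% and 0.15%, consistent with the "0.1%" figure given for both systems.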

cgwalters · 2018-02-13T19:19:37Z

OK so there's a higher level issue here too I just thought of: this code only has an effect for commits created locally (as we do when importing OCI). When we're pulling objects directly via libostree (ostree pull, "rpm-ostree jigdo"), the client won't get this data.

So we'd have to either add this to pull (which in archive fetches would double the number of http requests, probably a non-starter), or add it to some lookaside data (messy). We can include it in static deltas easily enough though, and it should be straightforward to teach jigdo how to do it too.

cgwalters · 2018-02-13T19:21:19Z

Though hmm...since we're computing checksums at pull time now anyways usually, we could just do two SHA-256 checksums simultaneously.
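As a rough illustration of computing both checksums in one pass over the data, here is a shell sketch. It is only a stand-in for what libostree would do in C with checksum input streams: the function name `dual_checksums` is hypothetical, and the header is an arbitrary string rather than the real object header.

```shell
#!/bin/sh
# dual_checksums FILE HEADER
# Prints "OBJECT_SUM PAYLOAD_SUM" while reading FILE only once.
dual_checksums() {
  tmp=$(mktemp -d) || return 1
  mkfifo "$tmp/payload"
  # Second hasher: sees only the payload bytes.
  sha256sum < "$tmp/payload" > "$tmp/payload.sum" &
  # First hasher: sees the header followed by the payload. tee copies the
  # payload into the fifo while it streams past.
  object_sum=$({ printf '%s' "$2"; tee "$tmp/payload" < "$1"; } \
               | sha256sum | cut -d' ' -f1)
  wait
  payload_sum=$(cut -d' ' -f1 < "$tmp/payload.sum")
  rm -rf "$tmp"
  echo "$object_sum $payload_sum"
}
```

The point of the design is that the payload hasher is fed from the same stream as the object hasher, so the file contents are read from disk (or the network) a single time.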

giuseppe · 2018-02-15T10:11:58Z

I checked "ostree pull" and this code path also creates the .payload-link. I see this as a local optimization that should not go on the wire. Also, since we allow the file-size threshold to be changed, the client would need to check whether a received .payload-link must be discarded, as it might point to a file bigger than `core.payload-link-threshold`.

I am still chasing down the CI failure, but I've made some changes in the PR:

- If the symlink is dangling, then it is unlinked.
- I've implemented a lookup in the parent repository.
- At pull time we always check whether the link must be created, even for object files that are already present in the repository.
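The first item, unlinking dangling payload links, can be sketched in shell as follows. This is an illustration only, not the C code in the PR; the function name `prune_dangling_payload_links` is hypothetical.

```shell
#!/bin/sh
# prune_dangling_payload_links OBJECTS_DIR
# Removes every *.payload-link symlink whose target no longer exists and
# prints the path of each link it removes.
prune_dangling_payload_links() {
  # "test -e" follows symlinks, so it fails exactly for dangling links.
  find "$1" -type l -name '*.payload-link' \
    ! -exec test -e {} \; -print -exec rm -f {} \;
}
```

Links whose targets still exist are left untouched, so a later add of the same payload can still find them.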

giuseppe · 2018-02-15T12:08:22Z

tests pass again ✌️

cgwalters · 2018-02-16T18:28:57Z

src/libostree/ostree-repo-commit.c

+        file_input = (GInputStream*)checksum_input;
+      else
+        {
+          checksum_payload_input = ot_checksum_instream_new ((GInputStream*)checksum_input, G_CHECKSUM_SHA256);


It's probably worth a comment here like:

/* The payload checksum-input reads from the full object checksum-input; this
 * means it skips the header. */

cgwalters · 2018-02-16T22:48:41Z

src/libostree/ostree-repo-prune.c

@@ -233,6 +260,7 @@ repo_prune_internal (OstreeRepo        *self,
  g_autoptr(GHashTable) reachable_owned = g_hash_table_ref (options->reachable);
  data.reachable = reachable_owned;

+


Spurious extra newline?

dropped in the new version

cgwalters · 2018-02-16T22:49:11Z

> I checked "ostree pull" and this code path also creates the .payload-link.

Yeah, I had it backwards; it's the "trusted" cases that aren't; basically whenever we aren't redoing the SHA-256. So for example ostree pull-local which is used by Anaconda in the default case means that the default install won't have payload links. We'd have to teach the pull-local code to copy them.

rh-atomic-bot · 2018-03-05T16:58:55Z

☔ The latest upstream changes (presumably 733c049) made this pull request unmergeable. Please resolve the merge conflicts.

giuseppe · 2018-03-05T18:16:28Z

@cgwalters are you fine to merge this?

cgwalters · 2018-03-05T22:39:11Z

The test is being skipped right now though: https://s3.amazonaws.com/aos-ci/ghprb/ostreedev/ostree/14c00f4e8527d8b81ffc6d06cca854b3fbd187e8.0.1520269933286777277/artifacts/gdtr-results/libostree_test-payload-link.sh.test.txt

I know our current testing setup is very confusing. I'm working on addressing that (well, some things will be more confusing, others less so) in #1462

Anyways so what you want is to add a case to tests/installed, which are basically pure shell scripts that run as root on a FAH VM and not in a container. I can probably take care of doing this if you prefer.

cgwalters · 2018-03-05T22:40:22Z

The "dev flow" I have for the tests/installed stuff is basically to hack in my dev container, then sync my git repo into the vm, and just run the script there.

it was removed with:

commit 8609cb0
Author: Colin Walters <[email protected]>
Date: Thu Apr 21 15:14:51 2016 -0400

repo: Simplify internal has_object() lookup code

Signed-off-by: Giuseppe Scrivano <[email protected]>

It will be used by successive commits to keep track of the payload checksum for objects stored in the repository. The goal is that files having the same payload but different xattrs can take advantage of reflinks where supported.

Signed-off-by: Giuseppe Scrivano <[email protected]>

giuseppe · 2018-03-05T23:20:26Z

> Anyways so what you want is to add a case to tests/installed, which are basically pure shell scripts that run as root on a FAH VM and not in a container. I can probably take care of doing this if you prefer.

I've pushed a new version with the installed test script. I hope it will pass the CI :-)

cgwalters · 2018-03-06T20:26:40Z

Looks like not, see https://s3.amazonaws.com/aos-ci/ghprb/ostreedev/ostree/7a41a77d356fa518b0f3edb8124f8c7a20315bf2.8.1520324472405726051/output.log

When a new object is added to the repository, create a $PAYLOAD-SHA256.payload-link symlink file as well. The target of the symlink is the checksum of the object that was added to the repository.

Whenever we add a new object file, in addition to looking up whether a file with the same checksum is already present, we also check whether an object with the same payload is in the repository.

If a file with the same payload is already present in the repository, we copy it with `glnx_regfile_copy_bytes`, which internally attempts to create a reflink (ioctl (..., FICLONE, ..)) to the target file if the file system supports it. This makes it possible to have objects that share the payload but have a different inode and xattrs.

By default the payload-link-threshold value is G_MAXUINT64, which disables the feature.

Signed-off-by: Giuseppe Scrivano <[email protected]>
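The flow in this commit message can be modelled with a toy shell sketch. Everything here is an assumption made for illustration: objects live in one flat directory rather than libostree's real layout, the "object checksum" is a stand-in computed over a metadata string plus the payload, the function name `add_object` is hypothetical, and GNU `cp --reflink=auto` takes the place of `glnx_regfile_copy_bytes`.

```shell
#!/bin/sh
# Toy model of a payload-link index. NOT libostree's on-disk format.
sha256_of() { sha256sum | cut -d' ' -f1; }

# add_object OBJDIR FILE METADATA
# Prints "present", "reflinked" or "new" depending on what was found.
add_object() {
  objdir=$1 file=$2 meta=$3
  payload_sum=$(sha256_of < "$file")
  # Stand-in object checksum: metadata (think xattrs) plus payload.
  object_sum=$( { printf '%s\n' "$meta"; cat "$file"; } | sha256_of)
  object="$objdir/$object_sum.file"
  link="$objdir/$payload_sum.payload-link"

  if [ -e "$object" ]; then
    echo present
  elif [ -L "$link" ] && [ -e "$link" ]; then
    # Same payload already stored under another object checksum: copy it,
    # letting cp share extents where the filesystem supports reflinks.
    cp --reflink=auto "$objdir/$(readlink "$link")" "$object"
    echo reflinked
  else
    # First object with this payload (or the old link was dangling).
    cp "$file" "$object"
    ln -sf "$object_sum.file" "$link"
    echo new
  fi
}
```

Adding the same file twice with different metadata strings yields two `.file` objects and a single `.payload-link`, which is the situation where a reflink saves space.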

giuseppe · 2018-03-07T18:21:45Z

@cgwalters finally managed to get all the tests happy again :-)

cgwalters · 2018-03-07T18:27:26Z

tests/installed/itest-payload-link.sh

+cd ${test_tmpdir}
+
+touch a
+if cp --reflink a b; then


I'm debating a bit if we should instead just assert that this works; I mean if we give reflink=1 to XFS it had better support it. But eh. We can tweak that later.
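For reference, the probe-then-skip approach taken by the test can be sketched as a standalone helper. The function name `supports_reflink` is hypothetical and this is not the test script from the PR; it assumes GNU `cp`, where `--reflink=always` fails instead of falling back to a plain copy.

```shell
#!/bin/sh
# supports_reflink DIR
# Succeeds if the filesystem holding DIR accepts a forced reflink copy.
# Scratch files are created in a temporary subdirectory and removed again.
supports_reflink() {
  probe=$(mktemp -d "$1/reflink-probe.XXXXXX") || return 2
  echo data > "$probe/a"
  if cp --reflink=always "$probe/a" "$probe/b" 2>/dev/null; then
    rc=0
  else
    rc=1
  fi
  rm -rf "$probe"
  return "$rc"
}
```

A test that prefers to assert, as suggested above, would simply call `supports_reflink "$dir"` and fail on a non-zero status rather than skipping.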

cgwalters · 2018-03-07T18:28:56Z

Thanks for all of your work on this!

@rh-atomic-bot r+ 6994f22

rh-atomic-bot · 2018-03-07T18:29:02Z

⚡ Test exempted: merge already tested.

It will be used by successive commits to keep track of the payload checksum for objects stored in the repository. The goal is that files having the same payload but different xattrs can take advantage of reflinks where supported.

Signed-off-by: Giuseppe Scrivano <[email protected]>
Closes: #1443
Approved by: cgwalters

When a new object is added to the repository, create a $PAYLOAD-SHA256.payload-link symlink file as well. The target of the symlink is the checksum of the object that was added to the repository.

Whenever we add a new object file, in addition to looking up whether a file with the same checksum is already present, we also check whether an object with the same payload is in the repository.

If a file with the same payload is already present in the repository, we copy it with `glnx_regfile_copy_bytes`, which internally attempts to create a reflink (ioctl (..., FICLONE, ..)) to the target file if the file system supports it. This makes it possible to have objects that share the payload but have a different inode and xattrs.

By default the payload-link-threshold value is G_MAXUINT64, which disables the feature.

Signed-off-by: Giuseppe Scrivano <[email protected]>
Closes: #1443
Approved by: cgwalters

alexlarsson · 2018-03-29T13:23:36Z

This regressed flatpak, see #1524 for fix

giuseppe force-pushed the payload-link branch 4 times, most recently from 6b696c5 to 7eadeb1 (February 2, 2018 15:43)

giuseppe mentioned this pull request Feb 2, 2018: "ostree: new option for the overlay driver" containers/storage#137 (merged)

giuseppe force-pushed the payload-link branch from 7eadeb1 to bb1f3a6 (February 6, 2018 10:11)

giuseppe changed the title from "[RFC] Introduce new type PAYLOAD_LINK" to "Introduce new type PAYLOAD_LINK" Feb 9, 2018

giuseppe force-pushed the payload-link branch 4 times, most recently from 9105146 to 7b4815d (February 14, 2018 14:35)

giuseppe force-pushed the payload-link branch 2 times, most recently from 2bf8fea to 6242cc5 (February 15, 2018 11:22)

cgwalters reviewed Feb 16, 2018

giuseppe force-pushed the payload-link branch from 6242cc5 to 5c6862f (February 18, 2018 17:29)

giuseppe force-pushed the payload-link branch 4 times, most recently from 1dd6833 to 9995c8e (March 3, 2018 18:45)

giuseppe force-pushed the payload-link branch from 9995c8e to 14c00f4 (March 5, 2018 17:11)

giuseppe added 2 commits (March 6, 2018 00:06)

giuseppe force-pushed the payload-link branch from 14c00f4 to 6bb1f57 (March 5, 2018 23:19)

giuseppe force-pushed the payload-link branch from 6bb1f57 to 7a41a77 (March 6, 2018 08:20)

giuseppe force-pushed the payload-link branch 4 times, most recently from c2ddd8f to e9ecbdc (March 7, 2018 10:45)

giuseppe force-pushed the payload-link branch from e9ecbdc to 6994f22 (March 7, 2018 11:22)

cgwalters reviewed Mar 7, 2018

rh-atomic-bot closed this in 418e454 Mar 7, 2018

giuseppe mentioned this pull request Apr 27, 2021: "Enable zstd:chunked support in containers/image" containers/storage#775 (merged)
