Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure all layers are here when tagging, committing, saving, or converting #3435

Merged
merged 6 commits into from
Oct 2, 2024

Conversation

apostasie
Copy link
Contributor

@apostasie apostasie commented Sep 18, 2024

@apostasie apostasie force-pushed the bug-3425-missing-layer branch from 8c55dbd to 82daf62 Compare September 18, 2024 19:23
@@ -57,7 +57,7 @@ func Commit(ctx context.Context, client *containerd.Client, rawRef string, req s
if found.MatchCount > 1 {
return fmt.Errorf("multiple IDs found with provided prefix: %s", found.Req)
}
imageID, err := commit.Commit(ctx, client, found.Container, opts)
imageID, err := commit.Commit(ctx, client, found.Container, opts, options.GOptions)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be better to add GOptions to the commit opts struct, but this is good enough short term.

}
refDomain := distributionref.Domain(named)

// Get a resolver
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should put that away somewhere with the login code refactor. Resolver is repetitive and error prone...

@apostasie apostasie force-pushed the bug-3425-missing-layer branch 2 times, most recently from 58a9dd3 to 6aa644c Compare September 18, 2024 19:36
@apostasie apostasie changed the title Ensure all layers are here when tagging Ensure all layers are here when tagging or committing Sep 18, 2024
@apostasie apostasie marked this pull request as draft September 18, 2024 20:13
@apostasie apostasie changed the title Ensure all layers are here when tagging or committing [WIP] Ensure all layers are here when tagging or committing Sep 18, 2024
@apostasie apostasie force-pushed the bug-3425-missing-layer branch 14 times, most recently from ccdb533 to d2905da Compare September 18, 2024 23:23
@apostasie apostasie marked this pull request as ready for review September 19, 2024 00:11
@apostasie apostasie changed the title [WIP] Ensure all layers are here when tagging or committing Ensure all layers are here when tagging or committing Sep 19, 2024
@apostasie
Copy link
Contributor Author

@AkihiroSuda can you poke the CI?

Comment on lines 69 to 73
err = imgutil.EnsureAllContent(ctx, client, srcName, img.Target, "", options.GOptions)
if err != nil {
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When tagging, do we need to return errors for failed pulls? This might not meet expectations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is a valid concern.
Then if at a later point we try to push it, it will fail…
Not sure how to deal with that…

Copy link
Contributor Author

@apostasie apostasie Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not like any of the code involved in there.
The "single-platform tmp-reduced-platform image" gymnastic / convert intermediary step on push just smells fishy.
Then again, this is containerd design issues on display here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could introduce a new flag called ensure-image for both the tag and commit commands.

When this flag is set, we would attempt to pull any missing layers. If any errors occur during this process, they would be returned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing that is bugging me with the new flag approach is that it would leave the default behavior broken, which I don't think is right.

So, what about this: instead of hard error when the fetch fail, we just print a loud warning but we continue?
wdyt @lingdie ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest push implements this.
We no longer hard error, but warn the user.
PTAL at your convenience @lingdie

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look good to me.

@apostasie apostasie force-pushed the bug-3425-missing-layer branch 2 times, most recently from 297233b to c64c494 Compare September 19, 2024 22:47
@apostasie apostasie requested a review from lingdie September 19, 2024 22:48
@apostasie apostasie force-pushed the bug-3425-missing-layer branch from d5e41e8 to 764a2aa Compare September 25, 2024 04:30
@apostasie
Copy link
Contributor Author

apostasie commented Sep 25, 2024

CI is green (save for the windows regression #3437).

@AkihiroSuda This is how far I will go with this.

I already hate this PR, and there are still glaring issues to fix in all that stuff, especially the whole tmp-reduced-platform contorsions, and overall most things touching platform code (which definitely stems from containerd btw - just staring at stevvooe's 10 years old code and comments in that regard make my eyes bleed).

Anyhow, I still believe this PR here will help a lot, as the issue being fixed has been one of our most pervasive and egregious.

Credit for this fix should go to @lingdie who did the research and correctly identified the underlying containerd issue, then pointed me in the right direction.

PTAL at your convenience.

@apostasie apostasie marked this pull request as ready for review September 25, 2024 05:20
@AkihiroSuda AkihiroSuda added this to the v2.0.0 milestone Sep 25, 2024
@AkihiroSuda AkihiroSuda self-requested a review September 25, 2024 10:52
@@ -55,26 +52,45 @@ func TestKubeCommitPush(t *testing.T) {
cmd := testutil.KubectlHelper(base, "get", "pods", tID, "-o", "jsonpath={ .status.containerStatuses[0].containerID }")
cmd.Run()
containerID = strings.TrimPrefix(cmd.Out(), "containerd://")

// This below is missing configuration to allow for plain http communication
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was not able / interested enough to finish setting up a cluster registry.
This here is a start, and can be completed later on, either by me or someone else.

}

tearDown := func() {
testutil.KubectlHelper(base, "delete", "pod", "-f", tID).Run()
testutil.KubectlHelper(base, "delete", "pod", "--all").Run()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason that escapes me, kubectl was not cleaning up with the pod name

Err: "failed to create a tmp single-platform image",
})
base.Cmd("commit", containerID, "testcommitsave").AssertOK()
base.Cmd("save", "testcommitsave").AssertOK()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This correctly reproduce issue #827 I believe.

)

func TestImageInspectContainsSomeStuff(t *testing.T) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This turned out flaky as I was debugging this new code.
Rewriting against the new toolkit to fix that.

}

func TestImageInspectDifferentValidReferencesForTheSameImage(t *testing.T) {
testutil.DockerIncompatible(t)
nerdtest.Setup()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the other test.

@@ -125,7 +125,8 @@ func TestImagesFilterDangling(t *testing.T) {
testutil.RequiresBuild(t)
testutil.RegisterBuildCacheCleanup(t)
base := testutil.NewBase(t)
base.Cmd("images", "prune", "--all").AssertOK()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (obviously?) was doing nothing and leftovers from other tests were having an impact here.
Because we depend on build here, we cannot use a private namespace, so, making sure we clean house.

@@ -0,0 +1,133 @@
/*
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the underlying issue is impacting a lot of different functions, it did not feel right to scatter it over in different commands.
Suggesting instead to have a new /issues/ folder for these type of stuff (eg: regression testing for complex / multi-command issues)

@@ -23,6 +23,7 @@ readonly root

GO_VERSION=1.23
KIND_VERSION=v0.24.0
CNI_PLUGINS_VERSION=v1.5.1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow when I came up with the Kube tooling, I did not test networking.
We (obviously?) do need cni here for nerdctl to work outside of kube itself.

@@ -31,7 +32,7 @@ import (
func Tag(ctx context.Context, client *containerd.Client, options types.ImageTagOptions) error {
imageService := client.ImageService()
var srcName string
imagewalker := &imagewalker.ImageWalker{
walker := &imagewalker.ImageWalker{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not shadow the package name.

// Ensure all the layers are here: https://github.com/containerd/nerdctl/issues/3425
err = EnsureAllContent(ctx, client, srcName, options.GOptions)
if err != nil {
log.G(ctx).Warn("Unable to fetch missing layers before committing. " +
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On tag, we do not hard error if we fail to retrieve the missing layers.
This is to preserve the previous behavior, in case we are offline.
Warn the user instead, and point it to a special purpose ticket that provides explanations.

@@ -126,6 +127,13 @@ func Commit(ctx context.Context, client *containerd.Client, container containerd
return emptyDigest, err
}

// Ensure all the layers are here: https://github.com/containerd/nerdctl/issues/3425
err = image.EnsureAllContent(ctx, client, baseImg.Name(), globalOptions)
if err != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as tag, we do not hard error on commit.

// Ensure all the layers are here: https://github.com/containerd/nerdctl/issues/3425
err = EnsureAllContent(ctx, client, srcRawRef, options.GOptions)
if err != nil {
return err
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Convert and Save DO hard error, since the operation cannot complete if there is missing content.

@apostasie
Copy link
Contributor Author

Added some inline comments to ease review.

This was referenced Sep 25, 2024
@apostasie
Copy link
Contributor Author

@AkihiroSuda with containerd rc5 just out - it would be nice to have this one in for the next nerdctl rc so that we can get it some soak time.

Copy link
Member

@AkihiroSuda AkihiroSuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@AkihiroSuda AkihiroSuda merged commit b943664 into containerd:main Oct 2, 2024
22 checks passed
@ChrisBr
Copy link

ChrisBr commented Dec 13, 2024

@apostasie we seem to run into this issue when using the GKE image streaming snapshotter. When we commit an image locally (e.g. with docker) and don't push it, the converter tries to pull the image from remote. Is this expected?

Example

> docker run --name foo alpine
> docker commit foo gcr.io/$repo/cbruckmayer:foo

# I can use the image with nerdctl
> nerdctl --address /var/run/docker/containerd/containerd.sock --namespace moby --snapshotter gcfs run -it --network none gcr.io/$repo/cbruckmayer:foo

# However, when trying to convert the image, it will try to pull it from remote which fails with not found

> nerdctl --address /var/run/docker/containerd/containerd.sock --namespace moby --snapshotter gcfs image convert --zstd --oci gcr.io/$repo/cbruckmayer:foo gcr.io/$repo/cbruckmayer:foo-zstd
gcr.io/$repo/cbruckmayer:foo: resolving      |--------------------------------------|
elapsed: 0.1 s                                total:   0.0 B (0.0 B/s)
INFO[0000] fetch failed after status: 404 Not Found      host=gcr.io
FATA[0000] failed to resolve reference "gcr.io/$repo/cbruckmayer:foo": gcr.io/$repo/cbruckmayer:foo: not found
root@docker-daemon-rzhtc:/app# nerdctl --debug-full --address /var/run/docker/containerd/containerd.sock --namespace moby --snapshotter gcfs image convert --zstd --oci gcr.io/$repo/cbruckmayer:foo gcr.io/$repo/cbruckmayer:foo-zstd
DEBU[0000] fetching                                      image="gcr.io/$repo/cbruckmayer:foo"
DEBU[0000] resolving                                     host=gcr.io
DEBU[0000] do request                                    host=gcr.io request.header.accept="application/vnd.docker.distribution.manifest.v2+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.oci.image.index.v1+json, */*" request.header.user-agent=containerd/2.0.0+unknown request.method=HEAD url="https://gcr.io/v2/$repo/cbruckmayer/manifests/foo"
DEBU[0000] fetch response received                       host=gcr.io response.header.accept-ranges=none response.header.content-type=application/json response.header.date="Fri, 13 Dec 2024 13:10:25 GMT" response.header.docker-distribution-api-version=registry/2.0 response.header.server="Docker Registry" response.header.vary=Accept-Encoding response.header.www-authenticate="Bearer realm=\"https://gcr.io/v2/token\",service=\"gcr.io\",scope=\"repository:$repo/cbruckmayer:pull\"" response.header.x-frame-options=SAMEORIGIN response.header.x-xss-protection=0 response.status="401 Unauthorized" url="https://gcr.io/v2/$repo/cbruckmayer/manifests/foo"
DEBU[0000] Unauthorized                                  header="Bearer realm=\"https://gcr.io/v2/token\",service=\"gcr.io\",scope=\"repository:$repo/cbruckmayer:pull\"" host=gcr.io
DEBU[0000] do request                                    host=gcr.io request.header.accept="application/vnd.docker.distribution.manifest.v2+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.oci.image.index.v1+json, */*" request.header.user-agent=containerd/2.0.0+unknown request.method=HEAD url="https://gcr.io/v2/$repo/cbruckmayer/manifests/foo"
DEBU[0000] fetch response received                       host=gcr.io response.header.accept-ranges=none response.header.content-type=application/json response.header.date="Fri, 13 Dec 2024 13:10:25 GMT" response.header.docker-distribution-api-version=registry/2.0 response.header.server="Docker Registry" response.header.vary=Accept-Encoding response.header.x-frame-options=SAMEORIGIN response.header.x-xss-protection=0 response.status="404 Not Found" url="https://gcr.io/v2/$repo/cbruckmayer/manifests/foo"
INFO[0000] fetch failed after status: 404 Not Found      host=gcr.io
FATA[0000] failed to resolve reference "gcr.io/$repo/cbruckmayer:foo": gcr.io/shopify-docker-images/cbruckmayer:foo: not found

@apostasie
Copy link
Contributor Author

@ChrisBr
Unfortunately, yes, this is expected, as it is necessary to mitigate the unforeseen effects of containerd "lazy" pulling.

Though, in this case, maybe we can relax to a soft error if the pull fails.

Do you mind opening a new issue with these details?
I'll look into it ASAP.

@ChrisBr
Copy link

ChrisBr commented Dec 16, 2024

@apostasie #3764.

Thanks for looking into it, really appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment