Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mutate.Time should be lazy #1168

Open
imjasonh opened this issue Nov 5, 2021 · 5 comments
Open

mutate.Time should be lazy #1168

imjasonh opened this issue Nov 5, 2021 · 5 comments
Labels

Comments

@imjasonh
Copy link
Collaborator

imjasonh commented Nov 5, 2021

pkg/v1/mutate.Time sets all timestamps in an image to the given time, typically to make the image more reproducible. For timestamps in layers, it calls layerTime which reads the layer contents and writes them to a buffered tar.Writer. This means that if the layer is huge, we may run out of memory buffering its contents.

We should consider making this a lazy transformation, so that layer mutations aren't made until they're read. We did a similar thing in pkg/v1/cache, where layer contents aren't cached until they're read via Compressed or Uncompressed.

@imjasonh imjasonh added the good first issue Good for newcomers label Nov 5, 2021
@jonjohnsonjr
Copy link
Collaborator

At one point I experimented with this. The API was gross due to streaming layers, but that was so long ago there might be a less naive way to do it: jonjohnsonjr@542d74b

@imjasonh
Copy link
Collaborator Author

imjasonh commented Nov 8, 2021

Yeah I like this Mutation concept. It doesn't help that much with mutate.Time though since that wants to deal with files in the tar archive, and that's still a huge pain to extract from the v1.Image, especially streamily.

Maybe we could expose some adapter that takes a func(tar.Header) (tar.Header, error)?

@jonjohnsonjr
Copy link
Collaborator

For the layerTime, I ended up having a ReadCloser -> ReadCloser: jonjohnsonjr@542d74b#diff-05a233d3122c90ba0cf547c623e3a34194cf56c4ff05cdf414d5a265300b31e6R39

@github-actions
Copy link

github-actions bot commented Feb 7, 2022

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Keep fresh with the 'lifecycle/frozen' label.

@github-actions github-actions bot closed this as completed Mar 9, 2022
@imjasonh imjasonh reopened this Mar 9, 2022
zx96 added a commit to zx96/kaniko that referenced this issue Jan 2, 2023
This change adds a new flag to zero timestamps in layer tarballs without
making a fully reproducible image.

My use case for this is maintaining a large image with build tooling.
I have a multi-stage Dockerfile that generates an image containing
several toolchains for cross-compilation, with each toolchain being
prepared in a separate stage before being COPY'd into the final image.
This is a very large image, and while it's incredibly convenient for
development, making a change as simple as adding one new tool tends to
invalidate caches and force the devs to download another 10+ GB image.

If timestamps were removed from each layer, these images would be mostly
unchanged with each minor update, greatly reducing disk space needed for
keeping old versions around and time spent downloading updated images.

I wanted to use Kaniko's --reproducible flag to help with this, but ran
into issues with memory consumption (GoogleContainerTools#862) and build time (GoogleContainerTools#1960).
Additionally, I didn't really care about reproducibility - I mainly
cared about the layers having identical contents so Docker could skip
pulling and storing redundant layers from a registry.

This solution works around these problems by stripping out timestamps as
the layer tarballs are built. It removes the need for a separate
postprocessing step, and preserves image metadata so we can still see
when the image itself was built.

An alternative solution would be to use mutate.Time much like Kaniko
currently uses mutate.Canonical to implement --reproducible, but that
would not be a satisfactory solution for me until
[issue 1168](google/go-containerregistry#1168)
is addressed by go-containerregistry. Given my lack of Go experience, I
don't feel comfortable tackling that myself, and this seems like a
simple and useful workaround in the meantime.

As a bonus, I believe that this change also fixes GoogleContainerTools#2005 (though that
should really be addressed in go-containerregistry itself).
zx96 added a commit to zx96/kaniko that referenced this issue Apr 22, 2023
This change adds a new flag to zero timestamps in layer tarballs without
making a fully reproducible image.

My use case for this is maintaining a large image with build tooling.
I have a multi-stage Dockerfile that generates an image containing
several toolchains for cross-compilation, with each toolchain being
prepared in a separate stage before being COPY'd into the final image.
This is a very large image, and while it's incredibly convenient for
development, making a change as simple as adding one new tool tends to
invalidate caches and force the devs to download another 10+ GB image.

If timestamps were removed from each layer, these images would be mostly
unchanged with each minor update, greatly reducing disk space needed for
keeping old versions around and time spent downloading updated images.

I wanted to use Kaniko's --reproducible flag to help with this, but ran
into issues with memory consumption (GoogleContainerTools#862) and build time (GoogleContainerTools#1960).
Additionally, I didn't really care about reproducibility - I mainly
cared about the layers having identical contents so Docker could skip
pulling and storing redundant layers from a registry.

This solution works around these problems by stripping out timestamps as
the layer tarballs are built. It removes the need for a separate
postprocessing step, and preserves image metadata so we can still see
when the image itself was built.

An alternative solution would be to use mutate.Time much like Kaniko
currently uses mutate.Canonical to implement --reproducible, but that
would not be a satisfactory solution for me until
[issue 1168](google/go-containerregistry#1168)
is addressed by go-containerregistry. Given my lack of Go experience, I
don't feel comfortable tackling that myself, and this seems like a
simple and useful workaround in the meantime.

As a bonus, I believe that this change also fixes GoogleContainerTools#2005 (though that
should really be addressed in go-containerregistry itself).
zx96 added a commit to zx96/kaniko that referenced this issue Apr 22, 2023
This change adds a new flag to zero timestamps in layer tarballs without
making a fully reproducible image.

My use case for this is maintaining a large image with build tooling.
I have a multi-stage Dockerfile that generates an image containing
several toolchains for cross-compilation, with each toolchain being
prepared in a separate stage before being COPY'd into the final image.
This is a very large image, and while it's incredibly convenient for
development, making a change as simple as adding one new tool tends to
invalidate caches and force the devs to download another 10+ GB image.

If timestamps were removed from each layer, these images would be mostly
unchanged with each minor update, greatly reducing disk space needed for
keeping old versions around and time spent downloading updated images.

I wanted to use Kaniko's --reproducible flag to help with this, but ran
into issues with memory consumption (GoogleContainerTools#862) and build time (GoogleContainerTools#1960).
Additionally, I didn't really care about reproducibility - I mainly
cared about the layers having identical contents so Docker could skip
pulling and storing redundant layers from a registry.

This solution works around these problems by stripping out timestamps as
the layer tarballs are built. It removes the need for a separate
postprocessing step, and preserves image metadata so we can still see
when the image itself was built.

An alternative solution would be to use mutate.Time much like Kaniko
currently uses mutate.Canonical to implement --reproducible, but that
would not be a satisfactory solution for me until
[issue 1168](google/go-containerregistry#1168)
is addressed by go-containerregistry. Given my lack of Go experience, I
don't feel comfortable tackling that myself, and this seems like a
simple and useful workaround in the meantime.
@bh-tt
Copy link

bh-tt commented Oct 23, 2024

I've improved this partially in GoogleContainerTools/kaniko#3347, but I wonder if I should submit the PR here? I see the value in using lazy logic here, that is likely a better method than what I've requested in kaniko. Would you be open to a PR implementing the lazy transformation much like @jonjohnsonjr had?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants