Extract .zip files with recorded file timestamps (#15268) #15269

Open
wants to merge 1 commit into
base: develop2

iskunk
Contributor

@iskunk iskunk commented Dec 13, 2023

Changelog: (Fix): Extract .zip files with recorded file timestamps

Proposed fix for #15268.

Timestamps need to be set after directories have been populated, because directory timestamps are updated whenever a new file is added to them.

Note that the Windows side of things is not yet addressed. I'm not sure what, if anything, needs to be done differently there. If I can get confirmation that the same approach is workable, I can update this change to use the same file_timestamps array for both sides.
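For readers following along, the two-pass idea described above can be sketched roughly as below. This is only an illustration using the standard `zipfile` module, not the code in this PR; the helper name `extract_with_timestamps` is made up for the example.

```python
import os
import time
import zipfile


def extract_with_timestamps(zip_path, dest_dir):
    """Extract zip_path into dest_dir, then restore the recorded mtimes.

    Hypothetical helper for illustration; not part of Conan.
    """
    file_timestamps = []
    with zipfile.ZipFile(zip_path) as z:
        for info in z.infolist():
            extracted_path = z.extract(info, dest_dir)
            # date_time has six fields; mktime() wants nine. The trailing -1
            # lets the C library decide whether DST applies.
            ts = time.mktime(info.date_time + (0, 0, -1))
            file_timestamps.append((extracted_path, ts))
    # Second pass: writing a file into a directory updates the directory's
    # mtime, so timestamps are applied only after everything is on disk.
    for path, ts in file_timestamps:
        os.utime(path, (ts, ts))
    return file_timestamps
```

Note that the recorded times are interpreted as local time here, which is what `time.mktime()` does.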

Member

@memsharded memsharded left a comment

I have read #15268, but I still have some concerns. I need to understand better what would fail because of unzipping with the current timestamps. Unzipping a zip downloaded from GitHub with the OS sets the modified time to the current time; it doesn't set the modified time to anything else. We have had problems in the past because the timestamps from a downloaded zip were in the future, while the current behavior (which seems to be the standard one) of assigning the current time has given no problems so far.

for file_ in zip_info:
    extracted_size += file_.file_size
    print_progress(extracted_size, uncompress_size)
    try:
        z.extract(file_, full_path)
        file_path = os.path.join(full_path, file_.filename)
        ts = time.mktime(file_.date_time + (0, 0, -1))
Member

Why the (0, 0, -1)? This needs a comment.

Contributor Author

file_.date_time provides six timestamp fields, but time.mktime() needs three additional fields (weekday, day of year, and the DST flag) to make a struct_time. The -1 tells mktime() to work out DST itself.

I can add a comment to this effect.
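As a small illustration of the padding described above (a sketch with a hypothetical helper name, not code from the PR), the conversion and its round trip through `time.localtime()` look like this:

```python
import time


def zip_datetime_to_epoch(date_time):
    """Convert a ZipInfo.date_time 6-tuple (local time) to epoch seconds."""
    # Pad (year, month, day, hour, minute, second) with weekday=0 and
    # day-of-year=0 placeholders, plus isdst=-1 so the C library
    # determines whether DST was in effect.
    return time.mktime(tuple(date_time) + (0, 0, -1))


stamp = (2001, 2, 3, 4, 5, 6)
ts = zip_datetime_to_epoch(stamp)
# Converting back to local time recovers the original six fields.
print(time.localtime(ts)[:6] == stamp)
```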

    except Exception as e:
        output.error("Error extract %s\n%s" % (file_.filename, str(e)))
for file_path, ts in file_timestamps:
    os.utime(file_path, (ts, ts))
Member

Why doesn't z.extract() handle the zip timestamp automatically? It doesn't make much sense that this has to be handled manually.

Contributor Author

Beats me. Seems like quite the oversight for the API.

Apparently not even .extractall() gets you that.

@iskunk
Contributor Author

iskunk commented Dec 13, 2023

Hi @memsharded,

I have read #15268, but I still have some concerns. I need to understand better what would fail because of unzipping with the current timestamps.

Reproducible builds. Remember that the SOURCE_DATE_EPOCH mechanism I'm developing uses the latest timestamp present in the source archive.
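To make "the latest timestamp present in the source archive" concrete, here is one way it could be found for a .zip file. This is only a guess at the shape of such a lookup; the actual SOURCE_DATE_EPOCH mechanism is not shown in this thread, and `latest_recorded_time` is a hypothetical name.

```python
import zipfile


def latest_recorded_time(zip_path):
    """Return the newest ZipInfo.date_time 6-tuple, or None for an empty archive."""
    with zipfile.ZipFile(zip_path) as z:
        # 6-tuples of (year, month, day, hour, minute, second) compare
        # lexicographically, which matches chronological order.
        return max((info.date_time for info in z.infolist()), default=None)
```

Because this reads only the archive metadata, the result is the same on every run, regardless of when the archive is extracted.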

Unzipping a zip downloaded from GitHub with the OS sets the modified time to the current time; it doesn't set the modified time to anything else.

The specific source timestamps used are less important than that they be consistent every time the package is built.

We have had problems in the past because the timestamps from a downloaded zip were in the future, while the current behavior (which seems to be the standard one) of assigning the current time has given no problems so far.

We can clamp the timestamps to the current time, but isn't this issue also possible with tarballs? I didn't see any code to guard against it in untargz().

That aside, extracting an archive with current timestamps is definitely not normal, expected behavior. No archiving program that I'm aware of does this by default.
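The clamping mentioned above could be as simple as the following sketch. The function name and the injectable `now` parameter are assumptions made for the example; nothing like this exists in the PR as posted.

```python
import time


def clamp_timestamp(ts, now=None):
    """Return ts unchanged, unless it lies in the future relative to now."""
    if now is None:
        now = time.time()
    # A recorded time later than "now" is replaced by "now", so the
    # extracted files never carry future dates.
    return min(ts, now)
```

The trade-off is that a clamped file gets a run-dependent timestamp, so archives with future dates would still not extract reproducibly.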

@iskunk iskunk force-pushed the feature/zip-file-times branch from 16b797e to 4997dd7 Compare March 8, 2024 20:54
@iskunk
Contributor Author

iskunk commented Mar 8, 2024

Rebased the commit.

@memsharded, do you have any further concerns? This just brings the zip extractor into parity with the tar one; there isn't anything novel here.

@memsharded memsharded changed the base branch from release/2.0 to develop2 March 11, 2024 22:21
@memsharded
Member

Thanks for the ping.

I have reviewed this again, and I think this was not merged yet because it is still a bit concerning to me.
It seems correct and kind of expected, but from my previous experience this is the kind of thing that is likely to break other users for several reasons. For example, zip files with wrong timestamps at the origin result in dates in the future on disk, and then things don't compile because the build detects future dates.

For example from https://stackoverflow.com/questions/9813243/extract-files-from-zip-file-and-retain-mod-date

The ZIP format is ancient and doesn't have any concept of time zone or DST. Thus the use of mktime (inverse localtime) here. Yes, this means that files from a server running on UTC may be in the future when unzipped on a system running in the western hemisphere....

In other words, why is it that the Python zipfile doesn't handle it already? Not even as an opt-in? This is very suspicious to me. I have been programming in Python for a long time, and I have learned that when the Python standard library doesn't do something, there are often good reasons or unexpected issues awaiting.

I am asking for extra feedback from the team.

@memsharded memsharded added this to the 2.2.0 milestone Mar 11, 2024
@iskunk
Contributor Author

iskunk commented Mar 12, 2024

zip files with wrong timestamps at the origin result in dates in the future on disk, and then things don't compile because the build detects future dates.

Couldn't that be an issue in tarballs as well? Guarding against future dates may be beneficial, though in my experience, that's more a protection against user error than anything having to do with security.

The ZIP format is ancient and doesn't have any concept of time zone or DST.

The .zip format is not great, but given that it's what GitHub serves up when you download a repo snapshot---which I've had to do for many projects that don't put out proper releases---it needs to be handled on par with tarballs.

I have learned that when the Python standard library doesn't do something, there are often good reasons or unexpected issues awaiting.

That's a fair point, but keep in mind that the Python folks have a much bigger set of use cases to deal with than we do.

@memsharded
Member

Couldn't that be an issue in tarballs as well? Guarding against future dates may be beneficial, though in my experience, that's more a protection against user error than anything having to do with security.

The problem is that this will happen. Conan usage is very large, across many different use cases, with users doing very different things. My concern is that there will be users building things from zips created elsewhere that might not be perfectly zipped, but it is working for them, and if we introduce this, their builds will start to fail, and they will be blocked until the next release, in which we revert this. This has happened to us in the past for even more unlikely changes; Hyrum's law hit us a long time ago...

The .zip format is not great, but given that it's what GitHub serves up when you download a repo snapshot---which I've had to do for many projects that don't put out proper releases---it needs to be handled on par with tarballs.

Sure, I am not concerned about GitHub; I am concerned about the myriad other source origins, including zip files created on an archaic workstation decades ago.

That's a fair point, but keep in mind that the Python folks have a much bigger set of use cases to deal with than we do.

Even more concerning to me :) If they haven't made this even an opt-in, this is an even stronger signal!

I have been reviewing this with the team, and we have realized there is already a keep_permissions=False opt-in in the function, for similar reasons. We would agree to move this forward as an opt-in feature instead of a bug fix, with a new keep_timestamps=False argument to the unzip() function, and make the new behavior conditional on it. What do you think?
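A rough sketch of what such an opt-in might look like is below. The function `unzip_sketch` is hypothetical and deliberately much smaller than Conan's real unzip(); only the `keep_timestamps=False` default is taken from the proposal above.

```python
import os
import time
import zipfile


def unzip_sketch(zip_path, dest_dir, keep_timestamps=False):
    """Extract an archive; restore recorded mtimes only when asked to.

    Hypothetical stand-in for illustration, not Conan's unzip().
    """
    pending = []
    with zipfile.ZipFile(zip_path) as z:
        for info in z.infolist():
            path = z.extract(info, dest_dir)
            if keep_timestamps:
                ts = time.mktime(info.date_time + (0, 0, -1))
                pending.append((path, ts))
    # With the default keep_timestamps=False nothing is pending, so files
    # keep the extraction time, which is the existing behavior.
    for path, ts in pending:
        os.utime(path, (ts, ts))
    return [path for path, _ in pending]
```

With the default, existing recipes behave as before; a recipe would pass `keep_timestamps=True` explicitly to get the recorded times.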

@iskunk
Contributor Author

iskunk commented Mar 12, 2024

but it is working for them, and if we introduce this, their builds will start to fail

Builds will fail with timestamps not being preserved---packages using GNU Autotools are an easy example of this. And as I mentioned before, this completely wrecks reproducible builds.

If breakage can occur both ways, I think you're better off handling timestamps in a manner consistent with other tools. If wonky timestamps are a concern, then add a layer of sanitization. If some users are unknowingly depending on incorrect timestamp handling to let their builds work, we're doing them no favors by keeping them ignorant.

Sure, I am not concerned about GitHub; I am concerned about the myriad other source origins, including zip files created on an archaic workstation decades ago.

How often is such a scenario going to arise, compared to GitHub and the like? This is optimizing for an edge case, instead of the common one.

Even more concerning to me :) If they haven't made this even an opt-in, this is an even stronger signal!

You can read the documentation they've provided. The only security issue noted is files being created outside the extraction area (look for "Never extract archives from untrusted sources ..."). If they intended to give a signal, I'm pretty sure they would have written it down.

We would agree to move this forward as an opt-in feature instead of a bug fix, with a new keep_timestamps=False argument to the unzip() function, and make the new behavior conditional on it. What do you think?

If I can enable it in global.conf so I don't have to think about it again, then sure, that would solve my problem. I don't think it's the right path for the project, but I've made my case.

This behavior not being the default is going to be hella awkward for #14480, and reproducible build support generally. If the source archive is a .zip file, then the build will just break until timestamps are preserved. (Again, this is how I first ran into the issue; my alpha implementation of SOURCE_DATE_EPOCH support bailed out at the too-new timestamps.)

And lastly, there is an additional implementation issue with .zip timestamps: these are specified in an undefined local time, not UTC. So to keep things general, you'd need an optional time zone parameter. (See the unzip(1) man page, option -f.) Ideally, also specifiable in global.conf.

@memsharded
Member

Builds will fail with timestamps not being preserved---packages using GNU Autotools are an easy example of this. And as I mentioned before, this completely wrecks reproducible builds.

If breakage can occur both ways, I think you're better off handling timestamps in a manner consistent with other tools. If wonky timestamps are a concern, then add a layer of sanitization. If some users are unknowingly depending on incorrect timestamp handling to let their builds work, we're doing them no favors by keeping them ignorant.

I am afraid this is not how the commitment to stability works. The current implementation has been working without issues, not breaking users, for some time and at large scale. It is not that breakage "can occur", the current implementation doesn't fail. Yes, it might not produce reproducible builds, but it doesn't crash or abort because of some zip archive format or timestamps. If we were getting reports of things like that, we would fix it, but we haven't.

But if we make this change and some user reports breakage, it should be considered a regression and reverted. We will not tell them: "you should fix your zip archives", because maybe they don't control those archives, or for other reasons.

How often is such a scenario going to arise, compared to GitHub and the like?

Well, we have many thousands of users already. I have already been there a few times, and we are the ones getting the calls from upset users because the latest Conan release broke their builds in unexpected ways. Often it will not even be visible on GitHub, because the larger the company, the less likely they are to report through GitHub; they escalate the issue in other ways, even requiring video calls, sometimes taking hours of pair-debugging to understand that their build failed because of that apparently correct change. Maybe it is only 0.1% of cases, but that still means quite a few users being affected and reporting to us. We don't break users just because they are not the majority.

This is optimizing for an edge case, instead of the common one.

It is not an optimization, it is the Python default, which has massive usage. Not always applying the recorded timestamps isn't an optimization, because the current behavior already works correctly at scale. The edge case is adding the extra functionality to keep those timestamps in zip format.

This behavior not being the default is going to be hella awkward for #14480, and reproducible build support generally. If the source archive is a .zip file, then the build will just break until timestamps are preserved. (Again, this is how I first ran into the issue; my alpha implementation of SOURCE_DATE_EPOCH support bailed out at the too-new timestamps.)

One of the consequences of using a package manager is that reproducible builds are not that critical. Once a binary is built from some source for a given configuration, DevOps best practices recommend that it never be built from source again. So while it is great to be able to do reproducible builds, using a package manager that tracks source uniqueness with recipe revisions and binary uniqueness with package-id reduces the need for reproducible builds. I know reproducible builds are important for some users, but beyond all the support tickets and pull requests we process, we also do several calls per week with different users from many companies, from startups to many Fortune 100 companies. And reproducible builds are very far from being a common concern; the vast majority of users don't really care.

I am not saying it is not important, just summarizing the feedback we see at scale from many different users, and we need to evaluate and balance the needs, use cases and constraints of all users.
Considering everything together, we are still convinced that this is not a bug fix but a feature, and since it carries a non-negligible risk of breakage, it has to be introduced as an opt-in; it cannot be the default.

And lastly, there is an additional implementation issue with .zip timestamps: these are specified in an undefined local time, not UTC. So to keep things general, you'd need an optional time zone parameter. (See the unzip(1) man page, option -f.) Ideally, also specifiable in global.conf.

But this would break reproducible builds even more easily, because the conf can change more easily. Defining what is necessary in conanfile.py, on the other hand, would actually freeze those values (through the recipe revision), guaranteeing that exactly the same time is always used, no matter what the configuration is.

If I can enable it in global.conf so I don't have to think about it again, then sure, that would solve my problem. I don't think it's the right path for the project, but I've made my case.

We can consider this, but it doesn't feel right at first sight either: if the value can change, a package could build successfully in one run, because the conf is not yet active, and then fail once the conf is activated.

I'll discuss this possibility with the team.

@iskunk
Contributor Author

iskunk commented Mar 13, 2024

It is frustrating to articulate concrete issues with the current handling of .zip file timestamps, and see them dismissed on grounds of breakage that might occur under extremely narrow circumstances that do not even represent typical Conan usage. It's one thing to bend over backwards to accommodate unusual corner cases, but quite another to do so at the expense of reasonable operation for everyone else.

Yes, it might not produce reproducible builds, but it doesn't crash or abort because of some zip archive format or timestamps. If we were getting reports of things like that, we would fix it, but we haven't.

Do I need to add an example of build breakage to #15268? I can easily put one together.

We will not tell them: "you should fix your zip archives", because maybe they don't control those archives, or for other reasons.

You could provide options for sanitizing/normalizing those archives. But then, how far into the weeds do you want to go to support non-standard files? Is any arbitrary .zip supposed to work? Remember that some .zip files use compression algorithms that are only implemented on Windows, for example. Repacking is going to be necessary in some circumstances.

We don't break users just because they are not the majority.

It would be one thing if the current timestamp handling were documented, or otherwise part of an interface contract. Otherwise, it's within the realm of things that can change, possibly without warning. I've had stuff break when I pull in a new version of Conan. But at least in the last instance, I didn't squawk about it, because I was using internal date-formatting routines---and relying on those was the mistake, not Conan shifting its internals around.

It is not an optimization, it is the Python default,

The "common case" I was referring to was that timestamps are preserved when a source archive is unpacked. Why Python made their zipfile API that way, I do not know, but that is not relevant to the point of whether ignoring timestamps is reasonable in a source-archive context---it is not. The most you could argue here is that I'm the first user to actually complain about it.

One of the consequences of using a package manager is that reproducible builds are not that critical.

How do you know that your package-build toolchain hasn't been compromised? Reproducible builds allow you the option of remaking the infrastructure from the ground up and testing whether you still get the same output. A package manager provides many benefits for a build toolchain, but it doesn't touch the "Trusting Trust" problem.

And reproducible builds are very far from being a common concern; the vast majority of users don't really care.

To be sure, it's ahead of where the industry is right now. One factor in making it mainstream is having tools that facilitate that use case. As an example, we're in a very different place now with Docker et al. compared to "golden image" workflows from back in the day.

But this would break reproducible builds even more easily, because the conf can change more easily. Defining what is necessary in conanfile.py, on the other hand, would actually freeze those values (through the recipe revision), guaranteeing that exactly the same time is always used, no matter what the configuration is.

Note that reproducible builds are only applicable to the specific circumstances of a user or user org. There is no way that a conanfile can be written such that everyone gets the same build, because it doesn't specify the build environment (what container, what compiler, what flags?) that has a significant effect on the outputs. So there is little benefit in having the conanfile nail down the timestamps.

But it is important that whatever timestamps you do get, they should remain consistent over time. A build six months from now should be the same as a build today. You do need to have the same config, of course (like global.conf and the Dockerfile for your build container) or else you're in a completely different world.

I would probably set the config option to always interpret .zip timestamps w.r.t. UTC. The alternative would be to use the time zone of my org's HQ, but that has daylight saving time, and it's conceptually simpler to not deal with that.

(Note that leaving the time zone unconfigured would presumably default to using the system/local time zone, as that is what DOS timestamps officially refer to. But that then leaves the door open to inconsistencies when running on systems in a different time zone, or even misconfigured time zones.)
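To show the difference between the two interpretations discussed above, here is a minimal sketch. The `assume_utc` flag and the function name are invented for the example; a real option would presumably come from a config setting, whose name has not been decided in this thread.

```python
import calendar
import time


def zip_time_to_epoch(date_time, assume_utc=False):
    """Convert a ZipInfo.date_time 6-tuple to epoch seconds.

    assume_utc is a hypothetical switch: True reads the fields as UTC,
    False reads them in the system's local time zone.
    """
    if assume_utc:
        # timegm() is the UTC counterpart of mktime(); the result does not
        # depend on the machine's time zone or DST rules.
        return calendar.timegm(tuple(date_time))
    return time.mktime(tuple(date_time) + (0, 0, -1))
```

With `assume_utc=True` the same archive yields the same epoch value on any machine, while the local-time branch shifts with the time zone of the system doing the extraction.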

We can consider this, but this doesn't feel right at first sight either, as if the value can change, it is actually possible to get a package successfully built in one run, because the conf is not yet active, and then failing when activating the conf.

In my context, if you don't start with a conan config install of a standard org-wide Conan config, nothing will work. Running without said config is unsupported. The reason why I want the option in global.conf is not just because that is a good place to put high-level policy-relevant settings, but also because that file will functionally always be there---hence the "don't have to think about it again" bit.

@memsharded memsharded modified the milestones: 2.2.0, 2.3.0 Mar 18, 2024
@czoido czoido modified the milestones: 2.3.0, 2.4.0 Apr 29, 2024
@memsharded memsharded modified the milestones: 2.4.0, 2.X Jun 2, 2024
3 participants