Error: invalid central directory file header signature when unzipping big files, zipped by mac os #69
Comments
Thanks for the report. Is it possible to provide a download for a zip file that reproduces this problem? You mention it happens with very large files, like 8GB, so perhaps hosting a download would be infeasible. Alternatively, could you provide minimal steps to reproduce the issue? Perhaps something like:
I don't have a Mac, so I'd be forwarding these instructions to a friend who can easily transfer me an 8GB file.
An easy answer is: yes, I have a clue. The Mac Archive Utility is a piece of trash when it comes to the zip file format. This will be the fourth (I believe) compatibility issue that boils down to Apple getting the zip file format embarrassingly wrong. I'm not surprised at all that this is an issue. In fact, I previously thought that Archive Utility was simply unable to handle zip files larger than 4GB due to lack of ZIP64 support. Are you sure the zip files that you create are readable by anyone? Like, can the GUI itself read its own files that it creates that are 8GB in size? If so, then there's some strange format at play that isn't ZIP64 format.
Did some more testing, and it seems to fail somewhere between 4.19 and 4.33 GB. Since I'm trying to zip 4K mp4s, I used ffmpeg to create test files (warning: don't watch them with your volume turned all the way up): https://www.ffmpeg.org/download.html One command creates a 32-second mp4 (4.19 GB); the other creates a 33-second mp4 (4.33 GB). The 32-second one unzips fine with yauzl; the 33-second one (or any larger) fails with the invalid central directory error. I zip them by simply going into Finder on my Mac, right-clicking the file and choosing the compress option. Hope this is enough to reproduce the problem; if not, let me know.
Thanks, I've reproduced the issue. Indeed, the Info-ZIP tools have trouble with it as well. I'll do some research and report back here. (It may take a few weeks.) I'm curious to see what the hell Archive Utility did such that it can read its own malformed zipfile but no one else can.
I've determined what the issue is. It's pretty much the same problem as #60. When the Mac Archive Utility needs to write 32-bit values in the metadata, it doesn't check for overflow and simply writes the lower 32 bits of the value. There are 3 values that get corrupted like this in the case of archiving a file that's over 4 GB in size: the "offset of start of central directory with respect to the starting disk number" in the end of central directory record and the "compressed size" and "uncompressed size" fields of the local file header and central directory file header (and data descriptor, but that doesn't matter too much). This means that the metadata is an outright lie. This is absolutely a bug in Archive Utility, and it's a nasty one too. The correct behavior would have been to use ZIP64 format, but Archive Utility has no idea what that is. There may or may not be any way to correctly and reliably recover from this corruption. I'm still thinking about it.
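To make the overflow described above concrete, here's a minimal illustration (the 8.48 GB figure is a made-up example size, not taken from a real archive): a value larger than 2^32 gets only its low 32 bits written to the header.

```javascript
// Illustration of the 32-bit truncation described above: Archive Utility
// writes only the low 32 bits of any metadata value that exceeds 2^32.
// The size here is a hypothetical example.
const trueUncompressedSize = 8480000000n;             // a file larger than 4 GiB
const truncated = trueUncompressedSize & 0xFFFFFFFFn; // what actually gets written

console.log(truncated);                          // prints 4185032704n
console.log(trueUncompressedSize - truncated);   // prints 4294967296n, i.e. exactly 2^32
```

The lost amount is always a multiple of 2^32, which is what makes recovery heuristics possible at all.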
Thank you for looking into it. Also, I noticed that sometimes the unzip CLI tries to unzip part of the file, but then you lose half of the video file, and it's completely unreliable. Hope you do find a solution, but in the meantime my workaround is just avoiding the Archive Utility. I did send a report to Apple; if they get back to me with any relevant information, I will share that here as well.
Oh. How did you do that? I was unable to find any bug tracker for Archive Utility. If Archive Utility's zipfile implementation were open source, I could even make a pull request fixing their bugs, but I didn't find that either.
I reported it here: https://www.apple.com/feedback/macos.html |
@d-mg Out of interest, what version of Mac OS did you create the faulty zip file on? Absolutely bonkers that Mac OS has such a broken ZIP implementation, especially as it's meant to cater for "creatives" who are the most likely people to be zipping up large (e.g. video) files. @thejoshwolfe Are you saying that Mac OS Archive Utility does not make use of ZIP64 extra fields at all?
If the number of files in the ZIP is not enormous, I guess that "offset of start of central directory with respect to the starting disk number" can be deduced from the total length of the ZIP file and the lower 32 bits of the offset recorded by Mac OS. Each central directory file header is at most 196654 bytes (46 + 65536 + 65536 + 65536). So if the ZIP contains fewer than about 21000 files, the central directory cannot be bigger than 4 GiB, and the low 32 bits are enough to determine the location of the start of the central directory.

That doesn't solve the problem of the "compressed size" and "uncompressed size" (and also, presumably, the "relative offset of local file header") fields also being truncated to 32 bits. But I guess you could use the info that is available to step through the file in 4 GiB jumps to check all possible locations for matching local file headers. Ergh! But do you think this would work?

My work involves receiving a lot of large ZIP files from sources all over the place, and I sadly have no way to prevent them being created with Mac OS Archive Utility (I suspect most are), so I would be motivated to help solve this problem. I also have a massive collection of large ZIP files from varied sources to test an implementation on.
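The deduction described above can be sketched in a few lines (this is an illustration of the idea, not yauzl's actual code; all names are made up): the true offset must equal the recorded low 32 bits plus some multiple of 2^32, and it must lie inside the file.

```javascript
// Enumerate every offset consistent with a truncated 32-bit value:
// recordedLow32 + k * 2^32, for all k such that the result fits in the file.
const TWO_POW_32 = 2n ** 32n;

function candidateOffsets(recordedLow32, fileSize) {
  const candidates = [];
  for (let offset = BigInt(recordedLow32); offset < fileSize; offset += TWO_POW_32) {
    candidates.push(offset);
  }
  return candidates;
}

// For a 9 GiB file whose end-of-central-directory record says "offset 1000",
// the real central directory starts at one of three places, 4 GiB apart:
console.log(candidateOffsets(1000, 9n * 1024n ** 3n));
// prints [ 1000n, 4294968296n, 8589935592n ]
```

Each candidate would then be probed for a valid central-directory signature; when the central directory itself is under 4 GiB, only one candidate can be consistent with the file's total length.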
@overlookmotel I was using High Sierra for this |
I developed a plan with a lot of the same ideas you presented, but it's very difficult to implement. The gist is to track a non-deterministic multistate of possibilities depending on what kind of corruption the zipfile has.

The user would need to explicitly request this complexity for three reasons: I don't want everyone to pay the performance cost of Apple's incompetence/hostility if they don't want to; the user needs to use the library differently when this complexity is enabled; and the complexity can change the correct interpretation of a correct zipfile to an incorrect one in rare cases.

The Mac Archive Utility behaves predictably, albeit incorrectly, when laying out data in the zipfile, so I can check whether the zipfile is deviating from the Archive Utility style. If we know we aren't dealing with Archive Utility, we can drop the complexity and behave normally. However, we can't be certain that we are dealing with Archive Utility until we verify that one of these overflow corruptions has happened. Each case of corruption can be detected heuristically by seeking to the locations that would cause the overflow to give us the value we see, and checking the zipfile data structure signatures.

This is not a perfect solution, because there may be a valid zipfile that perfectly mimics the Archive Utility corruption. That's the main problem with this Archive Utility bug: it creates "valid" zipfiles with the wrong content. So whether we should assume it's corrupted or correct is a guess, and that's one of the reasons the user would have to explicitly ask for this feature to be enabled. The probability of this "false positive" detection happening randomly is low, so it's not the worst thing about this idea.

The worst part is that it overall sucks. It's really complicated, it's going to negatively impact performance at least a little bit for everyone, and it's all due to flagrant incompetence/hostility at Apple being widely adopted.
My motivation is really the limiting factor for implementing a fix for this, but hearing people say that this is important to them really helps. 👍
Yes. I haven't seen the source code, because I believe it's closed source, but all of my experience and experimentation with it suggest that the implementation is oblivious to the ZIP64 format entirely, and apparently also oblivious to the concept of integer overflow and bounds checking.
@thejoshwolfe Thanks for your lengthy reply. I agree that it totally sucks! But, at least for me, it would be really useful to resolve.

Complexity: To my mind, this is the biggest downside. It would make the code considerably more complex. I imagine it would likely require a refactor to split different steps of file parsing into separate functions, so functions can call themselves recursively as they try different possibilities. That's a lot of code churn.

Performance: As far as I can see, I doubt it would impact performance in any significant way for non-Apple ZIP files. The workarounds for the Apple case, I think, only need to kick in if a valid Central Directory Record cannot be found in the location it should be. So the execution path for non-Apple archives could remain the same.

False positives: I hope the possibility of false positives could be reduced to zero, or at least to an infinitesimal level. I hope (but don't know yet) that Archive Utility ZIP files can be "fingerprinted" in some way. For example, maybe they always set the "version made by" field to a particular value which no one else uses. If so, then the wacky heuristics could be disabled if it's not an Archive Utility file - so anything not made by Archive Utility would behave as before, and the only files which would be parsed differently would be ones which at present yauzl can't deal with anyway.

Explicit enabling: I agree this should have to be explicitly enabled by the user.

I would like to do some digging on this and see what I can find. In principle, if (1) I can come up with something half-sensible which doesn't affect performance significantly or risk screwing up valid ZIP files and (2) I am willing to do a lot of the work on implementation, would you likely be willing to accept the downside of additional code complexity in order to support Archive Utility files? (I totally understand if you don't)
@d-mg Thanks for letting me know. |
To be honest, since you asked, probably not. The effort to thoroughly review a pull request of this magnitude would be on par with authoring the code from scratch. So I'd rather just write it myself, if that makes sense. There's an awful lot of careful design and consideration that needs to be done to ensure this change doesn't regress anything already in place, such as performance and API-level corner case behavior (neither of which are very well tested automatically here). But thanks for offering. Your enthusiasm is becoming my enthusiasm, which is the currency that funds this project. I'll revisit this issue in a few weeks after I work through some major non-software issues I'm in the middle of at the moment. Stay tuned for updates.
OK, fair enough. I've got really interested in this, and was looking forward to getting my hands dirty, but hey! I really appreciate that you're willing to implement it. In case it's useful, I did a little digging into how Archive Utility structures ZIP files.

Structure of ZIP files

Structure:
Fingerprinting:
All of the above hold true for ZIP files created both by the graphical Archive Utility and via the command line.

Proviso: I've not checked that many files, and this is all on one machine running Mac OS Sierra 10.12.6. It's possible behavior may vary on different versions.

What does this mean?

The consistencies in structure should make the file easier/more reliable to parse than it might otherwise be: File Header 1 is at byte 0, File Header 2 is immediately after the end of file 1's Data Descriptor, and so on. So to get the correct locations and compressed lengths of each file, it should only be necessary to seek for the Data Descriptor for each file, first in the location where it should be, and then in 4 GiB steps onward from there until the Data Descriptor is found. The Data Descriptor is 16 bytes, so the risk of false positives is very small. Of course, there's no way of knowing the size of the uncompressed data without uncompressing it. Also, hopefully some of these attributes can be used to "fingerprint" Archive Utility's ZIP files. Overall, this makes me hopeful that implementing this will not be too much of a nightmare, and should be pretty robust in terms of avoiding screwing up valid files.

What I'm going to do next

I have a large collection of ZIP files from a bunch of different sources (by accident - just haven't had time to clear them off my drives). Many of these will no doubt have been created with Archive Utility, and many are 4GB+. I also have the unzipped contents of each (unzipped with various different tools). In many cases the unzipped files are in formats which have built-in checksums, so I can be confident they've been unzipped correctly. I plan to run yauzl on all of these files and see what it fails on. That'll give us a corpus of files to test the new version on once Archive Utility workarounds have been implemented. I'll let you know where I get to.

NB I have no idea what the "major non-software issues" you are facing at the moment may be.
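The "seek in 4 GiB jumps" idea above can be sketched as follows (an illustration of the approach, not code from any actual implementation; `readUInt32At` is a hypothetical callback that reads a 32-bit little-endian value at an absolute file offset):

```javascript
// With the compressed size truncated to its low 32 bits, the Data Descriptor
// must sit at dataStart + low32 + k * 2^32 for some k >= 0. Probe each
// candidate for the descriptor signature 0x08074b50 ("PK\x07\x08").
const DATA_DESCRIPTOR_SIG = 0x08074b50;
const STEP = 2n ** 32n;

function findDataDescriptor(readUInt32At, dataStart, truncatedSize, fileSize) {
  for (let offset = dataStart + BigInt(truncatedSize);
       offset + 16n <= fileSize;          // descriptor incl. signature is 16 bytes
       offset += STEP) {
    if (readUInt32At(offset) === DATA_DESCRIPTOR_SIG) {
      return offset;                      // true compressed size = offset - dataStart
    }
  }
  return null;                            // no plausible descriptor found
}
```

Since there are at most a handful of candidates even in a multi-terabyte file, this scan is cheap; the residual risk is a file whose compressed data happens to contain the signature bytes at exactly a candidate offset.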
But just to say, please don't feel the need to reply. I appreciate the attention you've given this issue, and it can wait until you're back in action.
I've done some work on this.

Implementation

I'm afraid I couldn't resist getting into it fully. I've made an implementation that parses faulty Mac Archive Utility ZIP files. It is here: https://github.com/overlookmotel/yauzl-mac It's not published on npm, as I intended it as a proof of concept and testing ground for seeing if this could work. Obviously I hope it'll be useful to you in adding an implementation of this feature to yauzl proper. My implementation is very hacky, as most of yauzl's guts are not exposed on the public API. So, for example, it has to use events-intercept to capture emitted 'entry' events and do some more work (locating the end of the file) before re-emitting the event. The guts of it are in:
It depends on a change to yauzl which exposes more details about the Central Directory. The modifications don't do anything unless explicitly enabled.

Testing

I've tested it out on a collection of 331 random zip files. Most are large (10GB+), and about 150 of them are Mac Archive Utility-created ZIPs which yauzl was not able to unzip. With the implementation linked to above, it can unzip all of the files except 3. All of these I've compared to the result of unzipping on the command line.

Of the 3 failing ZIP files, 2 are seemingly Mac Archive Utility ZIP files, but they have no Central Directory at all. The files end straight after the last Data Descriptor. These are fully corrupt, and I don't think yauzl should attempt to unzip them. I don't know how these files were made, but I wonder if the zipping process errored out before finishing. Weird!

The last failing file is a strange case. It's not a Mac Archive Utility file, but I haven't figured out what's wrong with it yet.

The vast majority of the files I've tested are ZIPs containing fewer than 10 files. However, this implementation also successfully extracts all the files from the ZIP file linked to in #60, which has 71,609 files.

Please let me know what you think when you get a chance.
@overlookmotel awesome! I took a look at your https://github.com/overlookmotel/yauzl-mac , but I didn't quite get to a point where I understand how it works. Those test results are pretty impressive.
Out of curiosity, how are you examining the corrupt zipfiles to determine things like this? I've been using a tool I made (that's undergoing an overhaul and translation into a native language here: https://github.com/thejoshwolfe/hexdump-zip ). Are you just using a general hexdump/viewer or are you using something specifically for zipfiles?
Thanks! As you said, Archive Utility is faulty, but faulty in an entirely predictable way. So actually, it wasn't hard to get it to parse the faulty files. My implementation is more complex than it should be only because it's doing all kinds of weird things to monkey-patch yauzl from the outside. It'd be much cleaner implementing inside yauzl. If you're having trouble disentangling how my code works, please just shout.
I was just using a file viewer (HexFiend on Mac OS). That's a pain, obviously, but I've not had to do much of it. It's easy to spot that the Data Descriptor is at the end of the file as it's so short. hexdump-zip looks good. I'll have a play (once I've got my head around installing zig).
By the way, if you'd prefer I release the Archive Utility parsing capability as a separate npm module, I'd be happy to do that. Or maybe that could be a temporary stopgap if you're not going to have time for this in near future? If so, could you possibly merge #78 to make that possible?
Sorry for the delay. I'm going to work on this issue next week.
I discovered that Archive Utility does not consistently put Data Descriptors at the end of file contents. I can't see a pattern for when they are there and when they aren't.
A lot of effort has gone into making this painless for Linux and Windows. Unfortunately, Mac is the difficult OS to get a Zig compiler on. If you have one of those other OSes, it might be worth downloading the precompiled, dependency-less binary that's auto-built from zig's master branch here: https://ziglang.org/download/
No worries!
The pattern I've seen is:
Symlinks are stored like a file, with the file content being the path the link points to.

Local File Headers:
Central Directory:
Weird that the Extra Fields in the Local File Header and Central Directory don't match! I haven't found any Archive Utility zip files that don't fit this pattern (apart from the 2 I mentioned which have no Central Directory at all). Have you seen files which don't fit this pattern?
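For reference when comparing these headers, the fixed portion of the Local File Header is 30 bytes, laid out per the ZIP specification (APPNOTE.TXT). A minimal parse in Node shows where the 32-bit size fields that Archive Utility truncates actually live (the function name is just for illustration):

```javascript
// Parse the fixed 30-byte Local File Header of a ZIP entry.
// Field offsets follow APPNOTE.TXT; all multi-byte values are little-endian.
function parseLocalFileHeader(buf) {
  if (buf.readUInt32LE(0) !== 0x04034b50) { // "PK\x03\x04"
    throw new Error("not a local file header");
  }
  return {
    versionNeeded:     buf.readUInt16LE(4),
    flags:             buf.readUInt16LE(6),  // bit 3 set => sizes are in the Data Descriptor
    compressionMethod: buf.readUInt16LE(8),  // 8 = deflate
    compressedSize:    buf.readUInt32LE(18), // only the low 32 bits if the file is > 4 GiB
    uncompressedSize:  buf.readUInt32LE(22), // only the low 32 bits if the file is > 4 GiB
    fileNameLength:    buf.readUInt16LE(26),
    extraFieldLength:  buf.readUInt16LE(28), // the Extra Fields discussed above follow the name
  };
}
```

A ZIP64-aware writer would set both size fields to 0xFFFFFFFF and put the real 64-bit sizes in a ZIP64 extra field; Archive Utility instead writes the truncated values directly.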
Ah bugger. I am on Mac!
@thejoshwolfe you can cross compile a macos hexdump-zip binary like this:
I just tried it, it works. Here it is
@andrewrk Unfortunately I can't run that on Mac OS Sierra 10.12.6. It errors. But this isn't the right issue to list the errors etc. Where is? |
This is mentioned in the spec:
This is a struct that stores some unix file info, like mtime, atime, and uid.
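Assuming the struct referred to here is the old Info-ZIP UNIX extra field (header ID 0x5855, "UX" — an assumption on my part, since the quoted spec excerpt didn't survive), its layout after the 4-byte ID/size prefix is: atime (4 bytes), mtime (4 bytes), then optionally uid and gid (2 bytes each, present only in the local-header copy). A small parse sketch:

```javascript
// Parse the old Info-ZIP UNIX extra field (ID 0x5855), assuming that's the
// struct under discussion. Timestamps are unix epoch seconds, little-endian.
function parseInfoZipUnixExtraField(buf) {
  if (buf.readUInt16LE(0) !== 0x5855) throw new Error("not a UX extra field");
  const dataSize = buf.readUInt16LE(2);
  const field = {
    atime: new Date(buf.readUInt32LE(4) * 1000),
    mtime: new Date(buf.readUInt32LE(8) * 1000),
  };
  if (dataSize >= 12) { // uid/gid present only in the local file header's copy
    field.uid = buf.readUInt16LE(12);
    field.gid = buf.readUInt16LE(14);
  }
  return field;
}
```

This size difference between the central-directory copy (8 data bytes) and the local-header copy (12 data bytes) is one legitimate reason the two Extra Fields can fail to match byte-for-byte.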
Hey all, I was wondering if there is any additional work planned on this problem? We are running into a similar issue with these ZIPs from Archive Utility, and it would be great if there was a way to extract them. Happy to help out in any way.
@overlookmotel I gotta admit https://github.com/overlookmotel/yauzl-mac is really hard to read. I wonder if just forking yauzl would have made more sense. Ah well. I believe this is the set of fields we have to check for overflow:
Did I miss any?

One thing I'd really like to get going with this is the ability to test the code without lugging around enormous zip files. I've come up with a plan for a test harness that should fit reasonably into git and hopefully be performant enough to run by default with the normal test suite.

I didn't make as much progress on this tonight as I was hoping, but I'll continue work on this shortly.
Yes, I'm sorry it is hard to read. Probably I should have just forked yauzl. However, I generally avoid forks, as then there's a burden of keeping them up to date, and you'd made it clear that you wouldn't accept a PR as you wanted to implement it yourself (fair enough). Really I was just doing it because I got interested and wanted to solve the problem for the sake of solving it, so I did it as code external to yauzl. I think the fields you list above are the only ones that need checking - except a few others if you try to fingerprint Mac OS ZIP files. My logic with the fingerprinting of Mac OS Archive files is:
It'd be completely legitimate for you to decide that the complexity vs performance trade-off isn't worth it for the common use case and to skip this.

Testing: Yes, I've struggled to come up with a good way of writing tests without having massive ZIP files to test on. Very intrigued as to what solution you've come up with!

Just to say, I am more than happy to help with this. I would happily turn my implementation into a PR, but you've said you don't want that. But if you'd like a second pair of eyes on your implementation, or me to talk you through my spaghetti code, please just say.
Also, when you do have a working implementation, I'd be happy to run it on my large collection of ZIPs to flush out any bugs.
@BenV Just so you know, yesterday I rebased PR #78 (which yauzl-mac depends on) onto latest yauzl master, and I haven't tested again since then. I think it's unlikely, but it's possible that might have introduced a bug. If it has, it'll be a syntax error / total crash kind of bug, not a nasty insidious one that wrecks your ZIP files - so don't panic! But if you find any problems please post an issue and I'll fix asap.
@overlookmotel I attempted to use your library, but ran into some problems because it reads entries before I have a chance to install my own error handler. We have some other invalid zips that have absolute paths, and we want to ignore those entries; unfortunately that no longer works when I use it. I'll play around with it a bit more and see if I can come up with a workable solution, but if you have any ideas of what I should do to move forward, it would be appreciated.
@overlookmotel I was able to get things working by adding a hacky workaround.
Hi @BenV. Could you possibly raise this as an issue at https://github.com/overlookmotel/yauzl-mac/issues ? I don't quite understand what you're saying, so keen to ask you questions and fix it. But I don't want to crowd this thread with talk about something not directly related to yauzl.
See #92
For anyone facing the same problems with Mac OS ZIP files, please see yauzl-promise. v4, just published, is a from-scratch reimplementation of yauzl with a promise-based API, and supports Mac OS ZIP files.
Sorry y'all. This issue completely killed my enthusiasm for this project. I have finally returned, but the cost was that I'm never going to support Mac Archive Utility corrupt zip files. Please let me know if anyone finds a way to file a bug report against Apple, and I will advise people to upvote it or something.
Hi @thejoshwolfe. I'm really sorry. I feel responsible. In the intervening time, I've published a fork, yauzl-promise, which does support Mac OS faulty zips, but also changes the API to one based on promises, and makes some other changes. That's gained a bit of traction, but the download numbers are completely dwarfed by those of your original yauzl. Again, sorry I pushed this issue and contributed to you losing enthusiasm. Very glad you're back.
Thanks @thejoshwolfe (and @overlookmotel too) for supporting/maintaining this open source codebase. Best wishes, Dan
@overlookmotel you are absolutely not responsible for my feelings, haha. You've done great work here. I really appreciate your enthusiasm. If anyone's responsible for my burnout, it's me for not being more empathetic to myself (therapy is great, y'all; I highly recommend therapy.). And the source of the burnout isn't anyone in this community; it's Apple. 🙂
Thanks. That's kind of you. And +1 on therapy!
@thejoshwolfe and @overlookmotel, I wanted to follow up on this thread because I'm using yauzl-mac. @overlookmotel, I mostly wanted to see if you are continuing to maintain it. Unfortunately, it seems like Apple's problems are a reality for anyone who wants to provide unzip functionality in the marketplace. I'm just trying to figure out the best way forward for my project.
Nothing committed because no one has reported any bugs! If there are any new features of yauzl you need, let me know. thejoshwolfe has been very clear that he does not intend to support faulty Mac OS ZIP files, so I'd suggest, if it's a problem for you (as it was for me), use yauzl-promise.
Using the terminal to zip them, there are no problems with unzipping. But when I use the GUI for zipping, and only with bigger files (around 8 GB and up), I get the following error:
Error: invalid central directory file header signature: 0x2d8c9fa4
at /Users/dmg_/projects/test-zipper/node_modules/yauzl/index.js:253:70
at /Users/dmg_/projects/test-zipper/node_modules/yauzl/index.js:622:5
at /Users/dmg_/projects/test-zipper/node_modules/fd-slicer/index.js:32:7
at FSReqWrap.wrapper [as oncomplete] (fs.js:658:17)
Also when running zipinfo I get:
Archive: Archive.zip
Zip file size: 5946143505 bytes, number of entries: 140
warning [Archive.zip]: 4294967296 extra bytes at beginning or within zipfile
(attempting to process anyway)
I did some digging through issues and some searching, and I'm pretty sure these 4294967296 extra bytes (exactly 2^32) mean that yauzl can't find the header of the central directory.
Do you have any clue why zipfiles created by the Mac OS GUI have these extra bytes? I do see it puts some extra '__MACOSX' folders in the zip, and I suspect these have something to do with it, but that's more of a guess.