Rationale behind using bzip2 for wasm-binaries.tbz2 #1235
Comments
No particular reason. As long as we can decompress that archive using a module that is part of python3.6, I think we would happily switch to a different format if there are benefits to be had.
@sbc100 Is the requirement of the codec being bundled with py3.6 so hard? Of course, bzip2 archives could be pulled on systems that cannot or do not support zstandard, but why not add optional support for it on systems that do have
It's not set in stone, but we would rather not add more system dependencies. Would switching to some other format that is built into python still give us some of the benefits which you are after?
Uploading 2 different versions of the archive is possible, but I think it would add some complexity to the upload and downloading process. If you would like to experiment with PRs to emscripten-releases and emsdk, then we could see just how much complexity it would add. (See https://chromium.googlesource.com/emscripten-releases/+/d7a2d5b091de9ea6937bbe6513e055c1bf750e6d/src/build.py#246 and Lines 37 to 39 in 775ba04)
(BTW this is the first time I've ever heard of this zstandard thing..)
Python does support LZMA out of the box. Decompression would of course be slower than zstandard, but still around 2-3 times better than the current solution. It would also save a lot of bandwidth over bzip2.
Actually, looking at the code now, it looks like we call out to the system (Lines 510 to 517 in 775ba04)
That code seems to date back to 2013: fb549cd. I'm guessing that code would "just work" given a
You don't actually need lzma installed on the system. That said, bzip2 is bundled with python and still emsdk does not make use of it, calling whatever is installed on my system instead :).
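To illustrate the point about Python's bundled codecs, here is a minimal sketch of extracting either a bzip2 or an xz tarball with only the standard library. It is not what emsdk does (per the comments above, emsdk calls out to a system tool); the function name `extract_archive` is hypothetical. The `"r:*"` mode asks `tarfile` to detect the compression format itself.

```python
import tarfile


def extract_archive(archive_path, dest_dir):
    """Extract a tar archive using only the Python standard library.

    Mode "r:*" lets tarfile detect the compression transparently,
    so the same call handles .tbz2 (bzip2) and .tar.xz (LZMA) files.
    """
    with tarfile.open(archive_path, "r:*") as archive:
        # Newer Python versions offer extraction filters; use the safer
        # "data" filter when this interpreter provides it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(dest_dir, **kwargs)
```

The `bz2` and `lzma` modules that back this are part of a standard CPython build, so no system `bzip2` or `xz` binary is required, although Python itself must have been built with those modules enabled.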
Doesn't the I guess it depends how tar was built and what version of tar is being used.
Also, doesn't macOS tar also detect the compression format automatically? I assume that it must, otherwise the existing code would not work (since we just run
Indeed, that is right.
Yes, likely, but I don't have any experience with Macs.
It looks like Mac has supported tar.xz files since 10.10 (https://www.ctrl.blog/entry/archive-utility-xz.html). And it turns out we already use the xz archives for the version of Node we ship with emsdk on Linux, and nobody has complained. So I'd be in favor of switching given the size and decompression speed advantages. We would probably have to do some hackery in the emsdk installer if we want it to support getting the bz2 archives for older versions of emscripten and xz for newer versions.
Some results from my initial attempts at switching to
So it seems like we should go for it. We could even look at speeding up compression using the I'm looking into adding the magic to emsdk now (I think we will have to have it check for both filenames).
Yup! Passing the
This is a bit of a hack but I can't think of another way to do it. Basically when downloading SDKs, we first try the new `.xz` extension. If that fails, we fall back to the old `.tbz2`. Both these first two download attempts we run in "silent" mode. If both of them fail we re-run the original request in non-silent mode so that the error message will always contain the original `.xz` extension. See #1235
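The fallback described in that commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual emsdk code: `fetch`, `download_with_fallback`, and the base URL are hypothetical names, and only the order of attempts and the silent/non-silent behaviour come from the description.

```python
import urllib.error
import urllib.request


def fetch(url, silent=False):
    """Return the response body, or None on failure (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read()
    except (urllib.error.URLError, OSError) as err:
        if not silent:
            print("Error: downloading '%s' failed: %s" % (url, err))
        return None


def download_with_fallback(base_url, fetch_fn=fetch, extensions=(".xz", ".tbz2")):
    """Try the new extension first, then the old one, both silently.

    If every attempt fails, repeat the first request non-silently so the
    reported error names the new `.xz` extension.
    """
    for ext in extensions:
        data = fetch_fn(base_url + ext, silent=True)
        if data is not None:
            return data
    return fetch_fn(base_url + extensions[0], silent=False)
```

Passing the fetch function in as a parameter keeps the retry order testable without network access. A call might look like `download_with_fallback("https://example.invalid/sdk-archive")`, where the URL is a placeholder.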
emscripten-releases side CL is landing, let's keep an eye on things. Any appetite to help our Windows users too? The Windows archive has always been the largest (although not just because of the compression).
I'm personally inclined to leave Windows alone, but mostly because I find debugging Windows issues to be a lot harder than macOS or Linux ones.
Closing this for now since we removed the use of bzip2.
Installing Emscripten for the first time on my machine takes approximately 1min 43.79s of wall clock time. 1min 29.44s of this figure is spent in `bzip2 -d` decompressing the `wasm-binaries.tbz2` archive, hence my question: why bzip2?

BWT codecs are not a good choice for the kind of data contained inside the archive. I have run some tests with BWT codecs better than bzip2, such as bzip3, which yields an archive smaller by about 14%, but this is irrelevant because the total time spent in bzip3 (`-dj8`) is still pretty significant: 37.419s. BWT codecs tend to be symmetric, either because of the SACA algorithm or the entropy coding stage. Further, they do not provide any preprocessing capabilities for executables contained within the archive.

As such, I have tested a few LZ codecs. The archive produced by `zstd -9k` lies between bz2 and bz3 in size, at around 330'331'630 bytes, but it is 25 times faster to decompress than bzip2 and 9 times faster to decompress than bzip3, hence using zstandard instead of bzip2 would improve the installation time from 1min 43s to 14s.

bzip3 and zstandard are admittedly still uncommon on Linux machines, but the rather ubiquitous lzma provides an even better ratio, albeit considerably slower, which I have verified using `lzma -9k` and then `lzma -df`: 207'465'837 bytes, almost halving the distribution size (thanks to LZMA's executable code preprocessors, among others), with a decompression time of 35s.

To conclude: using zstandard (or any LZ codec) instead of bzip2 would decrease download sizes by around 10% and speed up the installation process 6 times. Why is bzip2 still used?
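
The arithmetic behind these estimates can be reproduced with a short sketch. It uses only the timings quoted in the post and assumes (this is an assumption, not something measured here) that everything other than decompression takes the same time regardless of codec. The names are hypothetical. Under that assumption the zstd projection comes to roughly 18s, which is in line with the "6 times" speedup in the conclusion.

```python
# Timings reported in the post, in seconds.
TOTAL_INSTALL_S = 103.79    # 1min 43.79s wall clock, using bzip2
BZIP2_DECOMPRESS_S = 89.44  # 1min 29.44s spent in `bzip2 -d`

# Assumption: the non-decompression part of the install is codec-independent.
OTHER_S = TOTAL_INSTALL_S - BZIP2_DECOMPRESS_S

DECOMPRESS_S = {
    "bzip2": BZIP2_DECOMPRESS_S,
    "bzip3": 37.419,                   # measured with `-dj8`
    "zstd": BZIP2_DECOMPRESS_S / 25,   # "25 times faster than bzip2"
    "lzma": 35.0,
}


def projected_install_time(codec):
    """Estimated wall clock install time in seconds for the given codec."""
    return OTHER_S + DECOMPRESS_S[codec]


def speedup(codec):
    """How many times faster the install would be compared to bzip2."""
    return TOTAL_INSTALL_S / projected_install_time(codec)


if __name__ == "__main__":
    for codec in DECOMPRESS_S:
        print("%-6s %7.2fs  x%.2f" % (codec, projected_install_time(codec), speedup(codec)))
```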