Rationale behind using bzip2 for wasm-binaries.tbz2 #1235

kspalaiologos · 2023-06-18T09:48:41Z

Installing Emscripten for the first time on my machine takes approximately 1min 43.79s wall clock time. 1 min 29.44s out of this figure is spent in bzip2 -d decompressing the wasm-binaries.tbz2 archive, hence my question: why bzip2?

BWT codecs are not a good choice for the kind of data contained in the archive. I have run some tests with BWT codecs that outperform bzip2, such as bzip3, yielding an archive smaller by about 14%, but this is irrelevant, as the total time spent in bzip3 (-dj8) is still significant: 37.419s. BWT codecs tend to be symmetric (decompression costs about as much as compression), either because of the suffix array construction or the entropy coding stage. Further, they do not provide any preprocessing capabilities for the executables contained within the archive.

As such, I have tested a few LZ codecs. The archive produced by zstd -9k lies between bz2 and bz3 at around 330'331'630 bytes, but it is 25 times faster to decompress than bzip2 and 9 times faster to decompress than bzip3, hence using zstandard instead of bzip2 would improve the installation time from 1min 43s to 14s.

bzip3 and zstandard are admittedly still uncommon on Linux machines, but the rather ubiquitous lzma provides an even better ratio, albeit with considerably slower decompression than zstandard. I have verified this using lzma -9k and then lzma -df: the result is 207'465'837 bytes, almost halving the distribution size (thanks to LZMA's executable code preprocessors, among others), with a decompression time of 35s.

To conclude: using zstandard (or any LZ codec) instead of bzip2 would decrease download sizes by around 10% and speed up the installation process 6 times. Why is bzip2 still used?


sbc100 · 2023-06-20T02:04:19Z

> Why is bzip2 still used?

No particular reason. As long as we can decompress that archive using a module that is part of python3.6, I think we would happily switch to a different format if there are benefits to be had.

kspalaiologos · 2023-06-20T09:35:05Z

@sbc100 Is the requirement of the codec being bundled with py3.6 so hard? Of course, bzip2 archives could be pulled on systems that can not/do not support zstandard, but why not add optional support for it on systems that do have zstd in PATH, considering that it shortens the installation time by almost an order of magnitude?

sbc100 · 2023-06-20T14:55:20Z

> @sbc100 Is the requirement of the codec being bundled with py3.6 so hard?

It's not set in stone, but we would rather not add more system dependencies.

Would switching to some other format that is built into python still give us some of the benefits which you are after?

> Of course, bzip2 archives could be pulled on systems that can not/do not support zstandard, but why not add optional support for it on systems that do have zstd in PATH, considering that it shortens the installation time by almost an order of magnitude?

Uploading two different versions of the archive is possible, but I think it would add some complexity to the upload and download process. If you would like to experiment with PRs to emscripten-releases and emsdk then we could see just how much complexity it would add. See https://chromium.googlesource.com/emscripten-releases/+/d7a2d5b091de9ea6937bbe6513e055c1bf750e6d/src/build.py#246 and emsdk/emsdk_manifest.json, lines 37 to 39 in 775ba04:

    "linux_url": "https://storage.googleapis.com/webassembly/emscripten-releases-builds/linux/%releases-tag%/wasm-binaries.tbz2",
    "macos_url": "https://storage.googleapis.com/webassembly/emscripten-releases-builds/mac/%releases-tag%/wasm-binaries.tbz2",
    "windows_url": "https://storage.googleapis.com/webassembly/emscripten-releases-builds/win/%releases-tag%/wasm-binaries.zip",

sbc100 · 2023-06-20T14:56:14Z

(BTW this is the first time I've ever heard of this zstandard thing..)

kspalaiologos · 2023-06-20T15:02:30Z

> Would switching to some other format that is built into python still give us some of the benefits which you are after?

Python does support LZMA out of the box. Decompression would of course be slower than zstandard, but still around 2-3 times better than the current solution. It would also save a lot of bandwidth over bzip2.
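For illustration, a minimal sketch of extracting an xz-compressed tarball with only the Python standard library. The `tarfile` module accepts the `r:xz` mode since Python 3.3, provided the interpreter was built with the `lzma` module. The function name `untar_xz` and its `strip` parameter are hypothetical and not part of emsdk.

```python
import os
import tarfile


def untar_xz(source_filename, dest_dir, strip=1):
    """Extract a .tar.xz archive, dropping `strip` leading path components.

    Hypothetical helper for illustration only. Hard-link targets are not
    rewritten here, so archives containing hard links need extra care.
    """
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(source_filename, "r:xz") as tfile:
        members = []
        for member in tfile.getmembers():
            parts = member.name.split("/")[strip:]
            if not parts:
                continue  # this entry is the stripped top-level directory
            member.name = "/".join(parts)
            members.append(member)
        tfile.extractall(dest_dir, members=members)
```

Decompressing in-process avoids depending on an external xz or lzma executable, at the cost of relying on how the Python interpreter was built.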

sbc100 · 2023-06-20T15:18:09Z

Actually, looking at the code now, it looks like we call out to the system tar executable to extract these archives:

emsdk/emsdk.py, lines 510 to 517 in 775ba04:

    # http://pythonicprose.blogspot.fi/2009/10/python-extract-targz-archive.html
    def untargz(source_filename, dest_dir):
      print("Unpacking '" + source_filename + "' to '" + dest_dir + "'")
      mkdir_p(dest_dir)
      returncode = run(['tar', '-xvf' if VERBOSE else '-xf', sdk_path(source_filename), '--strip', '1'], cwd=dest_dir)
      # tfile = tarfile.open(source_filename, 'r:gz')
      # tfile.extractall(dest_dir)
      return returncode == 0

That code seems to date back to 2013: fb549cd

I'm guessing that code would "just work" given a .tar.xz file? (Assuming the host system has an lzma executable that tar can use. I wonder, does the base macOS image include that?)

kspalaiologos · 2023-06-20T15:19:58Z

You don't actually need lzma installed on the system. That said, bzip2 is bundled with python and still emsdk does not make use of it, calling whatever is installed on my system instead :). tar -I zstd -xvf archive.tar.zst and tar -xJf file.pkg.tar.xz could work. GNU Tar detects the compression format automatically, so you can just swap out .bz2 for .xz and nobody running coreutils would notice.

sbc100 · 2023-06-20T15:26:38Z

Doesn't the tar executable fork out to the underlying lzma or zstd or bzip2 executable, and if that is not installed on the system the tar command will fail, right? At least I seem to remember folks reporting that tar can fail when bzip2 is missing.

I guess it depends how tar was built and what version of tar is being used.
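One way to surface this failure mode early is to check whether the helper program is on PATH before invoking tar. The sketch below is an assumption-laden illustration, not emsdk code: the suffix-to-program mapping and the function name `missing_decompressor` are hypothetical, and a tar build that links the compression libraries directly would not need these programs at all.

```python
import shutil

# Assumed mapping from archive suffix to the helper program that tar may
# spawn; the suffixes and the mapping itself are illustrative only.
DECOMPRESSORS = {
    ".tar.xz": "xz",
    ".tbz2": "bzip2",
    ".tar.zst": "zstd",
}


def missing_decompressor(archive_name, which=shutil.which):
    """Return the helper program that is absent from PATH, or None.

    None is also returned for suffixes this sketch does not know about.
    """
    for suffix, tool in DECOMPRESSORS.items():
        if archive_name.endswith(suffix):
            return None if which(tool) else tool
    return None
```

A caller could use the result to print a clearer error message than the one tar produces when the helper is missing.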

sbc100 · 2023-06-20T15:28:04Z

Also, doesn't macOS tar also detect the compression format automatically? I assume that it must, otherwise the existing code would not work (since we just run tar -xvf).

kspalaiologos · 2023-06-23T09:34:47Z

> Doesn't the tar executable fork out to the underlying lzma or zstd or bzip2 executable, and if that is not installed on the system the tar command will fail, right? At least I seem to remember folks reporting that tar can fail when bzip2 is missing.

Indeed, that is right.

> Also, doesn't macOS tar also detect the compression format automatically? I assume that it must, otherwise the existing code would not work (since we just run tar -xvf).

Yes, likely, but I don't have any experience with Macs.

dschuff · 2023-07-01T01:05:32Z

It looks like Mac has supported tar.xz files since 10.10 (https://www.ctrl.blog/entry/archive-utility-xz.html). And it turns out we already use the xz archives for the version of Node we ship with emsdk on Linux, and nobody has complained. So I'd be in favor of switching given the size and decompression speed advantages.

We would probably have to do some hackery in the emsdk installer if we want it to support getting the bz2 archives for older versions of emscripten and xz for newer versions.

sbc100 · 2023-09-20T23:40:22Z

Some results from my initial attempts at switching to .xz (xz vs bzip2):

File size is 25% smaller (242M vs 330M)
Compression time is 3 times slower (4m44 vs 1m31)
Decompression is about 2 times as fast (17s vs 33s)

So it seems like we should go for it. We could even look at speeding up compression using the -T0 flag to xz if that compression time is an issue.

I'm looking into adding the magic to emsdk now (I think we will have to have it check for both filenames).

sbc100 · 2023-09-20T23:53:33Z

Yup! Passing the -T0 flag to xz gets compression time down to 16 seconds on my 56 core desktop (tar -I "xz -T0" -cf wasm-binaries2.tar.xz install/ ), and only sacrificed 1% on size (246M vs 242M).

This is a bit of a hack but I can't think of another way to do it. Basically when downloading SDKs, we first try the new `.xz` extension. If that fails, we fall back to the old `.tbz2`. Both these first two download attempts we run in "silent" mode. If both of them fail we re-run the original request in non-silent mode so that the error message will always contain the original `.xz` extension. See #1235
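The fallback described above could be sketched roughly as follows. This is not the actual emsdk implementation: `download_with_fallback` is a hypothetical name, `download(url, silent)` is an assumed callable that returns a local path on success or `None` on failure, and the `.tar.xz` and `.tbz2` suffixes are assumptions about the archive names.

```python
def download_with_fallback(url, download, new_suffix=".tar.xz", old_suffix=".tbz2"):
    """Try the new archive name, then the legacy one, then retry loudly.

    `download(url, silent)` is a hypothetical callable, not an emsdk API.
    """
    candidates = [url]
    if url.endswith(new_suffix):
        # Derive the legacy URL by swapping only the trailing suffix.
        candidates.append(url[:-len(new_suffix)] + old_suffix)

    # First two attempts run silently so a missing .xz is not reported.
    for candidate in candidates:
        path = download(candidate, silent=True)
        if path is not None:
            return path

    # Everything failed: repeat the original request non-silently so the
    # reported error names the new extension.
    return download(url, silent=False)
```

Injecting the downloader keeps the control flow testable without network access; the cost of the approach is one extra failed request for every older release that only has a `.tbz2` archive.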

dschuff · 2023-09-22T20:44:26Z

emscripten-releases side CL is landing, let's keep an eye on things. Any appetite to help our windows users too? The windows archive has always been the largest (although not just because of the compression).

sbc100 · 2023-09-22T20:51:00Z

I'm personally inclined to leave Windows alone, but mostly because I find debugging Windows issues to be a lot harder than macOS or Linux ones.

sbc100 · 2023-09-26T16:10:00Z

Closing this for now since we removed the use of bzip2


sbc100 mentioned this issue Sep 21, 2023

Switch to .xz by default for SDK downloads #1281

Merged

sbc100 closed this as completed Sep 26, 2023

sbc100 added a commit that referenced this issue Oct 10, 2023

Update file extension used by create_release script

7b604e0

This should have been part of #1235

sbc100 added a commit that referenced this issue Oct 10, 2023

Update file extension used by create_release script

997301a

This should have been part of #1235

sbc100 mentioned this issue Oct 10, 2023

Update file extension used by create_release script #1285

Merged

sbc100 added a commit that referenced this issue Oct 10, 2023

Update file extension used by create_release script (#1285)

8e82384

This should have been part of #1235
