Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding multiple sections for JavaScript to read #128

Open
RReverser opened this issue Sep 6, 2019 · 4 comments
Open

Encoding multiple sections for JavaScript to read #128

RReverser opened this issue Sep 6, 2019 · 4 comments

Comments

@RReverser
Copy link
Member

This discussion originated in WebAssembly/binaryen#2327 and @kripken suggested to discuss with the wider audience to see if there's any consensus.

We want to add a custom tool-specific section(s) that could be easily readable by JavaScript and, in our case, would contain names of exported functions. The JavaScript wrapper would read names of these functions and instrument corresponding exports.

Currently, I'm emitting a same-named custom section (called "asyncify") per each such export. My motivation was:

  1. Same-named custom sections are already allowed.
  2. There is WebAssembly.customSections(module, "asyncify") API that already allows to easily enumerate all of such sections, which suggests that it wasn't a coincidence and that it's okay to duplicate sections when it makes sense.
  3. Gzip / Brotli will already do a great job at deduplicating identifiers, so size shouldn't be a concern either.

Other potential alternatives could be either:

  1. Using some binary encoding for identifiers, but that needs more JS code to decode it back, than the current one-liner.
  2. Using some comma-separated format, but that wouldn't work well with arbitrary characters in export names.
  3. Using some well-established format like JSON, but that requires including JSON encoder in the C++ tool, if we want to properly support arbitrary characters.

Current solution seems to be the best of both worlds in that it's both very easy to encode on C++ side and very easy to decode on JS side, but the only concern is that it might be not very idiomatic to have duplicate sections.

Would love to hear any thoughts.

@jgravelle-google
Copy link

Duplicating the sections feels very weird. It's also probably worse to examine a module with O(N) sections with binary-examining tools like wasm-objdump or llvm's disassembler. We also can't assume gzip will fix our redundancies, because we still need to decode the wasm module into memory.

Given that the only data we want to encode here is a list of strings, JSON seems like overkill. And given that we want to run some JS to instrument these specified exports, we already are adding more code than a one-liner.

So I think given the constraints, a trivial binary format is the right way to go. Custom section preamble, a single LEB count of how many names are in this section, then for each of the names have an LEB for the name length, followed by that many bytes of string data.
That's pretty trivial on both ends, and seems to me the most similar to how the wasm binary format is already structured. I think we could use TextDecoder to make it efficient too.

@RReverser
Copy link
Member Author

Duplicating the sections feels very weird.

On one hand, I somewhat agree about the "feels" part, on another, the fact that it's allowed and even has dedicated WebAssembly.customSections API that returns an array and not a single section, suggests that it's an intentional part of Wasm and could be used for similar situations.

And given that we want to run some JS to instrument these specified exports, we already are adding more code than a one-liner.

One-liner was referring to specifically the decoding part:
WebAssembly.customSections(m, "asyncify").map(b => textDecoder.decode(b))

The actual wrapping is pretty simple in our case to the level where implementing or pulling something for binary decoding would probably end up larger and more complicated than the actual library code, hence JSON as an alternative (since it can at least be decoded natively in JS).

@jgravelle-google
Copy link

suggests that it's an intentional part of Wasm and could be used for similar situations

Agree. I'm just thinking the ratio of section overhead to payload here isn't worth it. Each section will just have a single string name, so for every name we have 11 bytes of extra memory (1 byte for section ID, 1 byte for section length, 1 byte for custom section name length, 8 bytes for "asyncify" string). Ballparking (based on nothing) the average export name length at 10-20 chars, we're using 1.5-2x the space with this scheme. It's possible that that's fine, but the O(N) extra space makes me more nervous than O(1) schemes, whether that's JSON or custom.

One-liner was referring to specifically the decoding part:

Right, I moreso mean "100 bytes of overhead in a 200 byte file is worse than 100 bytes in a 10 MB file". How much work+code is going to matter proportional to the effort+size of the rest of the program.

implementing or pulling something for binary decoding would probably end up larger and more complicated than the actual library code

I think it's pretty trivial if the format is simple enough:

(Pulling and tweaking readLEB from https://github.com/jgravelle-google/wasm-webidl-polyfill/blob/8a88cb2de82313e0287a73945bc32a033abaef73/webIDL.js#L251, which in turn is based on how wabt does it)

const result = [];
WebAssembly.customSections(m, "asyncify").map(section => {
  let idx = 0;
  function readLEB() {
    const bytes = [];
    while (true) {
      const byte = section[idx];
      idx++;
      bytes.push(byte & 0x7f);
      if (!(byte & 0x80)) {
        break;
      }
    }
    // Bytes are stored little-endian, so read them in significance order
    let val = 0;
    for (let i = 0; i < bytes.length; ++i) {
      const byte = bytes[bytes.length - i - 1];
      val <<= 7;
      val |= byte & 0x7f;
    }
    return val;
  }

  // Actual parsing logic
  const countStrings = readLEB();
  for (let i = 0; i < countStrings; ++i) {
    const len = readLEB();

    const b = new Uint8Array(section, idx, len);
    idx += len;
    result.push(textDecoder.decode(b));
  }
});

So because we're talking about bytes, this is 789 characters. Throwing it at a minifier (https://javascript-minifier.com) squishes it down to 340. If this avoids 11 bytes per export, this pays for itself after 33 asyncified exports. This is less zippable, but resident memory impact should be similar.

So it boils down to how many asyncified exports we expect on average, and how much we're willing to pay in the small cases.

FWIW the only reason I think JSON isn't fine here is because of needing to add escaping logic to the generators (I mean it's also 3 bytes per additional name instead of 1...). It's also possible to just not have that logic in the producer and assert that those characters aren't part of the export names. C++ tools won't generate "s as part of a function name by default for example, you'd
need to do something like __attribute__((export_name("\"problematic\""))), so I'm kind of ok with just not supporting that.

@RReverser
Copy link
Member Author

@jgravelle-google These are good points. I agree that the proportion of section name to contents doesn't look great in this case.

As I said, my hope is/was that Gzip/Brotli would alleviate the file size concern, and for decoding most parsers would relatively quickly skip over custom sections.

That said, I'm getting inclined to change it to a single section myself for this case, just want to do some quick measurements on JSON vs LEB128 approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants