
Change ordered binary encoding to fix bad nesting complexity. #1349

Closed
erights opened this issue Nov 5, 2022 · 4 comments · Fixed by #1594
erights commented Nov 5, 2022

At #1260 (comment) @warner explains that the current ordered binary encoding has a bad complexity measure for nested structures, due to the doubling of bracketing escapes with each additional level.

At #1260 (comment) @gibson042 suggests a much better binary encoding without this bad complexity problem. Rather than do this as part of #1260, I'm filing this bug so we can remember to do it later.

Either way, for any such encoding change we will need to manage the transition, much like we managed the transition from capdata to smallcaps.

erights commented Nov 5, 2022

At #1260 (comment) @warner wrote

That's a lot of iterating strings, one character at a time, repeated for every depth of the input object graph, and double that for records. ({ foo: {bar: 'baz'} } is going to see foo walked twice and bar walked four times.) Let's be sure to benchmark this: I suspect this string walk will dominate the encoding time.

The alternative might be a length-prefix on each component string, which also avoids the exponential escape-the-escape-character penalty. Even if the caller (and the other layers of protocol they might be using) manages to avoid calling marshal.serialize twice on the same data, the fact that encodeArray needs to escape the output of encodePassable means that even e.g. [ [ 1, 2 ] ] is going to invoke the \u0001 escape, and [ [...Array(101)] ] is going to have 100 \u0001\u0000 sequences, and [ [ [...Array(101)] ] ] will have 200, and [ [ [ [...Array(101)] ] ] ] will have 400, etc. Every record in the nesting will cost double.
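Warner's doubling argument can be made concrete with a toy escape-based encoder (illustrative names only, not the real encodePassable): each array escapes the \u0000/\u0001 characters in its children's already-escaped encodings, so the escape count roughly doubles with every added level of nesting.

```javascript
// Toy stand-in for the old escape-based encoding (strings and arrays
// only; names are illustrative, not the real library API). Each array
// escapes \u0000/\u0001 in its children's encodings with a \u0001
// prefix, then appends a \u0000 terminator per element.
const escapeChild = s => s.replace(/[\u0000\u0001]/g, c => `\u0001${c}`);
const encode = p =>
  Array.isArray(p)
    ? `[${p.map(el => `${escapeChild(encode(el))}\u0000`).join('')}`
    : `s${p}`;

const countEscapes = s => (s.match(/\u0001/g) || []).length;

// Wrap a 3-element array in successive singleton arrays and watch the
// escape count roughly double with each added level of nesting.
const counts = [];
let v = ['x', 'y', 'z'];
for (let i = 0; i < 4; i += 1) {
  v = [v];
  counts.push(countEscapes(encode(v)));
}
// counts grows roughly 2x per level: [3, 10, 25, 56]
```

The growth is geometric because every pre-existing escape and terminator picks up a fresh \u0001 prefix at each enclosing level.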

erights commented Nov 5, 2022

At #1260 (comment) @gibson042 wrote

I believe this could be fixed rather conveniently by moving the escaping logic from array encoding/decoding to string encoding/decoding and refactoring array decoding to handle depth. For example, assuming the array sigil is updated from "[" to "~" for control-character adjacency:

  • Define string escapes like const escapes = { __proto__: null, "\0": "\u{0001}0", "\x01": "\u{0001}1", "~": "\u{007F}0", "\x7F": "\u{007F}1" } (or to squeeze out every bit of performance, replace that object with an array like escapes = Array(128).fill().map((_, codePoint) => escapes[String.fromCodePoint(codePoint)]) or its sparse equivalent).
  • Encode strings like `s${passable.replaceAll(/[\0\x01~\x7F]/g, c => escapes[c])}`.
  • Decode arrays like
    const decodeArray = (encoded, decodePassable) => {
      encoded.startsWith('~') || assert.fail(X`invalid array encoding: ${encoded}`);
      const tail = encoded.slice(1);
      let elements = [];
      let depth = 0;
      let nextIndex = 0;
      let currentElementIndex = 0;
      for (let { 0: c, index } of tail.matchAll(/[\0~]/g)) {
        if (c === '~') {
          // This is the start of a nested array.
          depth += 1;
        } else {
          // This is an element terminator.
          if (index === nextIndex) {
            // A terminator after ~ or another terminator ends the deepest array.
            depth -= 1;
            (depth >= 0) ||
              assert.fail(X`unexpected element terminator: ${encoded.slice(0, index + 2)}`);
          }
          if (depth === 0) {
            // We have a complete top-level element.
            elements.push(decodePassable(tail.slice(currentElementIndex, index)));
            currentElementIndex = index + 1;
          }
        }
        // Advance the index.
        nextIndex = index + 1;
      }
      (depth === 0) || assert.fail(X`unterminated encoded array: ${encoded}`);
      (nextIndex === tail.length) ||
        assert.fail(X`unterminated encoded array element: ${tail.slice(currentElementIndex)}`);
      return harden(elements);
    };
    (and note that a further optimization is possible by decoding nested arrays into a cache rather than discarding their contents after syntax validation and thereby forcing repeat work for each increased level of depth)
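The encoding side of this scheme is not spelled out in the comment; a minimal sketch (strings and arrays only, with illustrative helper names) that reproduces the sample encodings would look like:

```javascript
// Sketch of the encoder matching decodeArray above (strings and arrays
// only; the real encodePassable covers more passable types). Escaping
// lives entirely in string encoding, so array encoding never re-escapes
// its children's output.
const escapes = {
  __proto__: null,
  '\u0000': '\u00010',
  '\u0001': '\u00011',
  '~': '\u007F0',
  '\u007F': '\u007F1',
};
const encodeString = str =>
  `s${str.replaceAll(/[\u0000\u0001~\u007F]/g, c => escapes[c])}`;
const encodeArray = (arr, encodePassable) =>
  `~${arr.map(el => `${encodePassable(el)}\u0000`).join('')}`;
const encodePassable = p =>
  Array.isArray(p) ? encodeArray(p, encodePassable) : encodeString(p);
```

With this shape, each extra level of nesting costs a constant two characters ("~" plus a terminator) instead of doubling the escapes.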

Sample encodings (in which U+007F is represented by "␡"):

| Input | Output |
| --- | --- |
| `[]` | `"~"` |
| `[[]]` | `"~~\u0000"` |
| `'foo'` | `"sfoo"` |
| `'\0 \x01 ~ \x7F'` | `"s\u00010 \u00011 ␡0 ␡1"` |
| `['\0']` | `"~s\u00010\u0000"` |
| `[['\0']]` | `"~~s\u00010\u0000\u0000"` |
| `['~']` | `"~s␡0\u0000"` |
| `[['~']]` | `"~~s␡0\u0000\u0000"` |
| `{ foo: 'bar' }` | `"(~~sfoo\u0000\u0000~sbar\u0000\u0000"` |

erights commented Nov 5, 2022

At #1260 (comment) @gibson042 elaborated

And we can also reduce the size of JSON representations by abandoning characters between U+0000 and U+001F (inclusive), which require "\u…" escape sequences except for those with special single-character escapes like "\t".

This would look something like changing the array element terminator from U+0000 to "&" (updating string escapes to include "&" → "%1" and "%" → "%0") and—because there is a requirement that the terminator sorts before anything else—the error sigil from "!" to "*" (assuming we don't care about the relative sort order of types w.r.t. each other).

| Input | Output |
| --- | --- |
| `[]` | `"~"` |
| `[[]]` | `"~~&"` |
| `'foo'` | `"sfoo"` |
| `'\0 \x01 % & ~ \x7F'` | `"s\u0000 \u0001 %0 %1 ␡0 ␡1"` |
| `['\0']` | `"~s\u0000&"` |
| `[['\0']]` | `"~~s\u0000&&"` |
| `['~']` | `"~s␡0&"` |
| `[['~']]` | `"~~s␡0&&"` |
| `[['&']]` | `"~~s%1&&"` |
| `{ foo: 'bar' }` | `"(~~sfoo&&~sbar&&"` |
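A sketch of the encoder under this compact variant (same caveats as before: strings and arrays only, illustrative names; records and other passable types are out of scope):

```javascript
// Compact-variant encoder sketch: "&" terminates array elements, so
// "&" and "%" are escaped in strings ("%1"/"%0") while \u0000/\u0001
// no longer need escaping; the "~"/\u007F escapes are unchanged.
const escapes = {
  __proto__: null,
  '%': '%0',
  '&': '%1',
  '~': '\u007F0',
  '\u007F': '\u007F1',
};
const encodePassable = p =>
  Array.isArray(p)
    ? `~${p.map(el => `${encodePassable(el)}&`).join('')}`
    : `s${p.replaceAll(/[%&~\u007F]/g, c => escapes[c])}`;
```

Because every emitted character is now a printable ASCII character (for printable inputs), the JSON representation of an encoding needs no "\u…" escape sequences.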

kriskowal commented
@gibson042 How close is this to completion? Should the title read “compact-ordered” now?

@kriskowal kriskowal added the kriskowal-review-2024-01 Issues that kriskowal wants to bring to the attention of the team for review as of January, 2024 label Jan 10, 2024
@aj-agoric aj-agoric removed the kriskowal-review-2024-01 Issues that kriskowal wants to bring to the attention of the team for review as of January, 2024 label Jan 31, 2024