
Change ordered binary encoding to fix bad nesting complexity. #1349

Closed
erights opened this issue Nov 5, 2022 · 4 comments · Fixed by #1594
erights commented Nov 5, 2022

At #1260 (comment) @warner explains that the current ordered binary encoding has a bad complexity measure for nested structures, due to the doubling of bracketing escapes with each additional level.

At #1260 (comment) @gibson042 suggests a much better binary encoding without this bad complexity problem. Rather than do this as part of #1260, I'm filing this bug so we can remember to do it later.

Either way, for any such encoding change we will need to manage the transition, much like we managed the transition from capdata to smallcaps.

erights commented Nov 5, 2022

At #1260 (comment) @warner wrote

That's a lot of iterating strings, one character at a time, repeated for every depth of the input object graph, and double that for records. ({ foo: {bar: 'baz'} } is going to see foo walked twice and bar walked four times.) Let's be sure to benchmark this: I suspect this string walk will dominate the encoding time.

The alternative might be a length-prefix on each component string, which also avoids the exponential escape-the-escape-character penalty. Even if the caller (and the other layers of protocol they might be using) manages to avoid calling marshal.serialize twice on the same data, the fact that encodeArray needs to escape the output of encodePassable means that even e.g. [ [ 1, 2 ] ] is going to invoke the \u0001 escape, and [ [...Array(101)] ] is going to have 100 \u0001\u0000 sequences, and [ [ [...Array(101)] ] ] will have 200, and [ [ [ [...Array(101)] ] ] ] will have 400, etc. Every record in the nesting will cost double.
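Warner's doubling argument can be made concrete with a toy escape-based encoder (illustrative names only, not the real encodePassable): each array escapes the \u0000/\u0001 characters in its children's already-escaped encodings, so the escape count roughly doubles with every added level of nesting.

```javascript
// Toy stand-in for the old escape-based encoding (strings and arrays
// only; names are illustrative, not the real library API). Each array
// escapes \u0000/\u0001 in its children's encodings with a \u0001
// prefix, then appends a \u0000 terminator per element.
const escapeChild = s => s.replace(/[\u0000\u0001]/g, c => `\u0001${c}`);
const encode = p =>
  Array.isArray(p)
    ? `[${p.map(el => `${escapeChild(encode(el))}\u0000`).join('')}`
    : `s${p}`;

const countEscapes = s => (s.match(/\u0001/g) || []).length;

// Wrap a 3-element array in successive singleton arrays and watch the
// escape count roughly double with each added level of nesting.
const counts = [];
let v = ['x', 'y', 'z'];
for (let i = 0; i < 4; i += 1) {
  v = [v];
  counts.push(countEscapes(encode(v)));
}
// counts grows roughly 2x per level: [3, 10, 25, 56]
```

The growth is geometric because every pre-existing escape and terminator picks up a fresh \u0001 prefix at each enclosing level.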

erights commented Nov 5, 2022

At #1260 (comment) @gibson042 wrote

I believe this could be fixed rather conveniently by moving the escaping logic from array encoding/decoding to string encoding/decoding and refactoring array decoding to handle depth. For example, assuming the array sigil is updated from "[" to "~" for control-character adjacency:

  • Define string escapes like const escapes = { __proto__: null, "\0": "\u{0001}0", "\x01": "\u{0001}1", "~": "\u{007F}0", "\x7F": "\u{007F}1" } (or to squeeze out every bit of performance, replace that object with an array like escapes = Array(128).fill().map((_, codePoint) => escapes[String.fromCodePoint(codePoint)]) or its sparse equivalent).
  • Encode strings like `s${passable.replaceAll(/[\0\x01~\x7F]/g, c => escapes[c])}`.
  • Decode arrays like
    const decodeArray = (encoded, decodePassable) => {
      encoded.startsWith('~') || assert.fail(X`invalid array encoding: ${encoded}`);
      const tail = encoded.slice(1);
      let elements = [];
      let depth = 0;
      let nextIndex = 0;
      let currentElementIndex = 0;
      for (let { 0: c, index } of tail.matchAll(/[\0~]/g)) {
        if (c === '~') {
          // This is the start of a nested array.
          depth += 1;
        } else {
          // This is an element terminator.
          if (index === nextIndex) {
            // A terminator after ~ or another terminator ends the deepest array.
            depth -= 1;
            (depth >= 0) ||
              assert.fail(X`unexpected element terminator: ${encoded.slice(0, index + 2)}`);
          }
          if (depth === 0) {
            // We have a complete top-level element.
            elements.push(decodePassable(tail.slice(currentElementIndex, index)));
            currentElementIndex = index + 1;
          }
        }
        // Advance the index.
        nextIndex = index + 1;
      }
      (depth === 0) || assert.fail(X`unterminated encoded array: ${encoded}`);
      (nextIndex === tail.length) ||
        assert.fail(X`unterminated encoded array element: ${tail.slice(currentElementIndex)}`);
      return harden(elements);
    };
    (and note that a further optimization is possible by decoding nested arrays into a cache rather than discarding their contents after syntax validation and thereby forcing repeat work for each increased level of depth)
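The encoding side of this scheme is not spelled out in the comment; a minimal sketch (strings and arrays only, with illustrative helper names) that reproduces the sample encodings would look like:

```javascript
// Sketch of the encoder matching decodeArray above (strings and arrays
// only; the real encodePassable covers more passable types). Escaping
// lives entirely in string encoding, so array encoding never re-escapes
// its children's output.
const escapes = {
  __proto__: null,
  '\u0000': '\u00010',
  '\u0001': '\u00011',
  '~': '\u007F0',
  '\u007F': '\u007F1',
};
const encodeString = str =>
  `s${str.replaceAll(/[\u0000\u0001~\u007F]/g, c => escapes[c])}`;
const encodeArray = (arr, encodePassable) =>
  `~${arr.map(el => `${encodePassable(el)}\u0000`).join('')}`;
const encodePassable = p =>
  Array.isArray(p) ? encodeArray(p, encodePassable) : encodeString(p);
```

With this shape, each extra level of nesting costs a constant two characters ("~" plus a terminator) instead of doubling the escapes.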

Sample encodings (in which U+007F is represented by "␡"):

| Input | Output |
| --- | --- |
| `[]` | `"~"` |
| `[[]]` | `"~~\u0000"` |
| `'foo'` | `"sfoo"` |
| `'\0 \x01 ~ \x7F'` | `"s\u00010 \u00011 ␡0 ␡1"` |
| `['\0']` | `"~s\u00010\u0000"` |
| `[['\0']]` | `"~~s\u00010\u0000\u0000"` |
| `['~']` | `"~s␡0\u0000"` |
| `[['~']]` | `"~~s␡0\u0000\u0000"` |
| `{ foo: 'bar' }` | `"(~~sfoo\u0000\u0000~sbar\u0000\u0000"` |

erights commented Nov 5, 2022

At #1260 (comment) @gibson042 elaborated

And we can also reduce the size of JSON representations by abandoning characters between U+0000 and U+001F (inclusive), which require "\u…" escape sequences except for those with special single-character escapes like "\t".

This would look something like changing the array element terminator from U+0000 to "&" (updating string escapes to include "&" → "%1" and "%" → "%0") and—because there is a requirement that the terminator sorts before anything else—the error sigil from "!" to "*" (assuming we don't care about the relative sort order of types w.r.t. each other).

| Input | Output |
| --- | --- |
| `[]` | `"~"` |
| `[[]]` | `"~~&"` |
| `'foo'` | `"sfoo"` |
| `'\0 \x01 % & ~ \x7F'` | `"s\u0000 \u0001 %0 %1 ␡0 ␡1"` |
| `['\0']` | `"~s\u0000&"` |
| `[['\0']]` | `"~~s\u0000&&"` |
| `['~']` | `"~s␡0&"` |
| `[['~']]` | `"~~s␡0&&"` |
| `[['&']]` | `"~~s%1&&"` |
| `{ foo: 'bar' }` | `"(~~sfoo&&~sbar&&"` |
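A sketch of the encoder under this compact variant (same caveats as before: strings and arrays only, illustrative names; records and other passable types are out of scope):

```javascript
// Compact-variant encoder sketch: "&" terminates array elements, so
// "&" and "%" are escaped in strings ("%1"/"%0") while \u0000/\u0001
// no longer need escaping; the "~"/\u007F escapes are unchanged.
const escapes = {
  __proto__: null,
  '%': '%0',
  '&': '%1',
  '~': '\u007F0',
  '\u007F': '\u007F1',
};
const encodePassable = p =>
  Array.isArray(p)
    ? `~${p.map(el => `${encodePassable(el)}&`).join('')}`
    : `s${p.replaceAll(/[%&~\u007F]/g, c => escapes[c])}`;
```

Because every emitted character is now a printable ASCII character (for printable inputs), the JSON representation of an encoding needs no "\u…" escape sequences.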

kriskowal commented
@gibson042 How close is this to completion? Should the title read “compact-ordered” now?

@kriskowal kriskowal added the kriskowal-review-2024-01 Issues that kriskowal wants to bring to the attention of the team for review as of January, 2024 label Jan 10, 2024
@aj-agoric aj-agoric removed the kriskowal-review-2024-01 Issues that kriskowal wants to bring to the attention of the team for review as of January, 2024 label Jan 31, 2024