WIP feat(patterns): pattern-based compression take2 #1584
Draft: erights wants to merge 1 commit into markm-prepare-for-extended-matchers from markm-pattern-based-compression-2
+1,053
−15
erights force-pushed the markm-pattern-based-compression-2 branch from 241b2d3 to f57ac4b on May 10, 2023 06:44
erights force-pushed the markm-pattern-based-compression-2 branch from f57ac4b to 533d62a on May 20, 2023 21:45
erights force-pushed the markm-pattern-based-compression-2 branch from 533d62a to 7ce2d16 on June 6, 2023 03:22
erights force-pushed the markm-pattern-based-compression-2 branch from 7ce2d16 to 1025466 on August 8, 2023 02:23
erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 18db466 to accc77c on August 8, 2023 02:36
erights force-pushed the markm-tag-guards-2 branch 3 times, most recently from b05871a to 2a13b3d on August 9, 2023 02:27
erights force-pushed the markm-pattern-based-compression-2 branch from accc77c to 2e6810f on August 9, 2023 02:34
erights force-pushed the markm-type-guards branch from a0170df to 505f81f on August 15, 2023 22:53
erights force-pushed the markm-pattern-based-compression-2 branch from 2e6810f to 99b58d6 on August 15, 2023 23:02
erights force-pushed the markm-type-guards branch from 505f81f to c2cd034 on August 21, 2023 22:48
erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 282fd46 to b77b6f7 on August 28, 2023 05:22
erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from be5d3aa to 3a169ed on August 30, 2023 01:23
erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 7125ac7 to 061c7e6 on September 16, 2023 02:45
erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 5497b03 to ce825a7 on September 26, 2023 03:13
erights force-pushed the markm-prepare-for-extended-matchers branch from 9f14fe9 to 7c42f56 on June 9, 2024 20:44
erights force-pushed the markm-pattern-based-compression-2 branch from e1eb82d to 1c9dc8e on June 9, 2024 20:44
erights force-pushed the markm-prepare-for-extended-matchers branch from 7c42f56 to 2d20d8e on June 13, 2024 13:31
erights force-pushed the markm-pattern-based-compression-2 branch from 1c9dc8e to bb79e79 on June 13, 2024 13:32
erights force-pushed the markm-prepare-for-extended-matchers branch from 2d20d8e to f013614 on June 22, 2024 03:34
erights force-pushed the markm-pattern-based-compression-2 branch from bb79e79 to c079763 on June 22, 2024 03:35
erights force-pushed the markm-prepare-for-extended-matchers branch from f013614 to 92befa7 on July 3, 2024 00:22
erights force-pushed the markm-pattern-based-compression-2 branch from c079763 to 7af6f89 on July 3, 2024 00:23
erights force-pushed the markm-prepare-for-extended-matchers branch from 92befa7 to b4b09cd on July 13, 2024 23:06
erights force-pushed the markm-pattern-based-compression-2 branch from 7af6f89 to c6d0e20 on July 13, 2024 23:08
erights force-pushed the markm-prepare-for-extended-matchers branch from b4b09cd to a71dd8f on July 22, 2024 01:11
erights force-pushed the markm-pattern-based-compression-2 branch from c6d0e20 to 5566832 on July 22, 2024 01:11
erights force-pushed the markm-prepare-for-extended-matchers branch from a71dd8f to 552cdca on August 3, 2024 00:18
erights force-pushed the markm-pattern-based-compression-2 branch from 5566832 to da664f9 on August 3, 2024 00:18
erights force-pushed the markm-prepare-for-extended-matchers branch from 552cdca to b6ab0e1 on August 13, 2024 17:38
erights force-pushed the markm-pattern-based-compression-2 branch from da664f9 to ce1dac5 on August 13, 2024 17:39
erights force-pushed the markm-prepare-for-extended-matchers branch from b6ab0e1 to bd279f6 on August 14, 2024 20:53
erights force-pushed the markm-pattern-based-compression-2 branch from ce1dac5 to a222c71 on August 14, 2024 20:54
erights force-pushed the markm-prepare-for-extended-matchers branch from bd279f6 to 711ef1c on September 2, 2024 21:16
erights force-pushed the markm-pattern-based-compression-2 branch from a222c71 to 0c316b9 on September 2, 2024 21:16
erights force-pushed the markm-prepare-for-extended-matchers branch from 711ef1c to 1e4653e on September 7, 2024 20:25
erights force-pushed the markm-pattern-based-compression-2 branch from 0c316b9 to 8e72f8c on September 7, 2024 20:26
erights force-pushed the markm-prepare-for-extended-matchers branch from 1e4653e to c164404 on October 14, 2024 19:16
erights force-pushed the markm-pattern-based-compression-2 branch from 8e72f8c to ce96699 on October 14, 2024 19:18
erights force-pushed the markm-prepare-for-extended-matchers branch from c164404 to 21e35d9 on October 28, 2024 23:35
erights force-pushed the markm-pattern-based-compression-2 branch from ce96699 to a4dd6c9 on October 28, 2024 23:37
erights force-pushed the markm-prepare-for-extended-matchers branch from 21e35d9 to 9be1bfe on November 17, 2024 00:49
erights force-pushed the markm-pattern-based-compression-2 branch from a4dd6c9 to c933684 on November 17, 2024 00:49
Staged on #2248
closes: #2112
refs: #1564 Agoric/agoric-sdk#6432
Description
Adds two new exports to @endo/patterns: `mustCompress` and its "inverse", `mustDecompress`.
(From Agoric/agoric-sdk#6432 (comment) ):
For example, without compression, the Zoe proposal is stored with a smallcaps body of
'#{"exit":{"afterDeadline":{"deadline":"+11","timer":"$0.Alleged: timer"}},"give":{"Bid":{"brand":"$1.Alleged: simoleans","value":"+37"}},"want":{"Winnings":{"brand":"$2.Alleged: moola","value":{"#tag":"copyBag","payload":[[{"foo":"c"},"+1"],[{"foo":"b"},"+1"],[{"foo":"a"},"+1"]]}}}}'
But with the proposalShape, it compresses to a value whose smallcaps body is
'#[[["c"],["b"],["a"]],"+37","+11"]'
which is 12% as long.
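The 12% figure can be checked directly from the two smallcaps bodies quoted above. The following is a minimal sketch that only measures string lengths; the variable names are illustrative, and it assumes nothing beyond the fact that a smallcaps body is a `#` prefix followed by JSON text.

```javascript
// The two smallcaps bodies quoted in the description, copied verbatim.
const uncompressedBody =
  '#{"exit":{"afterDeadline":{"deadline":"+11","timer":"$0.Alleged: timer"}},"give":{"Bid":{"brand":"$1.Alleged: simoleans","value":"+37"}},"want":{"Winnings":{"brand":"$2.Alleged: moola","value":{"#tag":"copyBag","payload":[[{"foo":"c"},"+1"],[{"foo":"b"},"+1"],[{"foo":"a"},"+1"]]}}}}';
const compressedBody = '#[[["c"],["b"],["a"]],"+37","+11"]';

// After the leading '#', both bodies are plain JSON and parse as such.
const uncompressedJson = JSON.parse(uncompressedBody.slice(1));
const compressedJson = JSON.parse(compressedBody.slice(1));

// Ratio of compressed to uncompressed length, in characters.
const ratio = compressedBody.length / uncompressedBody.length;
console.log(`${Math.round(ratio * 100)}% as long`);
```

Note that the compressed body keeps only the bag elements, the give amount, and the deadline. The brands, the timer, and the record keys are all absent because they would be recovered from the pattern.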
It would take much more work, but if we were able to use matching interface guards on the sending and receiving sides, we'd get similar savings for messages. Agoric/agoric-sdk#6355 may help get there. But note the difficulties explained in "Upgrade Considerations" below.
`mustCompress` is analogous to `mustMatch`, which, as a reminder, takes a specimen, a pattern, and an optional label.
The following equivalences must hold:
- `mustMatch(s,p,l1?)` must succeed iff `mustCompress(s,p,l2?)` succeeds. When they succeed, the label does not matter; the `label` serves only to make failure diagnostics more informative. Thus, one throws iff the other throws. The diagnostics are not necessarily the same.
- `mustMatch(s,p,l1?)`, and therefore `mustCompress(s,p,l2?)`, succeeds iff `matches(s,p) === true`.
- `mustCompress(s,p,l?) === c` iff `mustDecompress(c,p,l) === s2`, where `s` and `s2` have the same distributed object semantics: `compareRank(s, s2) === 0`, `isKey(s) === isKey(s2)`, and `isKey(s) => keyEQ(s,s2)`.
The point is that typically `c` is smaller than `s`, though in some cases it may be larger. The space savings should typically be similar to the space savings from schema-based encodings like protobuf or Cap'n Proto. The pattern is analogous to the schema. Anything that must be in all specimens that match a given pattern can be omitted from the compressed form, since those parts can be recovered from the pattern on decompression. Unlike schema-based compression, this can include dynamic elements like brand identity, potentially resulting in greater savings and tighter error checking.
Unlike schema-based compression schemes like protobuf or Cap'n Proto, the layering here makes compression mostly independent of encoding/serialization, as shown by the above example: the compression is independent of whether the result will be encoded with smallcaps, and the smallcaps encoding is independent of whether its input was a compressed or uncompressed specimen. Or rather, mostly independent. We chose a nested-array compression because of its compact JSON representation, preserved by smallcaps.
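To make the "omit whatever the pattern already determines" idea concrete, here is a toy sketch. It is not the @endo/patterns algorithm or API: `ANY`, `toyCompress`, and `toyDecompress` are hypothetical names, the pattern language is just plain data with a wildcard, brands are stood in for by strings, and the compressed form is a flat array rather than the nested arrays this PR uses.

```javascript
// Toy illustration only: NOT the @endo/patterns algorithm or API.
const ANY = Symbol('any'); // hypothetical wildcard standing in for a matcher

const isRecord = x =>
  typeof x === 'object' && x !== null && !Array.isArray(x);
const sortedKeys = record => Object.keys(record).sort();

// Walk pattern and specimen together, keeping only the parts of the
// specimen that the pattern leaves open. Throws if they do not match.
const toyCompress = (specimen, pattern, holes = []) => {
  if (pattern === ANY) {
    holes.push(specimen);
  } else if (Array.isArray(pattern)) {
    if (!Array.isArray(specimen) || specimen.length !== pattern.length) {
      throw TypeError('array shape mismatch');
    }
    pattern.forEach((p, i) => toyCompress(specimen[i], p, holes));
  } else if (isRecord(pattern)) {
    const keys = sortedKeys(pattern);
    if (
      !isRecord(specimen) ||
      JSON.stringify(sortedKeys(specimen)) !== JSON.stringify(keys)
    ) {
      throw TypeError('record shape mismatch');
    }
    keys.forEach(k => toyCompress(specimen[k], pattern[k], holes));
  } else if (specimen !== pattern) {
    throw TypeError('literal mismatch');
  }
  return holes;
};

// Rebuild a specimen by walking the pattern in the same order and
// filling each wildcard from the compressed array.
const toyDecompress = (holes, pattern) => {
  let next = 0;
  const rebuild = p => {
    if (p === ANY) {
      if (next >= holes.length) throw TypeError('too few values');
      const value = holes[next];
      next += 1;
      return value;
    }
    if (Array.isArray(p)) return p.map(x => rebuild(x));
    if (isRecord(p)) {
      return Object.fromEntries(sortedKeys(p).map(k => [k, rebuild(p[k])]));
    }
    return p; // literal: recovered entirely from the pattern
  };
  const result = rebuild(pattern);
  if (next !== holes.length) throw TypeError('too many values');
  return result;
};

const toyShape = {
  exit: { afterDeadline: { deadline: ANY } },
  give: { Bid: { brand: 'simoleans', value: ANY } },
};
const toySpecimen = {
  exit: { afterDeadline: { deadline: 11 } },
  give: { Bid: { brand: 'simoleans', value: 37 } },
};
const toyCompressed = toyCompress(toySpecimen, toyShape);
const toyRoundTrip = toyDecompress(toyCompressed, toyShape);
```

In this toy, compression throws exactly when the specimen does not match the pattern, which mirrors the first equivalence above, and the round trip rebuilds a structurally equal specimen, which mirrors the third.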
Security Considerations
If sender and receiver can be led into compressing and decompressing with different patterns, or with different compression/decompression algorithms associated with that pattern's matchers, then compressed data might be decompressed into something arbitrarily different from what the sender meant to send. See "Upgrade Considerations" below.
Aside from that, none.
Scaling Considerations
The whole point. Compression could result in tremendously less data stored, sent, and received. Unfortunately, so far, the informal measurements of the time taken to compress are not encouraging. This needs to be measured carefully, and probably needs to be improved tremendously, before this PR is ready for production use. Ideally:
- `encode(mustCompress(data, pattern))` typically takes both less time and less space than `mustMatch(data, pattern) && encode(data)`.
- `mustDecompress(decode(encodedCompressedData), pattern)` typically takes less time than `decode(encodedUncompressedData)`.
This will depend of course on what `encode` scheme is used.
Documentation Considerations
Testing Considerations
Already includes good manual tests.
Compatibility Considerations
A big advantage of the smallcaps encoding of an uncompressed specimen is that the result is still mostly human readable, and processable using JSON-oriented tooling like jq. The compressed form loses both of these benefits, which also calls into question whether there's any point in smallcaps encoding the compressed form rather than using an unreadable binary encoding like `compactOrdered`, `syrup`, or `cbor`.
`compactOrdered` is both rank equality preserving and rank order preserving. Holding the pattern constant, `compactOrdered` of the compressed form would still be rank equality preserving, but not rank order preserving. Thus, stores will probably continue to encode their keys using `compactOrdered` on the uncompressed form, forfeiting the opportunity to use `keyShape` for compression.
Upgrade Considerations
When the compressed form is communicated instead of the uncompressed form, the sender and receiver must agree precisely on the pattern. If a different pattern is used to decompress than was used to compress, the compressed data might silently decompress into data arbitrarily different from the original specimen. The best way to ensure agreement is to send the pattern as well, somehow, from the sender to the receiver. For small data, this may cost more space than it saves.
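The hazard is easy to reproduce even in a deliberately simplified scheme. The sketch below is hypothetical and unrelated to the @endo/patterns implementation: a "pattern" here is just a record of fixed fields plus an ordered list of varying field names, and `packFields` and `unpackFields` are made-up helpers.

```javascript
// Toy scheme, for illustration only. A "pattern" is
// { fixed: {...}, varying: [fieldName, ...] }.
const packFields = (specimen, { varying }) =>
  varying.map(name => specimen[name]);

const unpackFields = (values, { fixed, varying }) => ({
  ...fixed,
  ...Object.fromEntries(varying.map((name, i) => [name, values[i]])),
});

const senderPattern = { fixed: { brand: 'moola' }, varying: ['want', 'give'] };
// The receiver's pattern differs in both the fixed part and the field order.
const receiverPattern = {
  fixed: { brand: 'simoleans' },
  varying: ['give', 'want'],
};

const originalOffer = { brand: 'moola', want: 3, give: 37 };
const wire = packFields(originalOffer, senderPattern);

// Same pattern on both ends: the original offer comes back.
const agreed = unpackFields(wire, senderPattern);
// Different pattern: no error is raised, but the brand and both amounts
// are wrong.
const garbled = unpackFields(wire, receiverPattern);

// Sending the pattern along removes the ambiguity, but for data this
// small the message becomes longer than the uncompressed original.
const withPatternLength = JSON.stringify({ pattern: senderPattern, wire }).length;
const plainLength = JSON.stringify(originalOffer).length;
console.log(agreed, garbled, withPatternLength > plainLength);
```

Nothing in the compressed array records which pattern produced it, so the receiver has no way to detect the disagreement from the data alone.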
SwingSet already stores optional patterns with some large data stores, with an error check to ensure that the data matches the pattern: `keyShape`, `valueShape`, and `stateShape`. Agoric/agoric-sdk#6432 modifies SwingSet to also use the `valueShape` and `stateShape` for compression.
A pattern is a tree of copy-data to be matched literally (the key-like parts), and Matchers, typically expressed in code like `M.bagOf(keyShape, countShape)` in the example above. The overall compression/decompression algorithms are composed from compression/decompression algorithms for each matcher kind. Not only must the sender and receiver agree exactly on the pattern, they must agree exactly on the algorithms associated with each matcher in the pattern. But we'd also like to improve these over time. Thus, this PR includes in each matcher kind definition an optional version number of the compression algorithm it uses. If omitted, that matcher does not compress. Version numbers are assigned in increasing sequence starting with `1`. The algorithm associated with a given sequence number must never change. If a given version of endo supports matcher M sequence number N, then it should also support all sequence numbers prior to N, unless there is a compelling reason to retire an old one.
The `M.something(...)` matcher makers should generally produce a matcher with the latest locally supported sequence number. Thus, this system supports older senders sending to newer receivers. This works fine for intra-vat storage, as in Agoric/agoric-sdk#6432, since intra-vat storage communicates data only forward in time/versions. However, inter-vat communications must tolerate some version slippage in both directions, which will require design of some kind of pattern negotiation.
[ ] Includes `*BREAKING*:` in the commit message with migration instructions for any breaking change.
This PR itself does not introduce any breaking changes. But PRs based on it will have more hazards of breaking changes as explained above.
[ ] Updates `NEWS.md` for user-facing changes.
Many of the points made in this PR note should be summarized in a NEWS.md entry.
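The versioning rules described under "Upgrade Considerations" can be sketched as a small registry. This is only one possible shape, under stated assumptions: `registerAlgorithm`, `makeMatcher`, and `lookupAlgorithm` are hypothetical names, the `nat` matcher kind and its two encodings are invented for the example, and none of this reflects how the PR actually stores version numbers.

```javascript
// Hypothetical registry: matcher kind -> Map(version -> algorithm).
const algorithms = new Map();

// An algorithm, once assigned to a version number, must never change,
// so re-registering an existing version is rejected.
const registerAlgorithm = (kind, version, algorithm) => {
  if (!algorithms.has(kind)) algorithms.set(kind, new Map());
  const versions = algorithms.get(kind);
  if (versions.has(version)) {
    throw Error(`${kind} v${version} is already defined`);
  }
  versions.set(version, algorithm);
};

// Matcher makers stamp the latest locally supported version.
// An undefined version means "this matcher does not compress".
const makeMatcher = kind => {
  const versions = algorithms.get(kind);
  const version =
    versions === undefined ? undefined : Math.max(...versions.keys());
  return { kind, version };
};

// Decompression must use exactly the version recorded in the matcher.
const lookupAlgorithm = ({ kind, version }) => {
  const algorithm = algorithms.get(kind)?.get(version);
  if (algorithm === undefined) {
    throw Error(`no ${kind} compression v${version} supported here`);
  }
  return algorithm;
};

// Invented example: v1 writes decimal digits, v2 writes base 36.
registerAlgorithm('nat', 1, {
  compress: n => String(n),
  decompress: s => Number(s),
});
const olderMatcher = makeMatcher('nat'); // made when only v1 existed

registerAlgorithm('nat', 2, {
  compress: n => n.toString(36),
  decompress: s => parseInt(s, 36),
});
const newerMatcher = makeMatcher('nat'); // stamped with v2

// A newer receiver still supports v1, so older data decompresses fine.
const fromOlder = lookupAlgorithm(olderMatcher).decompress(
  lookupAlgorithm(olderMatcher).compress(255),
);
const newerWire = lookupAlgorithm(newerMatcher).compress(255);
console.log(fromOlder, newerWire);
```

A receiver that only knows v1 would fail the lookup for `newerMatcher` rather than silently misreading the data, which is the situation that pattern negotiation between vats would need to handle.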