Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pack / unpack: a flexible data encoding scheme #9829

Closed
axic opened this issue Sep 16, 2020 · 10 comments
Closed

pack / unpack: a flexible data encoding scheme #9829

axic opened this issue Sep 16, 2020 · 10 comments
Labels
closed due inactivity The issue/PR was automatically closed due to inactivity. language design :rage4: Any changes to the language, e.g. new features stale The issue/PR was marked as stale because it has been open for too long.

Comments

@axic
Copy link
Member

axic commented Sep 16, 2020

Exchanging data with contracts is a crucial feature, which has been mostly covered by the ABI and hidden by the language with decoding and encoding done opaquely. Various constructs, such as proxies and layer-2 solutions, demand more control over the data, and as a result new functionality was introduced to widen the control for the users: abi.encode, abi.encodePacked, and abi.decode.

Important to note that abi.encode/abi.decode operate on the "non-packed" ABI encoding specification, while abi.encodePacked allows for a weird ruleset of packed encoding.

There have been several requests for more flexibility for decoding, including abi.decodePacked, which due to the ambiguity of the encoding cannot be done.

I propose to take a cue from other languages and consider the widely used pack / unpack functions:


Introduce two functions:

  1. pack(<format>, <values>...) -> (bytes memory data)
  2. unpack(<data>, <format>, <types>...) -> <variables>...)

The format is a string literal consisting of the keys described below, where the space character between the keys are ignored. The types is the same format as used by abi.decode, and the number of types must match the number of "captures" provided in the format. Both of these rules are so an encoder/decoder can be generated at compile time.

These could be placed under abi., some new namespace, or left top level.

Example:

bytes memory data = pack("i8 c20 A", -1234, "20 bytes of string!\0", this);
(int64 a, string memory b, address c) = unpack(data, "i8 c20 A", (int64, string memory, address));

Key C type Solidity type Python Ruby Lua Comments
t "bool" bool * * * (probably not needed)
b char int8 Y Y
B unsigned char uint8 Y Y
h short int16 Y Y
H unsigned short uint16 Y Y
iN int intN * Y
IN unsigned int uintN * Y
l long int32 Y Y
L unsigned long uint32 Y Y
cN char[] bytes * Y
c0 C-string string * Y
A char[20] address n/a n/a n/a

There are two more special symbols:

  • > sets big-endian mode (this is the default)
  • < sets little-endian mode

Additional rules:

  • any number of spaces between the keys are ignored
  • a valid lowercase hex number (in the form of [0-9a-z][0-9a-z]) is taken as a literal value to be matched
  • the underscore followed by a decimal number (_NN) literal signals the number of bytes to skip (or zero-pad in case of packing)
  • iN and IN mean that N is a decimal number literal which defines the number of bytes of that integer type
  • cN means that N number of bytes are expected, where N is a literal decimal number
  • c0 means that a zero terminated C-string is expected, furthermore it must be valid UTF-8

One of many questions: should the decoder fail (revert) if the length of input is shorter than expected? I think so, though one could argue if that is calldata then it can be "safely" zero padded.


Example for decoding ABI encoded data:

// Decodes 32-bit, skipping 12 bytes, and 20 characters.
(uint32 selector, address recipient) = unpack(data, “L _12 A”, (uint32, address));

This is only a rough draft and I think we have to be careful at selecting this initial list of supported keys. I think the above is a good starting point, but not something final. It could also eventually deprecate abi.encodePacked.

@axic
Copy link
Member Author

axic commented Sep 16, 2020

As an extension we could consider supporting structs to be encoded directly, marked with the key x, and considering the following restriction: only value types are supported and dynamic types or arrays are not. The structs are encoded as they are laid out in memory (such as struct A { uint8, uint16 } would be encoded as two zero padded 32-byte fields). Furthermore could consider introducing the packed keyword for structs to remove any padding from the encoding (such as packed struct { uint8, uint16 } would be encoded as 3 bytes).

@leonardoalt
Copy link
Member

Nice! I think it's a good idea, and it makes sense to reuse the [sort of] common format the other languages have.

@axic
Copy link
Member Author

axic commented Sep 17, 2020

Thinking about this perhaps we should consider making the above compile-time fixed-size only. This means removing the c0 key as that depends on the input.

If the expected length and the generated output length is known at compile time, then

  • the compiler can insert a length-check statement and revert if out of bounds,
  • it would help the SMT checker to at least argue about the input/output lengths,
  • it would simplify the generated code.

I don't see any immediate use case which would need that dynamic behaviour, in fact I would expect the dynamic behaviour to be handled by the caller. Also note that Python, Ruby, and Lua, each to some extent support dynamic behaviour, some stretching it to provide a "semi-regexp" format. We absolutely want to avoid that here.

Should also consider replacing abi.encodePacked with a specialised version called concat. Perhaps the difference is that smaller number of types are supported.

Some example code follows:

Eth2 deposit contract

before:

        bytes32 node = sha256(abi.encodePacked(
            sha256(abi.encodePacked(pubkey_root, withdrawal_credentials)),
            sha256(abi.encodePacked(amount, bytes24(0), signature_root))
        ));

after:

        bytes32 node = sha256(pack("c32 c32",
            sha256(pack("c32 c32", pubkey_root, withdrawal_credentials)),
            sha256(pack("<I8 >c24 c32", deposit_amount, bytes24(0), signature_root))
        ));

Note due to support for endianness, there is no need for the to_little_endian_64 helper function.

Could also choose to represent the padding as a literal, such as <I8 0000..0000 c32 where the 00... part is 48 nibbles long, but the above seems shorter.

openzeppelin/ecdsa

before:

        // Check the signature length
        if (signature.length != 65) {
            revert("ECDSA: invalid signature length");
        }

        // Divide the signature in r, s and v variables
        bytes32 r;
        bytes32 s;
        uint8 v;

        // ecrecover takes the signature parameters, and the only way to get them
        // currently is to use assembly.
        // solhint-disable-next-line no-inline-assembly
        assembly {
            r := mload(add(signature, 0x20))
            s := mload(add(signature, 0x40))
            v := byte(0, mload(add(signature, 0x60)))
        }

after:

        (bytes32 r, bytes32 s, uint8 v) = unpack(signature, "c32 c32 B", (bytes32, bytes32, uint8));

@axic
Copy link
Member Author

axic commented Sep 17, 2020

As an alternate proposal (triggered by @ekpyron) we could consider is possible if we have #9170 (wasn't there an earlier issue for this? we discussed it as early as 2016).

The above examples in order would look like this:

// Decodes 32-bit, skipping 12 bytes, and 20 characters.
require(data.length == 36);
uint32 selector = uint32(bytes4(data[:4]));
address recipient = address(bytes20(data[24:]));

        bytes32 node = sha256(concat(
            sha256(concat(pubkey_root, withdrawal_credentials)),
            sha256(concat(bswap(bytes8(uint64(deposit_amount))), bytes24(0), signature_root)
        ));

Here we use bswap to perform endianness swap. It would be possible to consider bytes8_le or some other functionality in the type system to deal with this.


require(signature.length == 65);
bytes32(signature[:32])
bytes32(signature[32:64])
uint8(signature[65])

This one is actually pretty neat.

@ekpyron
Copy link
Member

ekpyron commented Sep 17, 2020

This is also related: #8772
That could e.g. replace the cN thing for N > 32, etc.

@axic
Copy link
Member Author

axic commented Sep 17, 2020

The original proposal was written a while back and initially looked to support more dynamic features. If we make it entirely static, it probably is possible to drop the formatting string and rely on the types only -- which basically means this is a proposal to clarify which types abi.encodePacked (and a potential abi.decodePacked) support, and how it encodes them.

@chriseth
Copy link
Contributor

Would it be possible to use template syntax for the format specifier, so that we can do proper type checking? I.e.
pack::<format>(v1,v2,v3)

This would maybe require the format to be something else than a string, but maybe not.

@chriseth
Copy link
Contributor

I prefer the more verbose way to do it (with the added benefit that it is mainly in "userland"). The abbreviations are just very hard to remember correctly. The downside is that the offsets have to be matched, and I wonder if there is some other mechanism to do it that looks more like:

Unpacker memory unpack(signature);
bytes32 r = unpack.Bytes32();
bytes32 s = unpack.Bytes32();
uint8 v = unpack.Uint8();
unpack.end();

One drawback I can see is that people could be tempted to use the unpacking functions directly in function arguments here evaluation order is not always strictly adhered to.

@cameel cameel added feature language design :rage4: Any changes to the language, e.g. new features labels Sep 25, 2020
@axic axic mentioned this issue Dec 21, 2020
4 tasks
@github-actions
Copy link

github-actions bot commented Mar 8, 2023

This issue has been marked as stale due to inactivity for the last 90 days.
It will be automatically closed in 7 days.

@github-actions github-actions bot added the stale The issue/PR was marked as stale because it has been open for too long. label Mar 8, 2023
@github-actions
Copy link

Hi everyone! This issue has been automatically closed due to inactivity.
If you think this issue is still relevant in the latest Solidity version and you have something to contribute, feel free to reopen.
However, unless the issue is a concrete proposal that can be implemented, we recommend starting a language discussion on the forum instead.

@github-actions github-actions bot added the closed due inactivity The issue/PR was automatically closed due to inactivity. label Mar 16, 2023
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
closed due inactivity The issue/PR was automatically closed due to inactivity. language design :rage4: Any changes to the language, e.g. new features stale The issue/PR was marked as stale because it has been open for too long.
Projects
None yet
Development

No branches or pull requests

6 participants