pack / unpack: a flexible data encoding scheme #9829

axic · 2020-09-16T20:54:00Z

Exchanging data with contracts is a crucial feature, which has been mostly covered by the ABI and hidden by the language with decoding and encoding done opaquely. Various constructs, such as proxies and layer-2 solutions, demand more control over the data, and as a result new functionality was introduced to widen the control for the users: abi.encode, abi.encodePacked, and abi.decode.

Important to note that abi.encode/abi.decode operate on the "non-packed" ABI encoding specification, while abi.encodePacked allows for a weird ruleset of packed encoding.

There have been several requests for more flexibility for decoding, including abi.decodePacked, which due to the ambiguity of the encoding cannot be done.

I propose to take a cue from other languages and consider the widely used pack / unpack functions:

Python packing: https://docs.python.org/2/library/struct.html
Ruby packing: https://apidock.com/ruby/String/unpack
Lua packing: http://www.inf.puc-rio.br/~roberto/struct/

Introduce two functions:

pack(<format>, <values>...) -> (bytes memory data)
unpack(<data>, <format>, <types>...) -> <variables>...)

The format is a string literal consisting of the keys described below, where the space character between the keys are ignored. The types is the same format as used by abi.decode, and the number of types must match the number of "captures" provided in the format. Both of these rules are so an encoder/decoder can be generated at compile time.

These could be placed under abi., some new namespace, or left top level.

Example:

bytes memory data = pack("i8 c20 A", -1234, "20 bytes of string!\0", this);
(int64 a, string memory b, address c) = unpack(data, "i8 c20 A", (int64, string memory, address));

Key	C type	Solidity type	Python	Ruby	Lua	Comments
t	"bool"	bool	*	*	*	(probably not needed)
b	char	int8	Y		Y
B	unsigned char	uint8	Y		Y
h	short	int16	Y		Y
H	unsigned short	uint16	Y		Y
iN	int	intN	*		Y
IN	unsigned int	uintN	*		Y
l	long	int32	Y		Y
L	unsigned long	uint32	Y		Y
cN	char[]	bytes	*		Y
c0	C-string	string	*		Y
A	char[20]	address	n/a	n/a	n/a

There are two more special symbols:

> sets big-endian mode (this is the default)
< sets little-endian mode

Additional rules:

any number of spaces between the keys are ignored
a valid lowercase hex number (in the form of [0-9a-z][0-9a-z]) is taken as a literal value to be matched
the underscore followed by a decimal number (_NN) literal signals the number of bytes to skip (or zero-pad in case of packing)
iN and IN mean that N is a decimal number literal which defines the number of bytes of that integer type
cN means that N number of bytes are expected, where N is a literal decimal number
c0 means that a zero terminated C-string is expected, furthermore it must be valid UTF-8

One of many questions: should the decoder fail (revert) if the length of input is shorter than expected? I think so, though one could argue if that is calldata then it can be "safely" zero padded.

Example for decoding ABI encoded data:

// Decodes 32-bit, skipping 12 bytes, and 20 characters.
(uint32 selector, address recipient) = unpack(data, “L _12 A”, (uint32, address));

This is only a rough draft and I think we have to be careful at selecting this initial list of supported keys. I think the above is a good starting point, but not something final. It could also eventually deprecate abi.encodePacked.

The text was updated successfully, but these errors were encountered:

axic · 2020-09-16T21:30:19Z

As an extension we could consider supporting structs to be encoded directly, marked with the key x, and considering the following restriction: only value types are supported and dynamic types or arrays are not. The structs are encoded as they are laid out in memory (such as struct A { uint8, uint16 } would be encoded as two zero padded 32-byte fields). Furthermore could consider introducing the packed keyword for structs to remove any padding from the encoding (such as packed struct { uint8, uint16 } would be encoded as 3 bytes).

leonardoalt · 2020-09-17T17:26:06Z

Nice! I think it's a good idea, and it makes sense to reuse the [sort of] common format the other languages have.

axic · 2020-09-17T17:52:30Z

Thinking about this perhaps we should consider making the above compile-time fixed-size only. This means removing the c0 key as that depends on the input.

If the expected length and the generated output length is known at compile time, then

the compiler can insert a length-check statement and revert if out of bounds,
it would help the SMT checker to at least argue about the input/output lengths,
it would simplify the generated code.

I don't see any immediate use case which would need that dynamic behaviour, in fact I would expect the dynamic behaviour to be handled by the caller. Also note that Python, Ruby, and Lua, each to some extent support dynamic behaviour, some stretching it to provide a "semi-regexp" format. We absolutely want to avoid that here.

Should also consider replacing abi.encodePacked with a specialised version called concat. Perhaps the difference is that smaller number of types are supported.

Some example code follows:

Eth2 deposit contract

before:

        bytes32 node = sha256(abi.encodePacked(
            sha256(abi.encodePacked(pubkey_root, withdrawal_credentials)),
            sha256(abi.encodePacked(amount, bytes24(0), signature_root))
        ));

after:

        bytes32 node = sha256(pack("c32 c32",
            sha256(pack("c32 c32", pubkey_root, withdrawal_credentials)),
            sha256(pack("<I8 >c24 c32", deposit_amount, bytes24(0), signature_root))
        ));

Note due to support for endianness, there is no need for the to_little_endian_64 helper function.

Could also choose to represent the padding as a literal, such as <I8 0000..0000 c32 where the 00... part is 48 nibbles long, but the above seems shorter.

openzeppelin/ecdsa

before:

        // Check the signature length
        if (signature.length != 65) {
            revert("ECDSA: invalid signature length");
        }

        // Divide the signature in r, s and v variables
        bytes32 r;
        bytes32 s;
        uint8 v;

        // ecrecover takes the signature parameters, and the only way to get them
        // currently is to use assembly.
        // solhint-disable-next-line no-inline-assembly
        assembly {
            r := mload(add(signature, 0x20))
            s := mload(add(signature, 0x40))
            v := byte(0, mload(add(signature, 0x60)))
        }

after:

        (bytes32 r, bytes32 s, uint8 v) = unpack(signature, "c32 c32 B", (bytes32, bytes32, uint8));

axic · 2020-09-17T21:00:00Z

As an alternate proposal (triggered by @ekpyron) we could consider is possible if we have #9170 (wasn't there an earlier issue for this? we discussed it as early as 2016).

The above examples in order would look like this:

// Decodes 32-bit, skipping 12 bytes, and 20 characters.
require(data.length == 36);
uint32 selector = uint32(bytes4(data[:4]));
address recipient = address(bytes20(data[24:]));

        bytes32 node = sha256(concat(
            sha256(concat(pubkey_root, withdrawal_credentials)),
            sha256(concat(bswap(bytes8(uint64(deposit_amount))), bytes24(0), signature_root)
        ));

Here we use bswap to perform endianness swap. It would be possible to consider bytes8_le or some other functionality in the type system to deal with this.

require(signature.length == 65);
bytes32(signature[:32])
bytes32(signature[32:64])
uint8(signature[65])

This one is actually pretty neat.

ekpyron · 2020-09-17T21:03:14Z

This is also related: #8772
That could e.g. replace the cN thing for N > 32, etc.

axic · 2020-09-17T21:03:24Z

The original proposal was written a while back and initially looked to support more dynamic features. If we make it entirely static, it probably is possible to drop the formatting string and rely on the types only -- which basically means this is a proposal to clarify which types abi.encodePacked (and a potential abi.decodePacked) support, and how it encodes them.

chriseth · 2020-09-18T08:56:57Z

Would it be possible to use template syntax for the format specifier, so that we can do proper type checking? I.e.
pack::<format>(v1,v2,v3)

This would maybe require the format to be something else than a string, but maybe not.

chriseth · 2020-09-21T09:35:51Z

I prefer the more verbose way to do it (with the added benefit that it is mainly in "userland"). The abbreviations are just very hard to remember correctly. The downside is that the offsets have to be matched, and I wonder if there is some other mechanism to do it that looks more like:

Unpacker memory unpack(signature);
bytes32 r = unpack.Bytes32();
bytes32 s = unpack.Bytes32();
uint8 v = unpack.Uint8();
unpack.end();

One drawback I can see is that people could be tempted to use the unpacking functions directly in function arguments here evaluation order is not always strictly adhered to.

github-actions · 2023-03-08T12:05:13Z

This issue has been marked as stale due to inactivity for the last 90 days.
It will be automatically closed in 7 days.

github-actions · 2023-03-16T12:04:56Z

Hi everyone! This issue has been automatically closed due to inactivity.
If you think this issue is still relevant in the latest Solidity version and you have something to contribute, feel free to reopen.
However, unless the issue is a concrete proposal that can be implemented, we recommend starting a language discussion on the forum instead.

cameel added feature language design Any changes to the language, e.g. new features labels Sep 25, 2020

axic mentioned this issue Dec 21, 2020

stdlib prototype #10639

Closed

4 tasks

This was referenced Feb 5, 2021

accounts/abi: implement abi.encodePacked ethereum/go-ethereum#22273

Closed

Introduce concat or bytes.concat #10903

Closed

nikola-matic removed the feature label Aug 17, 2022

github-actions bot added the stale The issue/PR was marked as stale because it has been open for too long. label Mar 8, 2023

github-actions bot added the closed due inactivity The issue/PR was automatically closed due to inactivity. label Mar 16, 2023

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pack / unpack: a flexible data encoding scheme #9829

pack / unpack: a flexible data encoding scheme #9829

axic commented Sep 16, 2020 •

edited

Loading

axic commented Sep 16, 2020

leonardoalt commented Sep 17, 2020

axic commented Sep 17, 2020 •

edited

Loading

axic commented Sep 17, 2020 •

edited

Loading

ekpyron commented Sep 17, 2020

axic commented Sep 17, 2020

chriseth commented Sep 18, 2020

chriseth commented Sep 21, 2020

github-actions bot commented Mar 8, 2023

github-actions bot commented Mar 16, 2023

pack / unpack: a flexible data encoding scheme #9829

pack / unpack: a flexible data encoding scheme #9829

Comments

axic commented Sep 16, 2020 • edited Loading

axic commented Sep 16, 2020

leonardoalt commented Sep 17, 2020

axic commented Sep 17, 2020 • edited Loading

Eth2 deposit contract

openzeppelin/ecdsa

axic commented Sep 17, 2020 • edited Loading

ekpyron commented Sep 17, 2020

axic commented Sep 17, 2020

chriseth commented Sep 18, 2020

chriseth commented Sep 21, 2020

github-actions bot commented Mar 8, 2023

github-actions bot commented Mar 16, 2023

axic commented Sep 16, 2020 •

edited

Loading

axic commented Sep 17, 2020 •

edited

Loading

axic commented Sep 17, 2020 •

edited

Loading