-
-
Notifications
You must be signed in to change notification settings - Fork 416
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deterministic serialisation for cross binary communication #4567
base: main
Are you sure you want to change the base?
Conversation
Prior to this commit, serialisation used the `type_id` of the types that it serialised and the `type_id`s assigned to types can vary depending on the program being compiled. This makes it impossible for cross binary serialisation even if the same pony version and types are used. This commit changes things so that serialisation now uses a determnistically assigned `serialise_id` instead of the `type_id`. This new `serialise_id` is guaranteed to be the same for the same pony version and type (see details for caveats). The changes include: * adding a `size_t` size `serialise_id` to `pony_type_t` * hashes the reach_type_t->name to generate either a 63 bit or 31 bit `serialise_id` depending on target platform * the hash for the `serialise_id` uses a `key` that is based on the blake2 encoding of the pony version, the target data model, and the endianness (i.e. `pony-0.58.7-ac1c51b69-lp64-le`). This is to ensure that only programs with the same `key` will have the same `serialise_id` to prevent silent data corruption during deserialisation due to subtle differences in platforms. This means that cross binary serialisation will only work for binaries of the same bit width (32 bit vs 64 bit), data model (ilp32, lp64, or llp64), and endianness (big endian or little endian) * serialisation now uses the `serialise_id` instead of the `type_id` * deserialisation now uses the `serialise_id` to look up the `type_id` to get the entry in the descriptor table instead of looking up the entry in the descriptor table directly based on `type_id`
Hi @dipinhora, The changelog - added label was added to this pull request; all PRs with a changelog label need to have release notes included as part of the PR. If you haven't added release notes already, please do. Release notes are added by creating a uniquely named file in the The basic format of the release notes (using markdown) should be:
Thanks. |
What happens in the case of hash collisions with ids? That appears to be a possibility, yes? |
yes, hash collisions are possible.. no special handling is done for those in the current implementation but debug builds should fail with an assertion on startup due to (https://github.com/ponylang/ponyc/pull/4567/files#diff-8791fb0991855d735f325e9ce5c0adf9bb9f71331cc032c8e38d913979483c55R85-R97): bool ponyint_serialise_setup(pony_type_t** table, size_t table_size,
desc_offset_lookup_fn desc_table_offset_lookup)
{
#ifndef PONY_NDEBUG
for(uint32_t i = 0; i < table_size; i++)
{
if(table[i] != NULL)
{
pony_assert(table[i]->id == i);
pony_assert(desc_table_offset_lookup(table[i]->serialise_id) == i);
}
}
#endif |
@@ -0,0 +1,3 @@ | |||
## Deterministic cerealisation for cross binary communication | |||
|
|||
Pony built-in serialisation can now be used between binaries compiled with the same version of the pony compiler (this would likely result in segfaults previously). Cross binary serialisation will only work for binaries of the same bit width (32 bit vs 64 bit), data model (ilp32, lp64, or llp64), and endianness (big endian or little endian) but is not limited to a single platform (for example: one can mix and match x86_64 linux and aarch64 linux because they have the same bitwidth, data model, and endianness). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need an additional section of "serialization can lead to bad things", "this can be an attack model on pony programs and should only be used with trusted input", and what not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm.. the serialise
package currently has the following (some/all of which can be copied to the release notes and also be updated for the changes in this PR):
Deserialisation is fundamentally unsafe currently: there isn't yet a
verification pass to check that the resulting object graph maintains a
well-formed heap or that individual objects maintain any expected local
invariants. However, if only "trusted" data (i.e. data produced by Pony
serialisation from the same binary) is deserialised, it will always maintain a
well-formed heap and all object invariants.Note that serialised data is not usable between different Pony binaries. This is
due to the use of type identifiers rather than a heavy-weight self-describing
serialisation schema. This also means it isn't safe to deserialise something
serialised by the same program compiled for a different platform.The Serialise.signature method is provided
for the purposes of comparing communicating Pony binaries to determine if they
are the same. Confirming this before deserialising data can help mitigate the
risk of accidental serialisation across different Pony binaries, but does not on
its own address the security issues of accepting data from untrusted sources.
a few notes to ensure everyone understands the full scope/limitations of this PR:
-
The Serialise.signature method is provided
for the purposes of comparing communicating Pony binaries to determine if they
are the same.i missed this in the changes so far and this method needs to be updated to return the signature of the target pony runtime rather than the program..
-
However, if only "trusted" data (i.e. data produced by Pony
serialisation from the same binary) is deserialised, it will always maintain a
well-formed heap and all object invariants.-
with this PR, we can only
maintain a well-formed heap and all object invariants
if the binary doing thedeserialisation
knows/uses all the types that were used forserialisation
or else the program will assert/fail withponyint_assert_fail("deserialise offset invalid", __FILE__, __LINE__, __func__)
duringdeserialisation
(deserialise offset invalid
probably needs rewording to be more explicit about the issue/error).. -
maintain a well-formed heap and all object invariants
cannot be guaranteed for all combinations of programs because theserialise_id
is only based on thetype name
and not the type'sname and fields/memory layout
(orfull ast
).. unfortunately, this will cause data corruption (and possibly memory clobbering) and not an assert/fail like the previous bullet because theserialise_id
will be the same between the two binaries even though it shouldn't be (because a type's field layout changed) due to the limitation on howserialise_id
is determined currently..
-
given the above, especially the second bullet under 2
, maybe this PR should be reframed to be internal changes in support of cross binary serialisation
until the serialise_id
can be based on either the full ast
or the name and fields/memory layout
for each type to remove that caveat/concern (if so, it would make sense to defer updating Serialise.signature
until that time also)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So basically, this doesn't get a changelog entry for now and we don't tell people about the change. that is my interpretation of "internal". is that yours?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So basically, this doesn't get a changelog entry for now and we don't tell people about the change. that is my interpretation of "internal". is that yours?
yes, unless we're ok with the large footgun that is the second bullet under "2"
..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup. i think internal is a good idea. i'll remove the label. you can remove the release notes if you want, otherwise the bot will toss them on merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gonna leave it.. gotta keep the bots employed or they might revolt...
Are there plans to do a followup PR to do the remaining work to make this a publicly documented feature? |
No. |
Personally I find it strange to take on additional maintenance burden and implementation complexity for an undocumented/non-public "feature" (which kind of makes it not a feature?). I'm assuming this is making some use cases possible which weren't possible before, so I'd prefer to document those and mark them as a feature, even if it means we need to include caveats about edge cases that break things. Without that in place, then somebody could easily revert this PR later because "the docs already say that cross binary serialization doesn't work, so it's safe to simplify this code away", and they would be justified in doing so. |
I'm curious to hear what @SeanTAllen thinks on this point, because it sounded like he was the one suggesting to remove the release notes. |
I view this as an improvement on the existing functionality. It isn't "you can use this" level where it should be public, but it is an improvement on the functionality that exists and a path towards someone eventually bringing it fully home. |
fair point.. this doesn't have to be merged if it's not considered worth the tradeoff you mention..
nope.. not at all (assuming folks are following the recommended practices as mentioned in the
yep.. |
Prior to this commit, serialisation used the
type_id
of the types that it serialised and thetype_id
s assigned to types can vary depending on the program being compiled. This makes it impossible for cross binary serialisation even if the same pony version and types are used.This commit changes things so that serialisation now uses a determnistically assigned
serialise_id
instead of thetype_id
. This newserialise_id
is guaranteed to be the same for the same pony version and type (see details for caveats).The changes include:
size_t
sizeserialise_id
topony_type_t
serialise_id
depending on target platformserialise_id
uses akey
that is based on the blake2 encoding of the pony version, the target data model, and the endianness (i.e.pony-0.58.7-ac1c51b69-lp64-le
). This is to ensure that only programs with the samekey
will have the sameserialise_id
to prevent silent data corruption during deserialisation due to subtle differences in platforms. This means that cross binary serialisation will only work for binaries of the same bit width (32 bit vs 64 bit), data model (ilp32, lp64, or llp64), and endianness (big endian or little endian)serialise_id
instead of thetype_id
serialise_id
to look up thetype_id
to get the entry in the descriptor table instead of looking up the entry in the descriptor table directly based ontype_id
yes, this is a ressurrection of #2949