Deterministic serialisation for cross binary communication #4567

dipinhora · 2024-12-05T15:52:21Z

Prior to this commit, serialisation used the type_id of the types that it serialised, and the type_ids assigned to types can vary depending on the program being compiled. This makes cross binary serialisation impossible even if the same pony version and types are used.

This commit changes things so that serialisation now uses a deterministically assigned serialise_id instead of the type_id. This new serialise_id is guaranteed to be the same for the same pony version and type (see details for caveats).

The changes include:

* adding a size_t size serialise_id to pony_type_t
* hashing the reach_type_t->name to generate either a 63 bit or 31 bit serialise_id depending on target platform
* the hash for the serialise_id uses a key that is based on the blake2 encoding of the pony version, the target data model, and the endianness (e.g. pony-0.58.7-ac1c51b69-lp64-le). This is to ensure that only programs with the same key will have the same serialise_id, to prevent silent data corruption during deserialisation due to subtle differences in platforms. This means that cross binary serialisation will only work for binaries of the same bit width (32 bit vs 64 bit), data model (ilp32, lp64, or llp64), and endianness (big endian or little endian)
* serialisation now uses the serialise_id instead of the type_id
* deserialisation now uses the serialise_id to look up the type_id to get the entry in the descriptor table instead of looking up the entry in the descriptor table directly based on type_id
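
As a rough illustration of the key and id derivation listed above, here is a minimal sketch in C. It is not the code from this PR: the helper names (`data_model_name`, `endian_name`, `build_key`, `sketch_serialise_id`) are hypothetical, 64-bit FNV-1a is used as a stand-in hash, and the key string is fed to the hash directly rather than being blake2 encoded first.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: name the data model from the sizes of pointers and
 * long. Only ilp32, lp64 and llp64 are distinguished. */
static const char* data_model_name(void)
{
  if(sizeof(void*) == 4)
    return "ilp32";

  return (sizeof(long) == 8) ? "lp64" : "llp64";
}

/* Hypothetical helper: detect endianness at run time. */
static const char* endian_name(void)
{
  const uint16_t probe = 1;
  return (*(const uint8_t*)&probe == 1) ? "le" : "be";
}

/* Build a key string shaped like "pony-0.58.7-ac1c51b69-lp64-le".
 * Returns what snprintf returns. */
static int build_key(char* out, size_t out_size, const char* version)
{
  return snprintf(out, out_size, "pony-%s-%s-%s", version,
    data_model_name(), endian_name());
}

/* Stand-in hash step: 64-bit FNV-1a continued from state h.
 * Assumption: the real implementation uses a different keyed hash. */
static uint64_t fnv1a(uint64_t h, const char* s)
{
  for(; *s != '\0'; s++)
  {
    h ^= (uint8_t)*s;
    h *= UINT64_C(0x100000001b3);
  }

  return h;
}

/* Hash the key, then the type name, and clear the top bit so the result is
 * a 63 bit id on 64 bit targets and a 31 bit id on 32 bit targets. */
static size_t sketch_serialise_id(const char* key, const char* type_name)
{
  uint64_t h = fnv1a(UINT64_C(0xcbf29ce484222325), key);
  h = fnv1a(h, type_name);

  if(sizeof(size_t) == 8)
    return (size_t)(h & (UINT64_MAX >> 1));

  return (size_t)(h & UINT64_C(0x7fffffff));
}
```

Two binaries that build the same key and contain a type with the same name get the same id in this sketch, while any difference in version, data model, or endianness changes the key and therefore the hash input.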

yes, this is a resurrection of #2949


ponylang-main · 2024-12-05T16:18:26Z

Hi @dipinhora,

The changelog - added label was added to this pull request; all PRs with a changelog label need to have release notes included as part of the PR. If you haven't added release notes already, please do.

Release notes are added by creating a uniquely named file in the .release-notes directory. We suggest you call the file 4567.md to match the number of this pull request.

The basic format of the release notes (using markdown) should be:

## Title

End user description of changes, why it's important,
problems it solves etc.

If a breaking change, make sure to include 1 or more
examples what code would look like prior to this change
and how to update it to work after this change.

Thanks.

SeanTAllen · 2024-12-05T17:09:46Z

What happens in the case of hash collisions with ids? That appears to be a possibility, yes?

dipinhora · 2024-12-05T18:02:09Z

> What happens in the case of hash collisions with ids? That appears to be a possibility, yes?

yes, hash collisions are possible.. no special handling is done for those in the current implementation but debug builds should fail with an assertion on startup due to (https://github.com/ponylang/ponyc/pull/4567/files#diff-8791fb0991855d735f325e9ce5c0adf9bb9f71331cc032c8e38d913979483c55R85-R97):

bool ponyint_serialise_setup(pony_type_t** table, size_t table_size,
  desc_offset_lookup_fn desc_table_offset_lookup)
{
#ifndef PONY_NDEBUG
  for(uint32_t i = 0; i < table_size; i++)
  {
    if(table[i] != NULL)
    {
      pony_assert(table[i]->id == i);
      pony_assert(desc_table_offset_lookup(table[i]->serialise_id) == i);
    }
  }
#endif
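
To make the round-trip check in the excerpt above concrete, here is a small self-contained sketch. The struct and function names (`sketch_type_t`, `sketch_lookup`, `sketch_ids_unique`) are hypothetical, and the linear search only stands in for whatever lookup the compiler generates. The point it illustrates is that when two types share a serialise_id, the lookup can only return one of them, so the other fails the "maps back to its own slot" test.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for pony_type_t: only the two ids. */
typedef struct sketch_type_t
{
  uint32_t id;         /* index into the descriptor table */
  size_t serialise_id; /* deterministic id written into serialised data */
} sketch_type_t;

/* Linear lookup from serialise_id to descriptor table index.
 * Returns table_size when no entry has that serialise_id. */
static size_t sketch_lookup(sketch_type_t* const* table, size_t table_size,
  size_t serialise_id)
{
  for(size_t i = 0; i < table_size; i++)
  {
    if((table[i] != NULL) && (table[i]->serialise_id == serialise_id))
      return i;
  }

  return table_size;
}

/* Returns true when every non-NULL entry sits at its own id and its
 * serialise_id maps back to that same slot. A collision makes the later
 * of the two colliding entries fail this check. */
static bool sketch_ids_unique(sketch_type_t* const* table, size_t table_size)
{
  for(size_t i = 0; i < table_size; i++)
  {
    if(table[i] == NULL)
      continue;

    if(table[i]->id != i)
      return false;

    if(sketch_lookup(table, table_size, table[i]->serialise_id) != i)
      return false;
  }

  return true;
}
```

A check of this kind only detects collisions within one binary. It says nothing about whether another binary assigns the same id to a type with a different layout.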


SeanTAllen · 2024-12-05T18:08:49Z

.release-notes/4567.md

@@ -0,0 +1,3 @@
+## Deterministic cerealisation for cross binary communication
+
+Pony built-in serialisation can now be used between binaries compiled with the same version of the pony compiler (this would likely result in segfaults previously). Cross binary serialisation will only work for binaries of the same bit width (32 bit vs 64 bit), data model (ilp32, lp64, or llp64), and endianness (big endian or little endian) but is not limited to a single platform (for example: one can mix and match x86_64 linux and aarch64 linux because they have the same bitwidth, data model, and endianness).


I think we need an additional section of "serialization can lead to bad things", "this can be an attack model on pony programs and should only be used with trusted input", and what not.

hmmm.. the serialise package currently has the following (some/all of which can be copied to the release notes and also be updated for the changes in this PR):

> Deserialisation is fundamentally unsafe currently: there isn't yet a
> verification pass to check that the resulting object graph maintains a
> well-formed heap or that individual objects maintain any expected local
> invariants. However, if only "trusted" data (i.e. data produced by Pony
> serialisation from the same binary) is deserialised, it will always maintain a
> well-formed heap and all object invariants.
>
> Note that serialised data is not usable between different Pony binaries. This is
> due to the use of type identifiers rather than a heavy-weight self-describing
> serialisation schema. This also means it isn't safe to deserialise something
> serialised by the same program compiled for a different platform.
>
> The Serialise.signature method is provided
> for the purposes of comparing communicating Pony binaries to determine if they
> are the same. Confirming this before deserialising data can help mitigate the
> risk of accidental serialisation across different Pony binaries, but does not on
> its own address the security issues of accepting data from untrusted sources.

a few notes to ensure everyone understands the full scope/limitations of this PR:

1. > The Serialise.signature method is provided
   > for the purposes of comparing communicating Pony binaries to determine if they
   > are the same.

   i missed this in the changes so far and this method needs to be updated to return the signature of the target pony runtime rather than the program..

2. > However, if only "trusted" data (i.e. data produced by Pony
   > serialisation from the same binary) is deserialised, it will always maintain a
   > well-formed heap and all object invariants.

   * with this PR, we can only maintain a well-formed heap and all object invariants if the binary doing the deserialisation knows/uses all the types that were used for serialisation or else the program will assert/fail with ponyint_assert_fail("deserialise offset invalid", __FILE__, __LINE__, __func__) during deserialisation ("deserialise offset invalid" probably needs rewording to be more explicit about the issue/error)..

   * "maintain a well-formed heap and all object invariants" cannot be guaranteed for all combinations of programs because the serialise_id is only based on the type name and not the type's name and fields/memory layout (or full ast).. unfortunately, this will cause data corruption (and possibly memory clobbering) and not an assert/fail like the previous bullet because the serialise_id will be the same between the two binaries even though it shouldn't be (because a type's field layout changed) due to the limitation on how serialise_id is determined currently..

given the above, especially the second bullet under 2, maybe this PR should be reframed to be internal changes in support of cross binary serialisation until the serialise_id can be based on either the full ast or the name and fields/memory layout for each type to remove that caveat/concern (if so, it would make sense to defer updating Serialise.signature until that time also)?
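
The layout caveat above can be illustrated with a small sketch. Everything here is hypothetical (the `sketch_layout_t` struct, both id functions, and the use of 64-bit FNV-1a as the hash). It only shows why an id derived from the type name alone cannot tell two binaries apart when a type keeps its name but changes its fields, and how mixing the field types into the hash would separate them.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in hash step: 64-bit FNV-1a continued from state h. */
static uint64_t fnv1a_str(uint64_t h, const char* s)
{
  for(; *s != '\0'; s++)
  {
    h ^= (uint8_t)*s;
    h *= UINT64_C(0x100000001b3);
  }

  return h;
}

/* Hypothetical description of a type: its name plus its field types,
 * in declaration order. */
typedef struct sketch_layout_t
{
  const char* name;
  const char* const* fields;
  size_t field_count;
} sketch_layout_t;

/* Id based on the name only, as described for the current serialise_id. */
static uint64_t id_from_name(const sketch_layout_t* t)
{
  return fnv1a_str(UINT64_C(0xcbf29ce484222325), t->name);
}

/* Id based on the name and every field type, one possible shape for the
 * follow-up work discussed in this thread. */
static uint64_t id_from_name_and_fields(const sketch_layout_t* t)
{
  uint64_t h = fnv1a_str(UINT64_C(0xcbf29ce484222325), t->name);

  for(size_t i = 0; i < t->field_count; i++)
  {
    h = fnv1a_str(h, ":"); /* separator between name and fields */
    h = fnv1a_str(h, t->fields[i]);
  }

  return h;
}
```

For a type named Point with fields (U32, U32) in one binary and (U32, U64) in another, `id_from_name` returns the same value for both, so deserialisation would proceed against the wrong layout, whereas `id_from_name_and_fields` returns different values and the mismatch could be caught.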

So basically, this doesn't get a changelog entry for now and we don't tell people about the change. that is my interpretation of "internal". is that yours?

> So basically, this doesn't get a changelog entry for now and we don't tell people about the change. that is my interpretation of "internal". is that yours?

yes, unless we're ok with the large footgun that is the second bullet under "2"..

yup. i think internal is a good idea. i'll remove the label. you can remove the release notes if you want, otherwise the bot will toss them on merge.

gonna leave it.. gotta keep the bots employed or they might revolt...

@dipinhora we discussed this during sync. can you update the serialization package documentation to note the changes that were documented in the release notes here?

and remove the release notes.

sure.. but technically:

> Pony built-in serialisation can now be used between binaries compiled with the same version of the pony compiler (this would likely result in segfaults previously). Cross binary serialisation will only work for binaries of the same bit width (32 bit vs 64 bit), data model (ilp32, lp64, or llp64), and endianness (big endian or little endian) but is not limited to a single platform (for example: one can mix and match x86_64 linux and aarch64 linux because they have the same bitwidth, data model, and endianness).

isn't true because the Serialise.signature function still generates a signature that is unique to each program and the serialise package recommends using it as a safeguard to ensure that serialisation will be safe to use..

serialise package docs updated and release notes removed..

note: the serialise package docs are technically no longer correct as per my last comment (#4567 (comment)) because the Serialise.signature function still generates a signature that is unique to each program and the serialise package recommends using it as a safeguard to ensure that serialisation will be safe to use and this will prevent cross binary serialisation from working if that recommendation is followed because the signatures will not match..

jemc · 2024-12-10T19:38:02Z

Are there plans to do a followup PR to do the remaining work to make this a publicly documented feature?

dipinhora · 2024-12-10T19:43:44Z

> Are there plans to do a followup PR to do the remaining work to make this a publicly documented feature?

No.

jemc · 2024-12-10T20:43:22Z

Personally I find it strange to take on additional maintenance burden and implementation complexity for an undocumented/non-public "feature" (which kind of makes it not a feature?).

I'm assuming this is making some use cases possible which weren't possible before, so I'd prefer to document those and mark them as a feature, even if it means we need to include caveats about edge cases that break things.

Without that in place, then somebody could easily revert this PR later because "the docs already say that cross binary serialization doesn't work, so it's safe to simplify this code away", and they would be justified in doing so.

jemc · 2024-12-10T20:45:42Z

I'm curious to hear what @SeanTAllen thinks on this point, because it sounded like he was the one suggesting to remove the release notes.

SeanTAllen · 2024-12-10T21:14:15Z

I view this as an improvement on the existing functionality. It isn't "you can use this" level where it should be public, but it is an improvement on the functionality that exists and a path towards someone eventually bringing it fully home.

dipinhora · 2024-12-10T22:02:33Z

> Personally I find it strange to take on additional maintenance burden and implementation complexity for an undocumented/non-public "feature" (which kind of makes it not a feature?).

fair point.. this doesn't have to be merged if it's not considered worth the tradeoff you mention..

> I'm assuming this is making some use cases possible which weren't possible before, so I'd prefer to document those and mark them as a feature, even if it means we need to include caveats about edge cases that break things.

nope.. not at all (assuming folks are following the recommended practices as mentioned in the serialise package docs).. at least not as the changes in this PR currently stand..

> Without that in place, then somebody could easily revert this PR later because "the docs already say that cross binary serialization doesn't work, so it's safe to simplify this code away", and they would be justified in doing so.

yep..

SeanTAllen · 2025-01-07T19:39:21Z

@dipinhora can you add back release notes that explain the change and what people can get from this additional improvement on the road to our final state?

dipinhora · 2025-01-07T20:30:39Z

> @dipinhora can you add back release notes that explain the change and what people can get from this additional improvement on the road to our final state?

i'm honestly not sure what is expected here...

folks get nothing from this if they're following recommendations/best practices because, as i've noted in #4567 (comment) and #4567 (comment), the Serialise.signature function still generates a signature that is unique to each program and the serialise package recommends using it as a safeguard to ensure that serialisation will be safe to use and this will prevent cross binary serialisation from working if that recommendation is followed because the signatures will not match..

is the goal to document that if folks deviate from the recommendations in the serialise package they can do cross binary serialisation (and also very likely end up using serialisation across incompatible pony binaries causing crashes and/or data corruption)? or is it something else?

SeanTAllen · 2025-01-08T00:41:01Z

@dipinhora so what is the value in this for users? nothing? is the value here only that "we are closer to having cross binary serialization"?

dipinhora · 2025-01-08T02:50:34Z

> @dipinhora so what is the value in this for users? nothing? is the value here only that "we are closer to having cross binary serialization"?

as i mentioned in #4567 (comment) and alluded to in #4567 (comment), the changes in this PR are "under the hood" and until the changes to make the serialise_id based on the type's name and fields/memory layout (or full ast) is implemented and Serialise.signature is updated to return the signature of the target pony runtime rather than the program, there is no value for users and it is better to reframe this as "we are closer to having cross binary serialization"..

SeanTAllen · 2025-01-08T02:53:38Z

@jemc please see the above.

jemc

Sorry I missed the package-level docs update. The docs update there is good enough for me, in terms of showing what is now possible.

ponylang-main added the "discuss during sync" label Dec 5, 2024

remove always false if condition

162a673

SeanTAllen changed the title ~~Deterministic cerealisation for cross binary communication~~ Deterministic serialisation for cross binary communication Dec 5, 2024

SeanTAllen added the "changelog - added" label Dec 5, 2024

SeanTAllen requested a review from a team December 5, 2024 16:25

add release notes

1a80c4e

SeanTAllen reviewed Dec 5, 2024


SeanTAllen reviewed Dec 5, 2024


Update .release-notes/4567.md

e4a197d

SeanTAllen removed the "changelog - added" label Dec 5, 2024

Update serialise package docs and remove release notes

c357dcd

SeanTAllen added the "changelog - changed" label Jan 7, 2025

SeanTAllen removed the "discuss during sync" label Jan 7, 2025

ponylang-main added the "discuss during sync" label Jan 7, 2025

SeanTAllen removed the "changelog - changed" label Jan 8, 2025

jemc approved these changes Jan 14, 2025


SeanTAllen merged commit 1f5607b into ponylang:main Jan 14, 2025
25 checks passed

ponylang-main removed the "discuss during sync" label Jan 14, 2025


