Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deterministic serialisation for cross binary communication #4567

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

dipinhora
Copy link
Contributor

Prior to this commit, serialisation used the type_id of the types that it serialised and the type_ids assigned to types can vary depending on the program being compiled. This makes it impossible for cross binary serialisation even if the same pony version and types are used.

This commit changes things so that serialisation now uses a determnistically assigned serialise_id instead of the type_id. This new serialise_id is guaranteed to be the same for the same pony version and type (see details for caveats).

The changes include:

  • adding a size_t size serialise_id to pony_type_t
  • hashes the reach_type_t->name to generate either a 63 bit or 31 bit serialise_id depending on target platform
  • the hash for the serialise_id uses a key that is based on the blake2 encoding of the pony version, the target data model, and the endianness (i.e. pony-0.58.7-ac1c51b69-lp64-le). This is to ensure that only programs with the same key will have the same serialise_id to prevent silent data corruption during deserialisation due to subtle differences in platforms. This means that cross binary serialisation will only work for binaries of the same bit width (32 bit vs 64 bit), data model (ilp32, lp64, or llp64), and endianness (big endian or little endian)
  • serialisation now uses the serialise_id instead of the type_id
  • deserialisation now uses the serialise_id to look up the type_id to get the entry in the descriptor table instead of looking up the entry in the descriptor table directly based on type_id

yes, this is a ressurrection of #2949

Prior to this commit, serialisation used the `type_id` of the
types that it serialised and the `type_id`s assigned to types
can vary depending on the program being compiled. This makes it
impossible for cross binary serialisation even if the same pony
version and types are used.

This commit changes things so that serialisation now uses a
determnistically assigned `serialise_id` instead of the `type_id`.
This new `serialise_id` is guaranteed to be the same for the same
pony version and type (see details for caveats).

The changes include:

* adding a `size_t` size `serialise_id` to `pony_type_t`
* hashes the reach_type_t->name to generate either a 63 bit or
  31 bit `serialise_id` depending on target platform
* the hash for the `serialise_id` uses a `key` that is based on the
  blake2 encoding of the pony version, the target data model, and
  the endianness (i.e. `pony-0.58.7-ac1c51b69-lp64-le`). This is
  to ensure that only programs with the same `key` will have the
  same `serialise_id` to prevent silent data corruption during
  deserialisation due to subtle differences in platforms. This
  means that cross binary serialisation will only work for binaries
  of the same bit width (32 bit vs 64 bit), data model (ilp32, lp64,
  or llp64), and endianness (big endian or little endian)
* serialisation now uses the `serialise_id` instead of the `type_id`
* deserialisation now uses the `serialise_id` to look up the
  `type_id` to get the entry in the descriptor table instead of
  looking up the entry in the descriptor table directly based on
  `type_id`
@ponylang-main ponylang-main added the discuss during sync Should be discussed during an upcoming sync label Dec 5, 2024
@SeanTAllen SeanTAllen changed the title Deterministic cerealisation for cross binary communication Deterministic serialisation for cross binary communication Dec 5, 2024
@SeanTAllen SeanTAllen added the changelog - added Automatically add "Added" CHANGELOG entry on merge label Dec 5, 2024
@ponylang-main
Copy link
Contributor

Hi @dipinhora,

The changelog - added label was added to this pull request; all PRs with a changelog label need to have release notes included as part of the PR. If you haven't added release notes already, please do.

Release notes are added by creating a uniquely named file in the .release-notes directory. We suggest you call the file 4567.md to match the number of this pull request.

The basic format of the release notes (using markdown) should be:

## Title

End user description of changes, why it's important,
problems it solves etc.

If a breaking change, make sure to include 1 or more
examples what code would look like prior to this change
and how to update it to work after this change.

Thanks.

@SeanTAllen SeanTAllen requested a review from a team December 5, 2024 16:25
@SeanTAllen
Copy link
Member

What happens in the case of hash collisions with ids? That appears to be a possibility, yes?

@dipinhora
Copy link
Contributor Author

What happens in the case of hash collisions with ids? That appears to be a possibility, yes?

yes, hash collisions are possible.. no special handling is done for those in the current implementation but debug builds should fail with an assertion on startup due to (https://github.com/ponylang/ponyc/pull/4567/files#diff-8791fb0991855d735f325e9ce5c0adf9bb9f71331cc032c8e38d913979483c55R85-R97):

bool ponyint_serialise_setup(pony_type_t** table, size_t table_size,
  desc_offset_lookup_fn desc_table_offset_lookup)
{
#ifndef PONY_NDEBUG
  for(uint32_t i = 0; i < table_size; i++)
  {
    if(table[i] != NULL)
    {
      pony_assert(table[i]->id == i);
      pony_assert(desc_table_offset_lookup(table[i]->serialise_id) == i);
    }
  }
#endif

.release-notes/4567.md Outdated Show resolved Hide resolved
@@ -0,0 +1,3 @@
## Deterministic cerealisation for cross binary communication

Pony built-in serialisation can now be used between binaries compiled with the same version of the pony compiler (this would likely result in segfaults previously). Cross binary serialisation will only work for binaries of the same bit width (32 bit vs 64 bit), data model (ilp32, lp64, or llp64), and endianness (big endian or little endian) but is not limited to a single platform (for example: one can mix and match x86_64 linux and aarch64 linux because they have the same bitwidth, data model, and endianness).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need an additional section of "serialization can lead to bad things", "this can be an attack model on pony programs and should only be used with trusted input", and what not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm.. the serialise package currently has the following (some/all of which can be copied to the release notes and also be updated for the changes in this PR):

Deserialisation is fundamentally unsafe currently: there isn't yet a
verification pass to check that the resulting object graph maintains a
well-formed heap or that individual objects maintain any expected local
invariants. However, if only "trusted" data (i.e. data produced by Pony
serialisation from the same binary) is deserialised, it will always maintain a
well-formed heap and all object invariants.

Note that serialised data is not usable between different Pony binaries. This is
due to the use of type identifiers rather than a heavy-weight self-describing
serialisation schema. This also means it isn't safe to deserialise something
serialised by the same program compiled for a different platform.

The Serialise.signature method is provided
for the purposes of comparing communicating Pony binaries to determine if they
are the same. Confirming this before deserialising data can help mitigate the
risk of accidental serialisation across different Pony binaries, but does not on
its own address the security issues of accepting data from untrusted sources.


a few notes to ensure everyone understands the full scope/limitations of this PR:

  1. The Serialise.signature method is provided
    for the purposes of comparing communicating Pony binaries to determine if they
    are the same.

    i missed this in the changes so far and this method needs to be updated to return the signature of the target pony runtime rather than the program..

  2. However, if only "trusted" data (i.e. data produced by Pony
    serialisation from the same binary) is deserialised, it will always maintain a
    well-formed heap and all object invariants.

    • with this PR, we can only maintain a well-formed heap and all object invariants if the binary doing the deserialisation knows/uses all the types that were used for serialisation or else the program will assert/fail with ponyint_assert_fail("deserialise offset invalid", __FILE__, __LINE__, __func__) during deserialisation (deserialise offset invalid probably needs rewording to be more explicit about the issue/error)..

    • maintain a well-formed heap and all object invariants cannot be guaranteed for all combinations of programs because the serialise_id is only based on the type name and not the type's name and fields/memory layout (or full ast).. unfortunately, this will cause data corruption (and possibly memory clobbering) and not an assert/fail like the previous bullet because the serialise_id will be the same between the two binaries even though it shouldn't be (because a type's field layout changed) due to the limitation on how serialise_id is determined currently..

given the above, especially the second bullet under 2, maybe this PR should be reframed to be internal changes in support of cross binary serialisation until the serialise_id can be based on either the full ast or the name and fields/memory layout for each type to remove that caveat/concern (if so, it would make sense to defer updating Serialise.signature until that time also)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically, this doesn't get a changelog entry for now and we don't tell people about the change. that is my interpretation of "internal". is that yours?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically, this doesn't get a changelog entry for now and we don't tell people about the change. that is my interpretation of "internal". is that yours?

yes, unless we're ok with the large footgun that is the second bullet under "2"..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup. i think internal is a good idea. i'll remove the label. you can remove the release notes if you want, otherwise the bot will toss them on merge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gonna leave it.. gotta keep the bots employed or they might revolt...

@SeanTAllen SeanTAllen removed the changelog - added Automatically add "Added" CHANGELOG entry on merge label Dec 5, 2024
@jemc
Copy link
Member

jemc commented Dec 10, 2024

Are there plans to do a followup PR to do the remaining work to make this a publicly documented feature?

@dipinhora
Copy link
Contributor Author

Are there plans to do a followup PR to do the remaining work to make this a publicly documented feature?

No.

@jemc
Copy link
Member

jemc commented Dec 10, 2024

Personally I find it strange to take on additional maintenance burden and implementation complexity for an undocumented/non-public "feature" (which kind of makes it not a feature?).

I'm assuming this is making some use cases possible which weren't possible before, so I'd prefer to document those and mark them as a feature, even if it means we need to include caveats about edge cases that break things.

Without that in place, then somebody could easily revert this PR later because "the docs already say that cross binary serialization doesn't work, so it's safe to simplify this code away", and they would be justified in doing so.

@jemc
Copy link
Member

jemc commented Dec 10, 2024

I'm curious to hear what @SeanTAllen thinks on this point, because it sounded like he was the one suggesting to remove the release notes.

@SeanTAllen
Copy link
Member

I view this as an improvement on the existing functionality. It isn't "you can use this" level where it should be public, but it is an improvement on the functionality that exists and a path towards someone eventually bringing it fully home.

@dipinhora
Copy link
Contributor Author

Personally I find it strange to take on additional maintenance burden and implementation complexity for an undocumented/non-public "feature" (which kind of makes it not a feature?).

fair point.. this doesn't have to be merged if it's not considered worth the tradeoff you mention..

I'm assuming this is making some use cases possible which weren't possible before, so I'd prefer to document those and mark them as a feature, even if it means we need to include caveats about edge cases that break things.

nope.. not at all (assuming folks are following the recommended practices as mentioned in the serialise package docs).. at least not as the changes in this PR currently stand..

Without that in place, then somebody could easily revert this PR later because "the docs already say that cross binary serialization doesn't work, so it's safe to simplify this code away", and they would be justified in doing so.

yep..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss during sync Should be discussed during an upcoming sync
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants