Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Direct serializer for "Serde Way" #139

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

jdarais
Copy link

@jdarais jdarais commented Mar 2, 2025

Hi, I started playing with avro-rs to build some Rust-based Avro tools, and noticed some things I could take a stab at improving in the avro-rs library itself. This PR has a lot of changes in it, but it mostly centers around a new DirectSerializer, who's behavior is described below. Feel free to consider this PR as a whole, or take bits and pieces from it. There are some things in it that could be useful even independently of the addition of DirectSerializer:

This PR proposes a direct serialization of objects when serializing the "Serde Way", which provides roughly a 5x performance improvement, and fixes issues caused by lost schema information when first converting to a Value before serializing to the Write stream (#70).

Some notes on these changes:

  • An new Serializer implementation, DirectSerializer, provides direct serialization of a type using the "Serde Way". The DirectSerializer is initialized with the schema to be used for writing data, and uses that schema as a guide to know how to serialize the serde types that it encounters.
  • The existing serde benchmark tests seemed to only test serialization of Value types, which is only part of the process in the "Serde Way", so some additional benchmark tests were added to measure the end-to-end performance of "Serde Way" serialization
  • The deserializer appears to be missing some "Avro union / Rust enum" (de)serialization capability. In the existing implementation it appears that Rust enums can be (de)serialized to/from Avro union types, but only if the #[serde(untagged)] attribute is added, which serializes an enum value as the variant type itself rather than a union type, and it's up to the (de)serializer to detect that the schema calls for a union, search the union schemas for a match, and use the first schema found that matches the data. This PR adds a UnionDeserializer to de.rs, which, in combination with DirectSerializer, allows "Avro union / Rust enum" (de)serialization without requiring the #[serde(untagged)] attribute.
  • I was able to replicate almost all behavior of the existing implementation using Value as an intermediate step, but one thing I was not able to replicate is serialization of enums that do have the #[serde(untagged)] attribute. The closest I got was making the DirectSerializer able to search a union schema for the first schema type that matches the input data, but in the case of "record" types, there's no easy way to ensure that the field names are also a match. This feature of schema matching on write does seem a bit odd to me, since the writer should know the structure of the data being written, and schema matching should only really be a concern for the reader, where schema resolution is happening between a reader and writer schema. But, worth noting that this difference in behavior is not backwards-compatible.

All unit tests are passing. You can take a look at the changes in the unit tests to see where there are differences in behavior.

@martin-g
Copy link
Member

martin-g commented Mar 3, 2025

Thanks for the PR!
Please resolve the merge conflicts!

@jdarais jdarais force-pushed the jd-feat-direct-serializer branch from e2b0d22 to 422fbaa Compare March 3, 2025 18:49
@jdarais
Copy link
Author

jdarais commented Mar 3, 2025

Oops, I hadn't realized my branch had fallen behind that much. Merge conflicts should be resolved now.

@@ -55,6 +55,11 @@ pub struct EnumDeserializer<'de> {
input: &'de [(String, Value)],
}

struct UnionDeserializer<'de> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some Rustdoc.
I know that the other structs here may not have docs, but this is one of the reasons why the SerDe functionality is not well maintained in Avro.

  • what is this struct / why is it needed
  • what are its field for
  • etc.

@@ -230,6 +241,60 @@ impl<'de> de::VariantAccess<'de> for EnumDeserializer<'de> {
}
}

impl<'de> de::EnumAccess<'de> for UnionDeserializer<'de> {
type Error = Error;
type Variant = Self;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this type is helpful here.
It usage below (Self::Variant) makes it more confusing.
Can we use just Self ?!

const COLLECTION_SERIALIZER_DEFAULT_INIT_ITEM_CAPACITY: usize = 32;
const SINGLE_VALUE_INIT_BUFFER_SIZE: usize = 128;

pub struct DirectSerializeSeq<'a, 's, W: Write> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rustdoc please for all public items!
With as much as possible details because even the current ser.rs and de.rs are hard to maintain.

Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants