-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Direct serializer for "Serde Way" #139
base: main
Are you sure you want to change the base?
Conversation
Thanks for the PR! |
e2b0d22
to
422fbaa
Compare
Oops, I hadn't realized my branch had fallen behind that much. Merge conflicts should be resolved now. |
Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>
@@ -55,6 +55,11 @@ pub struct EnumDeserializer<'de> { | |||
input: &'de [(String, Value)], | |||
} | |||
|
|||
struct UnionDeserializer<'de> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add some Rustdoc.
I know that the other structs here may not have docs, but this is one of the reasons why the SerDe functionality is not well maintained in Avro.
- what is this struct / why is it needed
- what are its field for
- etc.
@@ -230,6 +241,60 @@ impl<'de> de::VariantAccess<'de> for EnumDeserializer<'de> { | |||
} | |||
} | |||
|
|||
impl<'de> de::EnumAccess<'de> for UnionDeserializer<'de> { | |||
type Error = Error; | |||
type Variant = Self; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure this type is helpful here.
It usage below (Self::Variant) makes it more confusing.
Can we use just Self
?!
const COLLECTION_SERIALIZER_DEFAULT_INIT_ITEM_CAPACITY: usize = 32; | ||
const SINGLE_VALUE_INIT_BUFFER_SIZE: usize = 128; | ||
|
||
pub struct DirectSerializeSeq<'a, 's, W: Write> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rustdoc please for all public items!
With as much as possible details because even the current ser.rs and de.rs are hard to maintain.
Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>
Hi, I started playing with avro-rs to build some Rust-based Avro tools, and noticed some things I could take a stab at improving in the avro-rs library itself. This PR has a lot of changes in it, but it mostly centers around a new
DirectSerializer
, who's behavior is described below. Feel free to consider this PR as a whole, or take bits and pieces from it. There are some things in it that could be useful even independently of the addition ofDirectSerializer
:This PR proposes a direct serialization of objects when serializing the "Serde Way", which provides roughly a 5x performance improvement, and fixes issues caused by lost schema information when first converting to a
Value
before serializing to the Write stream (#70).Some notes on these changes:
DirectSerializer
, provides direct serialization of a type using the "Serde Way". TheDirectSerializer
is initialized with the schema to be used for writing data, and uses that schema as a guide to know how to serialize the serde types that it encounters.Value
types, which is only part of the process in the "Serde Way", so some additional benchmark tests were added to measure the end-to-end performance of "Serde Way" serialization#[serde(untagged)]
attribute is added, which serializes an enum value as the variant type itself rather than a union type, and it's up to the (de)serializer to detect that the schema calls for a union, search the union schemas for a match, and use the first schema found that matches the data. This PR adds aUnionDeserializer
tode.rs
, which, in combination withDirectSerializer
, allows "Avro union / Rust enum" (de)serialization without requiring the#[serde(untagged)]
attribute.Value
as an intermediate step, but one thing I was not able to replicate is serialization of enums that do have the#[serde(untagged)]
attribute. The closest I got was making theDirectSerializer
able to search a union schema for the first schema type that matches the input data, but in the case of "record" types, there's no easy way to ensure that the field names are also a match. This feature of schema matching on write does seem a bit odd to me, since the writer should know the structure of the data being written, and schema matching should only really be a concern for the reader, where schema resolution is happening between a reader and writer schema. But, worth noting that this difference in behavior is not backwards-compatible.All unit tests are passing. You can take a look at the changes in the unit tests to see where there are differences in behavior.