RUST-1132 Implement `DeserializeSeed` for owned and borrowed raw documents #433
@@ -370,3 +372,32 @@ impl Undefined {
}
}
}

#[derive(Debug, Deserialize)]
These are copied as-is from the original raw BSON visitor.
};

/// A visitor used to deserialize types backed by raw BSON.
pub(crate) struct OwnedOrBorrowedRawBsonVisitor;
The contents of this file are mostly copied as-is from `src/raw/serde.rs`; I will point out the differences below.
.into())
}

fn visit_seq<A>(self, seq: A) -> Result<Self::Value, A::Error>
This method now uses the seeded visitor to deserialize a sequence.
let doc = RawDocument::from_bytes(bson).map_err(SerdeError::custom)?;
Ok(RawBsonRef::Array(RawArray::from_doc(doc)).into())
}
_ => {
This case also uses the seeded deserializer. The extjson logic above is the same (except for the models being moved to a different module).
src/raw/serde/seeded_visitor.rs
fn finish_document(&mut self, index: usize) -> Result<(), String> {
    self.buffer.push_byte(0);

    let length_bytes = match i32::try_from(self.buffer.len() - index) {
I opted to validate only the cast to `i32` here, even though some nested types also calculate and append length bytes. If any of those nested lengths overflows, the larger length computed here, which includes them, will also overflow, so there is no need to double-check.
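As a rough illustration of the length back-fill discussed above, here is a minimal std-only sketch. It assumes a plain `Vec<u8>` in place of the PR's buffer type; `start_document`, `append_int32`, and the error message are hypothetical names chosen for this example and are not taken from the PR.

```rust
/// Reserves four placeholder bytes for a document's length prefix and
/// returns the index at which the document starts. (Hypothetical helper.)
fn start_document(buffer: &mut Vec<u8>) -> usize {
    let index = buffer.len();
    buffer.extend_from_slice(&[0u8; 4]);
    index
}

/// Appends an int32 element: type tag 0x10, the key as a C string, then the
/// value in little-endian order. (Hypothetical helper.)
fn append_int32(buffer: &mut Vec<u8>, key: &str, value: i32) {
    buffer.push(0x10);
    buffer.extend_from_slice(key.as_bytes());
    buffer.push(0);
    buffer.extend_from_slice(&value.to_le_bytes());
}

/// Terminates the document that starts at `index` and back-fills its length.
fn finish_document(buffer: &mut Vec<u8>, index: usize) -> Result<(), String> {
    // BSON documents end with a null byte.
    buffer.push(0);

    // The length covers the whole document: the 4-byte prefix, the elements,
    // and the terminator. This is the only place the cast is checked.
    let length = i32::try_from(buffer.len() - index)
        .map_err(|_| "document exceeds maximum length".to_string())?;
    buffer[index..index + 4].copy_from_slice(&length.to_le_bytes());
    Ok(())
}
```

Because an enclosing document's length always includes the bytes of everything nested inside it, a nested length that does not fit in an `i32` implies that the enclosing length does not fit either, which is the reasoning behind checking only here.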
Some(element_type) => element_type,
None => {
    // Remove the additional key and padding for the element that was not present.
    self.buffer.drain(element_type_index..self.buffer.len());
This isn't ideal, but `SeqAccess` does not contain any `has_next` or required length method, so there's no way to know whether there's another element until `next_element_seed` is called, which has to happen after the key and element type byte are already appended.
The only way around this I can think of would be to create a temporary buffer for the element type tag, key string, and value, and only append that temp to the main buffer if there actually is one. I'm pretty sure that the way you have it will be better performance-wise, though.
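The speculative-append approach discussed in this thread can be sketched roughly as follows. This is a std-only approximation, not the PR's code: a plain iterator stands in for `SeqAccess`, every element is assumed to be an int32, and `append_array_elements` is a hypothetical name.

```rust
/// Appends the items of `elements` as int32 entries of a BSON array body and
/// returns how many were written. (Hypothetical helper; a plain iterator
/// stands in for serde's `SeqAccess`.)
fn append_array_elements<I: Iterator<Item = i32>>(buffer: &mut Vec<u8>, mut elements: I) -> usize {
    let mut count: usize = 0;
    loop {
        // Speculatively write a placeholder type byte and the key ("0", "1", ...)
        // before knowing whether another element exists.
        let element_type_index = buffer.len();
        buffer.push(0);
        buffer.extend_from_slice(count.to_string().as_bytes());
        buffer.push(0);

        match elements.next() {
            Some(value) => {
                buffer.extend_from_slice(&value.to_le_bytes());
                // The type is only known once the value has been read.
                buffer[element_type_index] = 0x10;
                count += 1;
            }
            None => {
                // Remove the key and placeholder written for the missing element.
                buffer.truncate(element_type_index);
                break;
            }
        }
    }
    count
}
```

The cost of the wasted write is one short key per sequence, whereas the temporary-buffer alternative would copy every element a second time, which is consistent with the expectation above that the current approach performs better.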
I've done some investigation into merging the visitors (which I think is a good idea), but I've run into a few problems that make this more complicated than expected. The crux of the issue is that the

Take, for example, an attempt to deserialize a borrowed string into

This is definitely not an intractable problem. The

Wanted to sanity check here before sinking another day or two into this work, which has already taken me longer than expected 🙂 @abr-egn WDYT about not unifying the visitors right now and possibly doing so in the future as part of a larger cleanup of this library's internals?

That makes sense to me. Thank you for taking the time to investigate this path!
LGTM!
LGTM.
I left a suggestion to consider replacing `CowByteBuffer` to simplify, but I do not consider it a necessary change to merge.
This PR adds a new `SeededVisitor` type to deserialize into raw documents/arrays using a single buffer. There are no tests added as the existing corpus/serde tests already provide coverage for deserializing into these types. I did update the test runners in this library and in the driver (for which I will make a separate PR) to deserialize test JSON into raw documents for good measure, which, in addition to adding a bit more testing of this implementation, should be a small performance improvement for spec tests.

Some very basic benchmarking shows a 50% speed improvement when deserializing a JSON object with 80 layers of object nesting into a `RawDocumentBuf`. I will follow up with some more thorough benchmarking.

EDIT: Further benchmarking against the `deep_bson.json` and `flat_bson.json` files generated for the BSON microbenchmarks consistently shows that this new implementation is about three times faster than the existing implementation when deserializing those files into a `RawDocumentBuf`.
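To make the single-buffer idea concrete, the sketch below writes a nested value tree into one `Vec<u8>`, with each nested document appended in place instead of being built separately and copied. It is an illustration under assumptions, not the PR's implementation: the `Value` enum and `append_document` are hypothetical, only int32 and embedded-document elements are handled, and keys are assumed to contain no null bytes.

```rust
/// A tiny stand-in for the values a deserializer might produce. (Hypothetical.)
enum Value {
    Int32(i32),
    Document(Vec<(String, Value)>),
}

/// Appends `fields` as a BSON document to the shared `buffer`. Nested
/// documents recurse into the same buffer, so no intermediate allocation or
/// copy is needed per nesting level.
fn append_document(buffer: &mut Vec<u8>, fields: &[(String, Value)]) -> Result<(), String> {
    let start = buffer.len();
    buffer.extend_from_slice(&[0u8; 4]); // placeholder for the length prefix

    for (key, value) in fields {
        // Type tag: 0x10 for int32, 0x03 for an embedded document.
        buffer.push(match value {
            Value::Int32(_) => 0x10,
            Value::Document(_) => 0x03,
        });
        buffer.extend_from_slice(key.as_bytes());
        buffer.push(0);
        match value {
            Value::Int32(n) => buffer.extend_from_slice(&n.to_le_bytes()),
            Value::Document(inner) => append_document(buffer, inner)?,
        }
    }

    buffer.push(0); // document terminator
    let length = i32::try_from(buffer.len() - start)
        .map_err(|_| "document exceeds maximum length".to_string())?;
    buffer[start..start + 4].copy_from_slice(&length.to_le_bytes());
    Ok(())
}
```

With deeply nested input, an approach that builds each level in its own buffer copies the innermost bytes once per enclosing level; writing in place avoids that, which may account for part of the speedup reported for the deeply nested benchmark.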