Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CborCodec: bytes missing when decoding received messages #6

Closed
sandreae opened this issue Jul 4, 2023 · 6 comments
Closed

CborCodec: bytes missing when decoding received messages #6

sandreae opened this issue Jul 4, 2023 · 6 comments

Comments

@sandreae
Copy link
Contributor

sandreae commented Jul 4, 2023

We're using CborCodec in the handler for a custom network behaviour here and I'm observing some unexpected behaviour which I think is originating somewhere in this crate.

The issue I'm seeing is that bytes encoded and sent from one peer don't arrive in their entirety on the receiving end. By adding logging to decode and encode in CborCodec I was able to observe more closely what is occurring.

Peer A: logging from within encode on CborCodec

These bytes make up one complete message of our protocol, they're encoded on peer A and sent to Peer B.

=== TIME SystemTime { tv_sec: 1688365471, tv_nsec: 119830695 } ===
=== BUFFER ===
b"\x83\x0b\0\x8d\x82xD0020ca1d2b85567443b2db42bba286c07fc64b1a46f2d3e67ba819550b09e18eccd0\x81xD0020ca1d2b85567443b2db42bba286c07fc64b1a46f2d3e67ba819550b09e18eccd0\x82xD00207c0dec3322d78efc419432c4a215b3635cb17b3cf20674ff47ec3aed0fa3f24e\x81xD00207c0dec3322d78efc419432c4a215b3635cb17b3cf20674ff47ec3aed0fa3f24e\x82xD0020c3560ffcfa50f14ab01740cd8f832e0fb35035eeb8cd4aca124bac2d42a47a2a\x81xD0020c3560ffcfa50f14ab01740cd8f832e0fb35035eeb8cd4aca124bac2d42a47a2a\x82xD002099dd2c7a77ce59fddb48bcfd18baefc73888849436d471d4aba40aa8f027e13a\x81xD002099dd2c7a77ce59fddb48bcfd18baefc73888849436d471d4aba40aa8f027e13a\x82xD00209f6b09271e1a71ae118d1db23247e3f9365a067d05a7c600f6294efe2853915a\x81xD00209f6b09271e1a71ae118d1db23247e3f9365a067d05a7c600f6294efe2853915a\x82xD0020d0d77bc953af8d55266e3e799f5028bb795a231075c1611394cabe471c0ffb60\x81xD0020d0d77bc953af8d55266e3e799f5028bb795a231075c1611394cabe471c0ffb60\x82xD002070f0b4df30e70117f2e33c32ad3f8a5dad01f6619c9c0a822ddb8b59eecc2c1b\x81xD002070f0b4df30e70117f2e33c32ad3f8a5dad01f6619c9c0a822ddb8b59eecc2c1b\x82xD00207240a08a925afbb93cb3a5ccebfef0cc031781bcfba8d8efc943389242d8fbf0\x81xD00207240a08a925afbb93cb3a5ccebfef0cc031781bcfba8d8efc943389242d8fbf0\x82xD00207b1f675a0c5c810e1983734a7e2735ae9fc9e9902897c6a518e61533bfa2a199\x81xD00207b1f675a0c5c810e1983734a7e2735ae9fc9e9902897c6a518e61533bfa2a199\x82xD0020a222cd828dc0a23854794da35128a3e5cf7ed5bc23a2391da1a7b3a857f87d87\x81xD0020a222cd828dc0a23854794da35128a3e5cf7ed5bc23a2391da1a7b3a857f87d87\x82xD0020befe8ce6aa710223f4e714c0a46e9159c95e5f881cc470b6837e2c168f7b2c0f\x81xD0020befe8ce6aa710223f4e714c0a46e9159c95e5f881cc470b6837e2c168f7b2c0f\x82xD0020d94d8d0044d4364c1cb7625b07dd6d57e4e48690088a3216a7c836cd1bdaa8c4\x81xD0020d94d8d0044d4364c1cb7625b07dd6d57e4e48690088a3216a7c836cd1bdaa8c4\x82xD00200b9458f7190832f74b37644edd3b5462cce2a4129a1a4541afa90750c323d016\x81xD00200b9458f7190832f74b37644edd3b5462cce2a4129a1a4541afa90750c323d016"

Peer B: logging from within decode on CborCodec

These bytes arrive on peer B, they are roughly the first half of the bytes encoded above.

=== TIME SystemTime { tv_sec: 1688365471, tv_nsec: 120452065 } ===
=== DECODE BYTES ===
b"\x83\x0b\0\x8d\x82xD0020ca1d2b85567443b2db42bba286c07fc64b1a46f2d3e67ba819550b09e18eccd0\x81xD0020ca1d2b85567443b2db42bba286c07fc64b1a46f2d3e67ba819550b09e18eccd0\x82xD00207c0dec3322d78efc419432c4a215b3635cb17b3cf20674ff47ec3aed0fa3f24e\x81xD00207c0dec3322d78efc419432c4a215b3635cb17b3cf20674ff47ec3aed0fa3f24e\x82xD0020c3560ffcfa50f14ab01740cd8f832e0fb35035eeb8cd4aca124bac2d42a47a2a\x81xD0020c3560ffcfa50f14ab01740cd8f832e0fb35035eeb8cd4aca124bac2d42a47a2a\x82xD002099dd2c7a77ce59fddb48bcfd18baefc73888849436d471d4aba40aa8f027e13a\x81xD002099dd2c7a77ce59fddb48bcfd18baefc73888849436d471d4aba40aa8f027e13a\x82xD00209f6b09271e1a71ae118d1db23247e3f9365a067d05a7c600f6294efe2853915a\x81xD00209f6b09271e1a71ae118d1db23247e3f9365a067d05a7c600f6294efe2853915a\x82xD0020d0d77bc953af8d55266e3e799f5028bb795a231075c1611394cabe471c0ffb60\x81xD0020d0d77bc953af8d55266e3e799f5028bb795a231075c1611394cabe471c0ffb60\x82xD002070f0b4df30e70117f2e33c32ad3f8a5dad01f6619c9c0a822ddb8b59eecc2c1b\x81xD002070f0b4df30e70117f2e33c32ad3f8a5dad01f6619c9c0a822ddb8b59eecc2c1b\x82xD00207240a08a925afbb93cb3a5ccebfef0cc031781bcfba8d8efc943389242d8fbf0\x81xD00207240a08a925afbb93cb3a5ccebfef0cc031781bcfba8d8efc943389242d8fbf0\x82xD00207b1f675a0c5c810e1983734"
=== DECODE EOF ERR ===
ErrorImpl { code: EofWhileParsingValue, offset: 1170 }

The next set of bytes arrive, these seem to be the final few bytes from the incomplete message bytes received above.

=== TIME SystemTime { tv_sec: 1688365471, tv_nsec: 120844452 } ===
=== DECODE BYTES ===
b"00207b1f675a0c5c810e1983734"
=== DECODE ERR ===
ErrorImpl { code: Message("invalid type: integer `-17`, expected p2panda replication message"), offset: 0 }

This occurs only sometimes, but has been observed on multiple systems. We wondered if it could be related to max buffer sizes being hit, but haven't been able to come up with a concrete theory here.

Solution attempts

Looking at decode on CborCodec I wondered if the error handling was correct. If I add a return in the eof error case (as seen below) then the above issue doesn't occur anymore. But I'm not confident this is what was causing the issue or now we just hide the problem.

/// Decoder impl parses cbor objects from bytes
impl<Enc, Dec> Decoder for CborCodec<Enc, Dec>
where
    for<'de> Dec: Deserialize<'de> + std::fmt::Debug + 'static,
    for<'de> Enc: Serialize + std::fmt::Debug + 'static,
{
    type Item = Dec;
    type Error = CborCodecError;

    fn decode(&mut self, buf: &mut BytesMut) -> Result<Option<Self::Item>, Self::Error> {
        println!("=== TIME {:?} ===", std::time::SystemTime::now());
        println!("=== DECODE BYTES ===");
        println!("{:?}", buf);

        // Build deserializer
        let mut de = serde_cbor::Deserializer::from_slice(&buf);

        // Attempt deserialization
        let res: Result<Dec, _> = serde::de::Deserialize::deserialize(&mut de);

        // If we ran out before parsing, return none and try again later
        let res = match res {
            Ok(v) => {
                println!("=== DECODE OK ===");
                println!("{:?}", v);
                Ok(Some(v))
            }
            Err(e) if e.is_eof() => {
                println!("=== DECODE EOF ERR ===");
                println!("{:?}", e);
                return Ok(None) // Added return here            
            }
            Err(e) => {
                println!("=== DECODE ERR ===");
                println!("{:?}", e);
                Err(e.into())
            }
        };

        // Update offset from iterator
        let offset = de.byte_offset();

        // Advance buffer
        buf.advance(offset);

        res
    }
}

Minimal reproducible example....

Unfortunately I don't have a simple example to share, If that's the only way to progress then I'm happy to create one, it's non-trivial work though so I'm opening this issue in case there is a something to be done already.

@sandreae sandreae changed the title Bytes missing when decoding received messages CborCodec: bytes missing when decoding received messages Jul 4, 2023
@adzialocha
Copy link

adzialocha commented Jul 5, 2023

Adding the return when we're prematurely reaching the end of the input data seems right, otherwise we would increase the offset of the buffer and look only at the second part of the data next time (which would be as "broken" as the first part).

One would think that the offset should not increase when calling byte_offset after an error occurred .. Could you also log the offset @sandreae before and after deserialize is called to see if it advances anyhow?

@mxinden
Copy link
Owner

mxinden commented Jul 11, 2023

I am sorry for the late reply here. Thanks for reporting the issue.

Looking at decode on CborCodec I wondered if the error handling was correct. If I add a return in the eof error case (as seen below) then the above issue doesn't occur anymore. But I'm not confident this is what was causing the issue or now we just hide the problem.

You are right. We should not advance buf in case of Ok(None). Good catch!

Unfortunately I don't have a simple example to share, If that's the only way to progress then I'm happy to create one, it's non-trivial work though so I'm opening this issue in case there is a something to be done already.

How about the following unit test:

  1. Encode some larger struct into a Vec<u8>.
  2. Create a BytesMut with half of the above Vec<u8>.
  3. Expect a call to <CborCodec as Decoder>::decode with the BytesMut to return Ok(None).
  4. Add the remaining half of the bytes from the Vec<u8> into BytesMut.
  5. Expect a call to <CborCodec as Decoder>::decode with the BytesMut to return Ok(Some(your_struct)).

One would think that the offset should not increase when calling byte_offset after an error occurred .. Could you also log the offset @sandreae before and after deserialize is called to see if it advances anyhow?

In case of an Err I would assume that <CborCodec as Decoder>::decode is not called again, thus I don't think it matters whether buf.advance is called. That said, certainly won't hurt to not call it.

@sandreae would you mind creating a pull request with a fix and the above described unit test? Let me know in case you need any help.

@adzialocha
Copy link

adzialocha commented Jul 11, 2023

Ah great! Thanks for confirming, we will make a PR.

In case of an Err I would assume that ::decode is not called again, thus I don't think it matters whether buf.advance is called. That said, certainly won't hurt to not call it.

Yes, that's right, though I didn't mean the critical errors but the EOF "errors" aka Err(e) if e.is_eof().

@sandreae
Copy link
Contributor Author

Hi @mxinden, thanks for looking into this and the suggestions, it was a bit of a head-scratcher so I'm glad we got to the bottom of it.

I can make a PR with the change and unit test in the next few days.

@sandreae
Copy link
Contributor Author

Opened a PR here: #7 👍

@mxinden
Copy link
Owner

mxinden commented Jul 24, 2023

With #7 merged, closing. here. Thank you for the help!

@mxinden mxinden closed this as completed Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants