
feat(gossipsub): use Bytes to cut down on allocations #4751

Open · wants to merge 7 commits into master

Conversation

@joshuef (Contributor) commented Oct 27, 2023

Description

Sets up gossipsub to use Bytes internally in place of Vec<u8> for message processing. This should help avoid potentially costly clones of the message as it is processed and published.
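To illustrate the core idea (a standalone sketch, not code from this PR): cloning a Vec<u8> copies the whole buffer, while cloning Bytes only bumps a reference count.

```rust
use bytes::Bytes;

fn main() {
    // Cloning a Vec<u8> allocates and copies the entire buffer.
    let vec: Vec<u8> = vec![0u8; 64 * 1024];
    let vec_clone = vec.clone();

    // Converting a Vec<u8> into Bytes takes ownership without copying,
    // and cloning Bytes is O(1): it only bumps a reference count.
    let bytes: Bytes = Bytes::from(vec);
    let bytes_clone = bytes.clone();

    assert_eq!(vec_clone.len(), bytes_clone.len());
}
```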

Notes & open questions

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation (I don't think any are needed; it's all internal?)
  • I have added tests that prove my fix is effective or that my feature works (well... the current tests pass for me. Not sure more are needed?)
  • A changelog entry has been made in the appropriate crates

@joshuef (Contributor, Author) commented Oct 27, 2023

I purposefully avoided converting the sig/key fields to Bytes at this point, just to keep this slim, but I don't think there'd be any blockers there?

(Also: I debated an API change to just Bytes, though I don't think the current change would cause a clone if you already pass Bytes, so maybe it's fine to leave that to the consumer?)

@thomaseizinger (Contributor) left a comment

This makes sense to me. A couple of things:

  • If we just change internals, we don't need a changelog entry.
  • You can't change generated code.
  • I am open to changing the API if we keep it as impl Into<Bytes> (see the sketch below).
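For illustration, a minimal sketch of what an impl Into<Bytes> signature could look like (publish here is a hypothetical stand-in, not the actual gossipsub API):

```rust
use bytes::Bytes;

// Hypothetical publish-style function, not the actual gossipsub API.
fn publish(data: impl Into<Bytes>) {
    // If the caller already holds Bytes, this conversion is free; a
    // Vec<u8> is moved into Bytes without copying its contents.
    let data: Bytes = data.into();
    let _ = data; // ... hand off to the message pipeline ...
}

fn main() {
    publish(vec![1u8, 2, 3]); // Vec<u8> callers keep working
    publish(Bytes::from_static(b"hello")); // Bytes callers avoid a copy
}
```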

protocols/gossipsub/src/generated/gossipsub/pb.rs (outdated, resolved)
@joshuef (Contributor, Author) commented Oct 28, 2023

Thanks @thomaseizinger.

I've updated the PR to:

  • use Into<Bytes>
  • add a changelog entry, as this changes the API
  • remove the changes to the protobuf files and update the code elsewhere accordingly 👍

@thomaseizinger (Contributor) left a comment

Thanks! Two minor comments :)

protocols/gossipsub/src/behaviour.rs (outdated, resolved)
protocols/gossipsub/CHANGELOG.md (outdated, resolved)
@joshuef (Contributor, Author) commented Oct 29, 2023

Okay, updated as per the feedback 🙇.

@mxinden (Member) commented Oct 29, 2023

Thank you for the work here.

> Not sure how desirable this is for main, but we've been having quite a bit of memory usage coming from gossip. We're not transferring wildly large data in the messages either (a few KBs).

Would you mind sharing some numbers before and after? E.g. a heaptrack profile of a node running with and without this patch.

@joshuef (Contributor, Author) commented Oct 29, 2023

Abbbbsolutely. Literally doing a bit of heaptrackery right now. 👍

Thank you all for the work here. I'm happy to get wee PRs in where I can!

@thomaseizinger (Contributor) left a comment

Great! A few more suggestions after looking at it more closely now :)

protocols/gossipsub/CHANGELOG.md (outdated, resolved)
examples/chat/src/main.rs (outdated, resolved)
protocols/gossipsub/src/behaviour.rs (outdated, resolved)
protocols/gossipsub/src/protocol.rs (outdated, resolved)
protocols/gossipsub/src/transform.rs (resolved)
protocols/gossipsub/src/types.rs (resolved)
@joshuef (Contributor, Author) commented Oct 30, 2023

Good catches. Updated 👍 🙇

@joshuef (Contributor, Author) commented Oct 30, 2023

@mxinden

Okay, so here is main-no-libp2p-bytes-heaptrack.safenode.4817.zst.zip; looking at allocations there (specifically searching for gossip), there are a gooood amount.

This is roughly the same load in both instances (~1k nodes, the same folders being uploaded, and so the same quantity of gossip msgs being generated). I've searched through the heaptracks (we have 20 nodes across the network here heaptracking) for worst-case examples.

And here it is with the 3 PRs applied together: libp2p-bytes-updates-heaptrack.safenode.6027.zst.zip

The latter case actually has marginally more data going through it, as with main the network degrades to become unusable after not too long.

This is where I'm comparing allocations. Total memory used/leaked etc. is about the same in both instances, as you'll see. But I think all the extra allocations cause a fair amount of load that's crippling us here (as we're perhaps sending much more over gossip than was originally envisaged). Allocation-wise, though, the Bytes code performs around a third as many, it seems, and is performing as normal for us now 💪

[Screenshot 2023-10-30 at 12:04:43: heaptrack allocation comparison]

@thomaseizinger changed the title from "fix(gossip): convert gossip data to Bytes." to "feat(gossipsub): use Bytes to cut down on allocations during message processing" on Oct 31, 2023
@thomaseizinger (Contributor) left a comment

Great work, thank you!

@thomaseizinger changed the title from "feat(gossipsub): use Bytes to cut down on allocations during message processing" to "feat(gossipsub): use Bytes to cut down on allocations" on Oct 31, 2023
@thomaseizinger (Contributor):
You could look into patching https://github.com/tafia/quick-protobuf to use Bytes, which would likely allow for even fewer allocations.

@jxs (Member) commented Nov 1, 2023

Hi, some Lighthouse nodes have also been OOM'ing with the following jemalloc memory dump:

[Image: jemalloc memory profile, je_prof3]

which is probably due to forward_msg cloning the Rpc message for each connection handler (in the case of Lighthouse it's usually ~100 connections).
I created a messy hotfix (it touches the generated protobuf code) which Arcs the message, just to confirm it helps ease the memory usage. I can rework the fix and submit it as a proper PR, which along with this one should help cut the allocations.
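Roughly the idea of that hotfix, as a standalone sketch (RpcMessage and fan_out are illustrative stand-ins, not the real gossipsub types):

```rust
use std::sync::Arc;

// Illustrative stand-in for the real RPC message type.
struct RpcMessage {
    data: Vec<u8>,
}

fn fan_out(message: RpcMessage, num_handlers: usize) {
    // Wrap the message once; every handler then shares it.
    let shared = Arc::new(message);
    for _ in 0..num_handlers {
        // Cloning an Arc bumps a reference count instead of
        // duplicating the message data for each connection.
        let handle = Arc::clone(&shared);
        let _ = handle.data.len(); // ... queue on a connection handler ...
    }
}

fn main() {
    fan_out(RpcMessage { data: vec![0u8; 1024] }, 100);
}
```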

@thomaseizinger (Contributor):
Thanks for weighing in here @jxs. @mxinden and I discussed this yesterday and we have some doubts about the optimisations presented in this PR. The issue is that the generated protobuf structs contain owned data, and thus, sooner or later, we will allocate for each message that is being sent.

This can only be fixed if we start to use borrowed data for the protobuf structs, which would actually be great. I think the last time we tried this, it didn't work well with asynchronous-codec, but the latest version now uses GATs to allow for encoding of borrowed data, meaning this could be an option.
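As a rough sketch of the GAT idea (paraphrased; not necessarily asynchronous-codec's exact trait definition), the lifetime parameter on the item type lets encode accept borrowed data, so a protobuf struct could reference the payload rather than own it:

```rust
use bytes::BytesMut;

// Paraphrased sketch of a GAT-based encoder trait; not necessarily the
// exact asynchronous-codec definition.
trait Encoder {
    // The item may borrow data for the duration of the encode call.
    type Item<'a>;
    type Error;

    fn encode(&mut self, item: Self::Item<'_>, dst: &mut BytesMut) -> Result<(), Self::Error>;
}

// A message type that borrows its payload instead of owning it.
struct BorrowedMessage<'a> {
    data: &'a [u8],
}

struct MessageEncoder;

impl Encoder for MessageEncoder {
    type Item<'a> = BorrowedMessage<'a>;
    type Error = std::io::Error;

    fn encode(&mut self, item: Self::Item<'_>, dst: &mut BytesMut) -> Result<(), Self::Error> {
        // No intermediate owned buffer: write the borrowed payload directly.
        dst.extend_from_slice(item.data);
        Ok(())
    }
}
```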

@joshuef With the above reasoning, I cannot explain why you are seeing a performance improvement with this PR. Are you sure the two heaptrack snapshots show the same workload? They also show the ConnectionHandler::poll function as the source of the allocations, but after looking at the code, I can't see where we would actually be allocating there.

@joshuef (Contributor, Author) commented Nov 1, 2023

@thomaseizinger within this PR there are no clone sites, but outwith it there are some RawMessage clones that would otherwise duplicate the entire data field.

It also helps ease any handling of the messages for the consumer too.

(I agree though, if we got this into the lower layers that'd be even better)

@thomaseizinger (Contributor):
> @thomaseizinger within this PR there are no clone sites, but outwith it there are some RawMessage clones that would otherwise duplicate the entire data field.

But these sites don't show up in your heaptrack screenshots? Perhaps I am misreading them, but they mostly show the codec and the ConnectionHandler, not the behaviour.

That is what makes me a bit doubtful that we are optimising the right thing here.

I am definitely on-board with optimising memory usage. I just want to understand and see the improvement :)

@thomaseizinger (Contributor):
> [Screenshot 2023-10-30 at 12:04:43: heaptrack allocation comparison]

@joshuef Looking in more detail at the code and these screenshots, I am pretty confident I now know where these allocations are coming from. In the current implementation of quick-protobuf-codec, we allocate a Vec for each message, serialize into this Vec, and then pass it to the varint codec, which writes it into the BytesMut of the Framed buffer. From there, they get written into the stream.

Framed is designed to reuse a buffer between writes, but we don't make use of this and instead temporarily allocate a completely new buffer for writing the message + varint.

I think I improved on this greatly in #4782. Now we write to the BytesMut directly, which should be reused throughout the lifetime of a Framed. In gossipsub, this Framed is actually long-lived (because the stream is long-lived) and thus there should only be a single allocation for the entire stream during writing (modulo resizings if the current buffer is not big enough for the message).
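To make the contrast concrete, a simplified sketch (not the actual quick-protobuf-codec code; the varint framing is elided):

```rust
use bytes::{BufMut, BytesMut};

// Old approach (simplified): serialize into a fresh temporary Vec for
// every message, then copy it into the Framed buffer.
fn encode_with_temp_vec(payload: &[u8], dst: &mut BytesMut) {
    let tmp: Vec<u8> = payload.to_vec(); // new allocation per message
    dst.put_slice(&tmp);
}

// New approach (simplified): write into the long-lived BytesMut owned
// by Framed, which is reused across messages and only grows when needed.
fn encode_in_place(payload: &[u8], dst: &mut BytesMut) {
    dst.reserve(payload.len()); // reallocates only if capacity is short
    dst.put_slice(payload);
}
```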

Currently, these allocations happen in ConnectionHandler::poll because there we call Framed::start_send, which internally delegates to the codec, which would allocate the temporary Vec.

It would be great if you could test the above PR and check whether the number of allocations goes down :)

@joshuef (Contributor, Author) commented Nov 2, 2023

> Looking in more detail at the code and these screenshots

Ah right, yeh, I was just focussing on the gossip allocations there. You have the two heaptracks in their entirety in the same message there, if you want to deep-dive into the actual data.

For me it's about being able to safely handle/consume/clone these as a user, and also within libp2p, I think. The RawMessage and so on are cloned about in the libp2p code, and Records (in the Kad PR) are similarly used in libp2p and cloned a fair bit in our own code.

> It would be great if you could test the above PR and check whether the number of allocations goes down :)

I will absolutely have at that now. I assume in isolation from these PRs?

I suspect it'll get allocations down, but we'll still want these changes or similar so that clones of e.g. RawMessage or Record are not so allocation-heavy?

@thomaseizinger (Contributor):
> For me it's about being able to safely handle/consume/clone these as a user, and also within libp2p, I think. The RawMessage and so on are cloned about in the libp2p code, and Records (in the Kad PR) are similarly used in libp2p and cloned a fair bit in our own code.
> [...]
> I suspect it'll get allocations down, but we'll still want these changes or similar so that clones of e.g. RawMessage or Record are not so allocation-heavy?

We can definitely land these changes, I'd just like to see them have an impact first :)
I doubt we will see big improvements with the current changes because we are constructing the proto::RPC instances so early in the process.

> Looking in more detail at the code and these screenshots

> Ah right, yeh, I was just focussing on the gossip allocations there. You have the two heaptracks in their entirety in the same message there, if you want to deep-dive into the actual data.

I'd love to, but ironically heaptrack segfaults when I try to load these files 🙃

@joshuef (Contributor, Author) commented Nov 2, 2023

Also, I was just poking about in quick-protobuf w/r/t Bytes, but I don't think it makes sense there. The generated code would require the consuming app to have a Bytes dep, which I thiiiink is probably pretty poor UX, and likely why they've gone with Cow. I think a containing type that minimises the actual allocations until the last moment (which, as I think you noted, we can avoid with Cow::Borrowed) makes sense 👍
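For reference, a small sketch of the Cow pattern under discussion (ProtoMessage is a made-up stand-in for a generated quick-protobuf type):

```rust
use std::borrow::Cow;

// Made-up stand-in for a generated quick-protobuf message type.
struct ProtoMessage<'a> {
    data: Cow<'a, [u8]>,
}

fn main() {
    let payload: Vec<u8> = vec![1, 2, 3];

    // Borrowing defers any allocation: no copy happens here.
    let msg = ProtoMessage { data: Cow::Borrowed(&payload) };

    // A copy is made only at the last moment, if owned data is required.
    let owned: Vec<u8> = msg.data.into_owned();
    assert_eq!(owned, payload);
}
```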

@thomaseizinger (Contributor):
> Also, I was just poking about in quick-protobuf w/r/t Bytes, but I don't think it makes sense there. The generated code would require the consuming app to have a Bytes dep, which I thiiiink is probably pretty poor UX, and likely why they've gone with Cow. I think a containing type that minimises the actual allocations until the last moment (which, as I think you noted, we can avoid with Cow::Borrowed) makes sense 👍

You are right! I came to the same conclusion after experimenting a bit just now.

mergify bot commented Apr 15, 2024

This pull request has merge conflicts. Could you please resolve them @joshuef? 🙏
