swarm: optimize compute of locally supported protocols #4284
Comments
@thomaseizinger thank you for writing this up. Can you add a summary of why computing the locally supported protocols is slow, i.e. why it needs to be optimized?
Is the flamegraph in #3840 not enough? I don't want to reiterate the entire discussion; I only wanted to track one particular action item in an issue here so it can be picked up by someone.
I propose a fix here: #4291. I believe my changes are functionally equivalent to the algorithm outlined above, but simpler.
The flamegraph shows that
In my eyes we are fixing the symptom before understanding the root cause. In addition, I consider simply introducing a timer to call the expensive part less often a hack. Can we do better? How do we determine the ideal duration of the timer? Is the complexity worth it? Can this be done more simply? I know answering all of this is a lot of work and I appreciate the time you spend on this @alindima and @thomaseizinger.
I don't think it's only about how expensive I agree that having a timer is a hack, but it looked like it was the only proposed solution deemed acceptable. Some of the questions here are still important; I'll see what I can do about them.
@alindima friendly ping. Does this issue still exist for Polkadot? Can you follow up on these questions from above?
Shall we do some synchronous debugging on this issue together? I am happy to do a call this week or next. I tried to reproduce the issue using https://github.com/mxinden/kademlia-exporter/, establishing a couple thousand connections to the IPFS network. Note that I don't think it is a good comparison to Polkadot, given that it has a lot of connections, a lot of connection churn, only minimal data sent per connection and only supports the protocols below:
For what it is worth, I don't see significant time spent in the code region discussed here: rust-libp2p/swarm/src/connection.rs Lines 451 to 452 in c8b5f49
I see 0.34% spent in Attaching the flamegraph for the sake of completeness. Again, I don't think it is representative.
Once we have #4282, we should be able to gather more precise data about where the time is spent, how often the connection is being woken, etc.
Sorry, I haven't yet managed to find the time to invest in this further. It's definitely still an issue for Polkadot, but we think we have bigger fish to fry currently in terms of network overhead. I view this as a performance regression, but even without it, the CPU consumption is very high.
I just discovered another reason why we have a lot of allocations here: rust-libp2p/swarm/src/connection.rs Line 462 in 22f70e1
The contract between the upgrade and the connection is currently
For context, see this discussion: #3840.
This issue captures the idea mentioned in #3840 (reply in thread) as a concrete task to work on.
What we want to achieve is a consistent view of our locally supported protocols by the `ConnectionHandler`. Every state change in the `ConnectionHandler` may change the supported protocols (i.e. what is returned from `listen_protocol`).

As the linked discussion shows, computing this on every poll iteration is quite expensive. Skipping the computation on some iterations risks that the view becomes outdated. The key requirement for the optimization is thus: if we return `Poll::Pending` from the task AND our view is potentially stale, we must ensure that it is eventually recomputed.

The concrete suggestion in the linked issue is:
1. Track a flag `maybe_outdated`.
2. Set it to `false` every time we update the supported protocols.
3. Set it to `true` on every iteration where we decide to skip the computation.
4. If we return `Poll::Pending` and `maybe_outdated` is set to `true`, register a timer to wake us up in 5 seconds.

In addition, I'd suggest that we also compute the difference to our supported protocols upon every inbound stream. We need to collect all protocols anyway at this point, so we might as well use that data efficiently!
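
To make the flag-and-timer idea and the diff on inbound streams more concrete, here is a minimal, std-only sketch. It is not the actual rust-libp2p code: `ProtocolView`, `recompute`, `skip`, `before_pending`, `timer_elapsed` and `RECOMPUTE_DELAY` are hypothetical names, and the timer is modelled as a plain `Instant` deadline instead of a real async timer that wakes the task.

```rust
use std::collections::HashSet;
use std::task::Poll;
use std::time::{Duration, Instant};

/// Delay before a potentially stale view is recomputed (5 seconds, as suggested above).
const RECOMPUTE_DELAY: Duration = Duration::from_secs(5);

/// Hypothetical per-connection bookkeeping; NOT the real `Connection` type.
struct ProtocolView {
    /// Our last known set of locally supported protocols.
    supported: HashSet<String>,
    /// `true` if we skipped a computation since the last update.
    maybe_outdated: bool,
    /// Deadline of the wake-up timer, if one is registered.
    recompute_at: Option<Instant>,
}

impl ProtocolView {
    fn new() -> Self {
        ProtocolView {
            supported: HashSet::new(),
            maybe_outdated: false,
            recompute_at: None,
        }
    }

    /// Update the view and return the sorted `(added, removed)` protocols.
    /// This could also be called on every inbound stream, where the full
    /// protocol list has to be collected anyway.
    fn recompute(&mut self, current: HashSet<String>) -> (Vec<String>, Vec<String>) {
        let mut added: Vec<String> = current.difference(&self.supported).cloned().collect();
        let mut removed: Vec<String> = self.supported.difference(&current).cloned().collect();
        added.sort();
        removed.sort();
        self.supported = current;
        self.maybe_outdated = false; // the view is fresh again
        self.recompute_at = None; // any pending timer is no longer needed
        (added, removed)
    }

    /// Record that this poll iteration skipped the computation.
    fn skip(&mut self) {
        self.maybe_outdated = true;
    }

    /// Call right before returning `Poll::Pending` from the task:
    /// if the view may be stale, make sure a timer is registered.
    fn before_pending(&mut self, now: Instant) -> Poll<()> {
        if self.maybe_outdated && self.recompute_at.is_none() {
            self.recompute_at = Some(now + RECOMPUTE_DELAY);
        }
        Poll::Pending
    }

    /// Whether the registered timer (if any) has elapsed at `now`.
    fn timer_elapsed(&self, now: Instant) -> bool {
        self.recompute_at.map_or(false, |deadline| now >= deadline)
    }
}
```

In this sketch the invariant is: whenever the task goes to sleep with `maybe_outdated == true`, a deadline exists, so the view is eventually recomputed even if nothing else wakes the connection.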
I believe that this algorithm has the following properties:
- `Connection` don't end up re-computing the supported protocols.