
Autosharding v1 #1846

Closed · 3 tasks done
SionoiS opened this issue Jul 6, 2023 · 14 comments · Fixed by #1857
Labels: E:1.2: Autosharding for autoscaling (see https://github.com/waku-org/pm/issues/65 for details)

SionoiS commented Jul 6, 2023

Planned start date: 28 Jun 2023
Due date: 18 August 2023

Summary

Design and implementation of autosharding. Autosharding is the automatic assignment of content topics to shards.
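For illustration, a minimal sketch of such a mapping is shown below. It is not the scheme from the RFC: the hash function, the shard count, and the "/application/version/name/encoding" content topic format used here are assumptions.

```nim
# Illustrative only: derive a shard index from a content topic's
# application and version fields so that all content topics of one
# application land on the same shard.
import std/[hashes, options, strutils]

const NumShards = 8  # assumed number of shards in the cluster

type ContentTopic = object
  application, version, name, encoding: string

proc parseContentTopic(topic: string): Option[ContentTopic] =
  ## Expects the "/application/version/name/encoding" format.
  let parts = topic.strip(chars = {'/'}).split('/')
  if parts.len != 4:
    return none(ContentTopic)
  some(ContentTopic(application: parts[0], version: parts[1],
                    name: parts[2], encoding: parts[3]))

proc shardFor(ct: ContentTopic): int =
  ## Deterministic assignment: hash only the fields that identify the app,
  ## so every content topic of an application maps to the same shard.
  let h = hash(ct.application & "/" & ct.version)
  ((h mod NumShards) + NumShards) mod NumShards

when isMainModule:
  let ct = parseContentTopic("/toy-chat/2/huilong/proto")
  if ct.isSome:
    echo "shard index: ", shardFor(ct.get())
```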

Acceptance Criteria

  • The autosharding function distributes content topics to shards.
  • The ENR of nodes using autosharding is created/updated with the shard info.
  • Nodes can be configured for content topics and/or pubsub topics.
  • The FILTER protocol code has been updated for autosharding.
  • The LIGHTPUSH protocol code has been updated for autosharding.
  • The RFC has been updated with the new specifications.

Tasks

RAID (Risks, Assumptions, Issues and Dependencies)

Autosharding will affect various other protocols and may have unintended consequences.

#1846 (comment)

chaitanyaprem commented Aug 3, 2023

@SionoiS, @alrevuelta I was going through the peer manager code and noticed that there seems to be a limit of 1 connection towards a peer as per libp2p.
I am thinking we will have to improve peer management to prune/manage connections to peers with shards in mind as well. Basically, more intelligent pruning.
Otherwise a node may end up with no connections to a shard at all once the connection limit is reached.
There are 2 ways to address this problem:

  1. Open a separate connection towards the same peer for each shard. This would be simpler, as we could just enforce peer/connection limits per shard. But I have noticed that in nwaku there is a setting to limit 1 connection per peer:
    const MaxConnectionsPerPeer* = 1
    Any specific reason for limiting this?
  2. Another alternative is to share a single connection per peer for communicating across multiple shards. In this case we may have to ensure that a minimum number of relay connections is maintained per shard the node is interested in. Not sure how we would decide the percentage of peers to be maintained per shard, though.

Not sure if this is the right place to bring this up, because this issue currently exists with static sharding as well.
But since autosharding is being done, I thought of bringing it up here.

Thoughts?

SionoiS commented Aug 3, 2023

That is a good point @chaitanyaprem! In theory it's possible; my question would be how often does that happen in practice?

I don't think we want multiple connections per peer. Although, I'm not as familiar with this part of the code as @alrevuelta is.

I'll have to investigate further.

SionoiS changed the title from "feat: Autosharding v1" to "[Milestone] Autosharding v1" on Aug 4, 2023
SionoiS commented Aug 4, 2023

Weekly Update

  • achieved: feedback/update cycles for FILTER & LIGHTPUSH
  • next: New fleet, updating ENR from live subscriptions and merging
  • blocker: Architecturally it seems difficult to send the info to Discv5 from JSON-RPC for the Waku app.

chaitanyaprem commented Aug 7, 2023

> That is a good point @chaitanyaprem! In theory it's possible; my question would be how often does that happen in practice?

Yeah, even I couldn't think of a possible scenario. But I am guessing it would depend on the order in which peer connections get established. Since that order is hard to determine for a node, this can happen on any node while coming up, restarting, or just during regular peer connection pruning itself.

alrevuelta commented:

> I am thinking we will have to improve peer management to prune/manage connections to peers with shards in mind as well. Basically, more intelligent pruning.

Yup. The minimum we would need is to only connect to peers in shards that we are interested in. We can know this using the ENR field indicating the shards the node is subscribed to. Note that this can be faked, but AFAIK we can double-check it once the connection is established. In other words:

  • Discover nodes
  • Get the shards each node is part of (using the ENR)
  • If the node is part of a shard we are interested in
    • Connect to the node (unless other existing criteria are not fulfilled, e.g. in/out ratio, score, etc.)
  • If not
    • Skip

Note that this has to be bypassed if someone configures the node with all-shards (or whatever name). In this case we will attempt to connect to all nodes, but trying to keep a balance between shards.

nimbus does something similar here

Since subscriptions are dynamic (e.g. one can unsubscribe via the RPC) we need some pruning. E.g. we are subscribed to a given content topic (that maps to gossipsub topic X). Using the RPC we unsubscribe from that content topic, and we don't have any active subscription in topic X. In that case we need to prune the relay connections to those nodes.
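A rough sketch of this discovery-time filtering, using stand-in types rather than nwaku's actual peer manager or ENR API:

```nim
# Sketch only: DiscoveredPeer and ShardId are stand-ins, not nwaku types.
import std/[sets, sequtils]

type
  ShardId = uint16
  DiscoveredPeer = object
    id: string                # stand-in for a libp2p PeerId
    shards: HashSet[ShardId]  # shards advertised in the peer's ENR

proc shouldConnect(peer: DiscoveredPeer, ourShards: HashSet[ShardId],
                   allShards: bool): bool =
  ## Connect when we follow every shard, or when the peer advertises at
  ## least one shard we are interested in. Since the ENR can be faked,
  ## the shard list still has to be re-checked after connecting, and the
  ## usual criteria (in/out ratio, score, etc.) still apply before dialing.
  if allShards:
    return true
  peer.shards.intersection(ourShards).len > 0

proc selectPeers(discovered: seq[DiscoveredPeer],
                 ourShards: HashSet[ShardId],
                 allShards = false): seq[DiscoveredPeer] =
  discovered.filterIt(shouldConnect(it, ourShards, allShards))
```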

> But I have noticed that in nwaku there is a setting to limit 1 connection per peer

I don't see any reason to change this for now. In any case, IMHO this falls out of scope for sharding, so we can reassess it in another milestone.

@SionoiS Can we add this peer management task to this milestone? Without it autosharding won't really be complete.

Also, as discussed during the offsite, we have the concept of "altruistic sharding" (we need a better name). In other words, your node will be connected to the pubsub topics you are interested in (derived from the content topics you are subscribed to) BUT also to one (?) extra random (?) shard that will most likely keep changing (this is the altruistic shard, because you are not really interested in that one). This is to ensure good coverage of the shards. Still to be designed. Adding it here: #1892. Perhaps it should be part of this sharding milestone @SionoiS?

SionoiS commented Aug 8, 2023

> @SionoiS Can we add this peer management task to this milestone? Without it autosharding won't really be complete.

@alrevuelta Yes, I agree. If we don't fix this, autosharding can fail silently, which is very bad UX.

> Also, as discussed during the offsite, we have the concept of "altruistic sharding" (we need a better name). In other words, your node will be connected to the pubsub topics you are interested in (derived from the content topics you are subscribed to) BUT also to one (?) extra random (?) shard that will most likely keep changing (this is the altruistic shard, because you are not really interested in that one). This is to ensure good coverage of the shards. Still to be designed. Adding it here: #1892. Perhaps it should be part of this sharding milestone @SionoiS?

Nice to have but not required, IMO. Maybe add another autosharding milestone for updates like this, fixes, and a "shard space" redesign?

chaitanyaprem commented:
> Note that this has to be bypassed if someone configures the node with all-shards (or whatever name). In this case we will attempt to connect to all nodes, but trying to keep a balance between shards.
>
> nimbus does something similar here
>
> Since subscriptions are dynamic (e.g. one can unsubscribe via the RPC) we need some pruning. E.g. we are subscribed to a given content topic (that maps to gossipsub topic X). Using the RPC we unsubscribe from that content topic, and we don't have any active subscription in topic X. In that case we need to prune the relay connections to those nodes.

Not just pruning; we may have to relax/increase the number of max connections or relay connections based on the number of shards a node is subscribed to, or rebalance connections across shards.
Consider a special case, not a normal one. Say a node starts with 50 max connections and is subscribed to 4 shards, which works out to 10 connections per shard (assuming each peer only supports 1 shard). If the node then subscribes to a new shard (which none of the existing 40 peers are subscribed to), there won't be any room to open connections to this shard.
Such scenarios can be addressed if we ensure x healthy relay connections are maintained per shard (x between 6 and 12).
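A back-of-the-envelope sketch of that rebalancing idea; the minimum of 4 peers per shard and the limit of 50 connections are illustrative assumptions, not agreed numbers:

```nim
# Sketch only: keep at least a minimum of healthy relay connections on
# every subscribed shard, pruning the most over-connected shards first
# when the global limit leaves no room for a newly subscribed shard.
import std/[algorithm, tables]

const
  MinPeersPerShard = 4   # assumed lower bound for a healthy mesh
  MaxConnections   = 50  # assumed global relay connection limit

proc slotsToFree(connsPerShard: Table[uint16, int]): int =
  ## Connections to prune so every subscribed shard can reach the minimum
  ## without exceeding the global limit. connsPerShard must contain an
  ## entry (possibly 0) for every subscribed shard, including new ones.
  var total, deficit = 0
  for n in connsPerShard.values:
    total += n
    deficit += max(0, MinPeersPerShard - n)
  max(0, total + deficit - MaxConnections)

proc pruneOrder(connsPerShard: Table[uint16, int]): seq[uint16] =
  ## Shards ordered from most to least connected; free slots from the front.
  var byCount: seq[(uint16, int)]
  for shard, n in connsPerShard.pairs:
    byCount.add((shard, n))
  byCount.sort(proc (a, b: (uint16, int)): int = cmp(b[1], a[1]))
  for entry in byCount:
    result.add(entry[0])
```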

> But I have noticed that in nwaku there is a setting to limit 1 connection per peer

> I don't see any reason to change this for now. In any case, IMHO this falls out of scope for sharding, so we can reassess it in another milestone.

That is fine; this can be looked into at a later stage.

> @SionoiS Can we add this peer management task to this milestone? Without it autosharding won't really be complete.
>
> Also, as discussed during the offsite, we have the concept of "altruistic sharding" (we need a better name). In other words, your node will be connected to the pubsub topics you are interested in (derived from the content topics you are subscribed to) BUT also to one (?) extra random (?) shard that will most likely keep changing (this is the altruistic shard, because you are not really interested in that one). This is to ensure good coverage of the shards. Still to be designed. Adding it here: #1892. Perhaps it should be part of this sharding milestone @SionoiS?

As @SionoiS suggested, this can also be taken up as part of a later milestone rather than this one.

SionoiS commented Aug 11, 2023

Weekly Update

  • achieved: many feedback/update cycles for FILTER, LIGHTPUSH, STORE & RFC
  • next: updating ENR for live subscriptions

SionoiS commented Aug 11, 2023

> Not just pruning; we may have to relax/increase the number of max connections or relay connections based on the number of shards a node is subscribed to, or rebalance connections across shards.
> Consider a special case, not a normal one. Say a node starts with 50 max connections and is subscribed to 4 shards, which works out to 10 connections per shard (assuming each peer only supports 1 shard). If the node then subscribes to a new shard (which none of the existing 40 peers are subscribed to), there won't be any room to open connections to this shard.
> Such scenarios can be addressed if we ensure x healthy relay connections are maintained per shard (x between 6 and 12).

At the GossipSub level each topic should have 4-12 full-message peers and many more metadata-only peers. I didn't see a way in the code to distinguish between peers based on topic or type (full/metadata). Having a hard connection limit seems like a bad idea in this context.

Relay is built on top of GossipSub and I don't think we can really modify libp2p. I don't see a good way forward yet. I'm still 🤔

chaitanyaprem commented Aug 14, 2023

> At the GossipSub level each topic should have 4-12 full-message peers and many more metadata-only peers. I didn't see a way in the code to distinguish between peers based on topic or type (full/metadata). Having a hard connection limit seems like a bad idea in this context.

My mistake, I mentioned 6-12; agreed that it has to be 4-12 full-message peers.
Valid point, metadata peers slipped my mind. Agreed that if we have to enforce connection limits, we should probably do so only for full-message peers, but also have some sort of limit for metadata peers so that the node's resources are not over-utilized.
I do see the list of mesh (full-message) peers stored in the gossipsub object per pubsub topic: https://github.com/status-im/nim-libp2p/blob/d6263bf751f5552eadb236c51f053f35d59e376f/libp2p/protocols/pubsub/gossipsub/types.nim#L156C116-L156C116
Not sure how to get the metadata-only peers.
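For illustration, with stand-in state rather than nim-libp2p's actual GossipSub object, the metadata-only peers for a topic could be derived as the subscribed peers that are not in that topic's mesh:

```nim
# Sketch only: the mesh holds the full-message peers per topic; other
# peers subscribed to the topic only exchange gossip metadata
# (IHAVE/IWANT) with us.
import std/[sets, tables]

type
  PeerId = string  # stand-in for a libp2p PeerId
  GossipState = object
    mesh: Table[string, HashSet[PeerId]]        # topic -> full-message peers
    subscribers: Table[string, HashSet[PeerId]] # topic -> all subscribed peers

proc fullMessagePeers(g: GossipState, topic: string): HashSet[PeerId] =
  g.mesh.getOrDefault(topic)

proc metadataOnlyPeers(g: GossipState, topic: string): HashSet[PeerId] =
  ## Subscribed to the topic but not grafted into our mesh for it.
  g.subscribers.getOrDefault(topic).difference(g.mesh.getOrDefault(topic))
```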

For now, it may be OK to apply connection limits only to full-message peerings. We may have to run larger node simulations to see what issues we face with the current approach and to arrive at a number for metadata peers.
Will keep thinking on this as well.

@alrevuelta Thoughts?

> Relay is built on top of GossipSub and I don't think we can really modify libp2p. I don't see a good way forward yet. I'm still 🤔

I don't think we need to modify libp2p, but rather apply our limits on top of it.

SionoiS commented Aug 14, 2023

> I don't think we need to modify libp2p, but rather apply our limits on top of it.

I talked with @alrevuelta and we can indeed modify nim-libp2p to get easier access to the data we need.

I'm still not sure about having a limit. GossipSub manages 4-12 full-message peers per topic; it was designed this way because that is what works best.

Maybe only limiting metadata peers makes more sense?
I still don't fully understand how and with whom a node will connect. I have some more code delving to do...

chaitanyaprem commented:
> I don't think we need to modify libp2p, but rather apply our limits on top of it.

> I talked with @alrevuelta and we can indeed modify nim-libp2p to get easier access to the data we need.

Ah, I keep forgetting we are maintaining nim-libp2p. In that case yes; for go-waku we will have to implement it outside libp2p.

> I'm still not sure about having a limit. GossipSub manages 4-12 full-message peers per topic; it was designed this way because that is what works best.

True; then for what other reason do we have this relay connection limit check? We need to understand the reason behind it. Also, we need to check whether gossipsub tears down connections beyond the specified limit of full-message peers, and we need to verify the behaviour for fan-out as well.

> Maybe only limiting metadata peers makes more sense? I still don't fully understand how and with whom a node will connect. I have some more code delving to do...

This would definitely constrain resource usage; otherwise a node may get flooded with connections and be unable to protect itself.

SionoiS commented Aug 16, 2023

I removed both the live subscription and shard peer management tasks. These tasks will be part of another milestone or epic.

github-project-automation bot moved this from In Progress to Done in Waku on Aug 17, 2023
SionoiS commented Aug 17, 2023

Weekly Update

achieved: Complete! FILTER, LIGHTPUSH and RFC merged.

fryorcraken added the "E:1.2: Autosharding for autoscaling" label and removed the "E:2023-1mil-users milestone" label on Sep 8, 2023
fryorcraken changed the title from "[Milestone] Autosharding v1" to "Autosharding v1" on Sep 8, 2023