Fix broken bridge stats reporting & add filter-clients flag #1493
Conversation
portal-bridge/src/cli.rs
Outdated
impl From<&Enr> for ClientType {
    fn from(enr: &Enr) -> Self {
        if let Some(client_id) = UtpEnr(enr.clone()).client() {
Since this doesn't have anything to do with uTP, this usage stands out as strange. Should it be renamed? Or should we extract the ENR parsing logic into a new, more generic struct?
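One way to avoid the uTP-flavored wrapper would be a free function (or small newtype) that reads the client id straight off the ENR's key-value pairs. A minimal, std-only sketch of the idea, where the `Enr` is modeled as a plain map and the client key and `ClientType` variants are assumptions, not trin's actual definitions:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins: the real `Enr` comes from the `enr` crate and
// `ClientType` lives in portal-bridge; the key name is assumed.
const ENR_PORTAL_CLIENT_KEY: &str = "c";

#[derive(Debug, PartialEq)]
enum ClientType {
    Trin,
    Ultralight,
    Unknown,
}

/// Parse the client id out of an ENR's key-value pairs directly,
/// without going through a uTP-specific wrapper type.
fn client_type(enr_kv: &HashMap<String, Vec<u8>>) -> ClientType {
    let Some(raw) = enr_kv.get(ENR_PORTAL_CLIENT_KEY) else {
        return ClientType::Unknown;
    };
    // The prefix conventions below are illustrative only.
    match String::from_utf8_lossy(raw).as_ref() {
        s if s.starts_with('t') => ClientType::Trin,
        s if s.starts_with('u') => ClientType::Ultralight,
        _ => ClientType::Unknown,
    }
}

fn main() {
    let mut enr = HashMap::new();
    enr.insert(ENR_PORTAL_CLIENT_KEY.to_string(), b"t 0.1.0".to_vec());
    assert_eq!(client_type(&enr), ClientType::Trin);
    assert_eq!(client_type(&HashMap::new()), ClientType::Unknown);
    println!("ok");
}
```

A helper like this keeps `From<&Enr> for ClientType` trivially implementable without cloning into `UtpEnr`.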
Enrs shouldn't hold client_id in a month or two, so whatever is chosen shouldn't be the worst.
Now we're really going off on a tangent, but since we started: I can see some downsides to this direction of pinging for metadata; for instance, we'd have a new funny class of peer that is connected, but we don't have enough information to interact with them yet.
> but we don't have enough information to interact with them yet.
I am not sure what this means. We shouldn't need to know which implementation we are talking to in order to interop with other clients.
> We shouldn't need to know which implementation we are talking to in order to interop with other clients.
We shouldn't, but clearly it is the most expedient option at times, like in this PR. I'm all for eventually moving to peer scoring and all that; it will just be more painful to use this PR's approach until then (and maybe there are other similar scenarios yet to be discovered).
Agreed, using `UtpEnr` here doesn't track that well. `Enr` is a "foreign" type, so it's not so simple to implement `client()` directly on it. I just copied the logic over since this might change soon, but regardless of where we go with pings in the future, we'll be able to recognize that this logic also needs updating if we remove `ENR_PORTAL_CLIENT_KEY`.
@@ -256,7 +278,7 @@ impl Network {
                None
            }
        })
-        .take(ENRS_RESPONSE_LIMIT)
+        .take(self.enr_offer_limit)
So I guess whatever order `peers` is in is the order that we tend to prefer offering content to. Maybe we should shuffle (or `partial_shuffle`) before taking here, in order to push to a broader mix of peers.
Well, the order in this case should always be the same: sorted by closest to the content id. Each census has the same view of the network (eg 100% of peers), so each census will be gossiping the same content to the same peers. Which is not great. But eventually, the state bridges (as we "scale" horizontally) will only gossip content that is close to their own node id. In that case, we lose this redundancy but keep the desired feature of gossiping content to its closest nodes on the network (which would be lost if we shuffled). I guess `partial_shuffle` is an option... and would mitigate the grey area that we're currently in (aka until we're horizontally scaled). But I'm not sure it's that important overall... though I could be convinced otherwise.
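A middle ground between the two positions above is to shuffle only within a pool of the closest peers before taking the offer limit: recipients still come from the nearest part of the network, but repeated rounds hit a broader mix. A std-only sketch (real code would use `rand`'s `partial_shuffle`; the tiny LCG and `u32` peer ids here are stand-ins):

```rust
/// Tiny deterministic LCG so the sketch stays std-only; a real
/// implementation would use a proper RNG from the `rand` crate.
struct Lcg(u64);
impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Fisher-Yates over only the first `pool` entries: offers are still
/// drawn from the `pool` closest peers, but the `limit` we take is a
/// random subset, spreading gossip across a broader mix of peers.
fn take_shuffled(peers: &mut Vec<u32>, pool: usize, limit: usize, rng: &mut Lcg) -> Vec<u32> {
    let pool = pool.min(peers.len());
    for i in (1..pool).rev() {
        let j = (rng.next() as usize) % (i + 1);
        peers.swap(i, j);
    }
    peers.iter().take(limit).copied().collect()
}

fn main() {
    // Hypothetical peer ids, already sorted closest-to-content-id first.
    let mut peers: Vec<u32> = (0..16).collect();
    let out = take_shuffled(&mut peers, 8, 4, &mut Lcg(42));
    assert_eq!(out.len(), 4);
    // Every chosen peer still comes from the 8 closest.
    assert!(out.iter().all(|&p| p < 8));
    println!("{out:?}");
}
```

With `pool == limit` this degenerates to the current closest-first behavior, so the bias is tunable.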
If Ultralight can't perform, maybe they are not ready for mainnet.
I think a long-term solution to this problem will be a reputation system in which, instead of banning an implementation outright when a client sends us errors, we put them on a block list for X period of time. In any case, it is fairly trivial to write a script to populate all of our State node databases with the first 1 million blocks locally. So even if bridging doesn't work out before Devcon, we will still be safe.
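The "block list for X period of time" idea can be sketched as a per-peer strike counter with a cooldown, rather than a filter keyed on implementation name. All names and thresholds below are hypothetical, not part of this PR:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Sketch of a temporary block list: after `max_strikes` errors, a peer
/// is skipped for `cooldown` instead of being banned by client type.
struct PeerScorer {
    strikes: HashMap<String, u32>,
    blocked_until: HashMap<String, Instant>,
    max_strikes: u32,
    cooldown: Duration,
}

impl PeerScorer {
    fn new(max_strikes: u32, cooldown: Duration) -> Self {
        Self {
            strikes: HashMap::new(),
            blocked_until: HashMap::new(),
            max_strikes,
            cooldown,
        }
    }

    /// Record one error from `peer`; trip the cooldown at the threshold.
    fn record_error(&mut self, peer: &str) {
        let count = self.strikes.entry(peer.to_string()).or_insert(0);
        *count += 1;
        if *count >= self.max_strikes {
            self.blocked_until
                .insert(peer.to_string(), Instant::now() + self.cooldown);
            *count = 0; // clean slate once the cooldown expires
        }
    }

    fn is_blocked(&self, peer: &str) -> bool {
        self.blocked_until
            .get(peer)
            .map_or(false, |t| Instant::now() < *t)
    }
}

fn main() {
    let mut scorer = PeerScorer::new(3, Duration::from_secs(600));
    scorer.record_error("enr:-peer-a");
    scorer.record_error("enr:-peer-a");
    assert!(!scorer.is_blocked("enr:-peer-a"));
    scorer.record_error("enr:-peer-a");
    assert!(scorer.is_blocked("enr:-peer-a"));
    println!("ok");
}
```

The census could consult `is_blocked` when assembling the peer list, which keeps the policy per-node rather than per-implementation.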
I really don't like how we need a flag to block connections with a certain implementation.
If that is the case, I think
- they shouldn't run their clients on mainnet, because they aren't ready and it is in bad faith
- or we build our client to be resilient to bad actors, which I don't think this PR does.
I think it would be easier to build a client resistant to bad actors than to ship a short-term solution like this, especially as we aren't in a rush to meet deadlines; if we need the data seeded, we can do it fairly easily for a demo. This is a short-term solution and I have trouble understanding why we need it. I looked at the logs of the bridges recently and they are impossible to read, so there is potentially value in this PR's logging update. I have seen how much of an issue Ultralight is for our bridges, but I am not sure this is the solution to the problem.
1 million blocks is just for a Devcon demo. A question I ask myself is: are we building towards being able to sustain a network with the latest state as fast as possible?
Currently we are running 3 state bridges which are gossiping redundant data. Maybe to speed up bridging we can implement the horizontal bridging architecture next. This would also be very high value, as we need it for latest-state bridging, but it would help us bridge blocks 0-1 million faster as well!
Anyways, the PR overall looks good. I brain-dumped my views a bit; sharing ideas or brainstorming is sometimes hard remotely. If anybody has ideas on what the next step for the bridge should be, feel free to contribute as well. Hopefully we can build the best bridge possible!
I think the current situation is a little unfortunate, but I definitely see it as an opportunity to build more robust software.
I think
- horizontal-scaling bridge architecture
- smarter offer patterns
- a resilient bridge which takes into account how other nodes perform on the network when gossiping
- content-size awareness
- etc
is definitely the direction we want to head in; I probably missed something.
Another thought on the bridge: instead of the bridge running a Trin executable, use Trin in the bridge as a library. Then we could potentially build the bridge as an extension of Trin, or reuse Trin's routing table and remove the storage limit. By doing this we would benefit from
Just wanted to add something regarding client filtering.
I believe this is good only as a temporary solution. We should look into making the census more robust, which we can do somewhat easily because we keep track of the entire network anyway.
portal-bridge/src/cli.rs
Outdated
    value_parser = client_parser,
    default_value = "",
)]
pub filter_clients: Arc<Vec<ClientType>>,
nit: Clearly not related to this PR (as it is not the only place that does this) and you can ignore it, but why do we wrap config fields in `Arc`? Is it so that we can cheaply copy them? I don't think there is a huge performance saving, as these arguments shouldn't be that big and they shouldn't be copied all the time (only at startup, during initialization). Alternatively, we could wrap the entire `BridgeConfig` in `Arc`. Asking this because without `Arc`, I believe we don't need custom parser logic (i.e. `client_parser`), and we could just use `value_delimiter`.
Well, I'm not sure exactly what causes the problem, but it seems to come from how clap handles the custom parser method. The compiler doesn't complain, but when you run the program (without using `Arc`), you get the following error:
Mismatch between definition and access of `filter_clients`. Could not downcast to portal_bridge::cli::ClientType, need to downcast to alloc::vec::Vec<portal_bridge::cli::ClientType>
However, in this case (`ClientType`), since we really don't need any custom parsing logic (unlike `subnetwork`), we can just use `value_delimiter`, as you pointed out.
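For reference, the custom parser being replaced is essentially a comma split plus a per-item conversion, which is what clap's `value_delimiter = ','` does for a `Vec` field once the item type implements `FromStr` (or `ValueEnum`). A std-only sketch of that parsing; the `ClientType` variants and the empty-string default are assumptions mirroring the flag above, not trin's exact definitions:

```rust
use std::str::FromStr;

#[derive(Debug, PartialEq)]
enum ClientType {
    Trin,
    Ultralight,
}

impl FromStr for ClientType {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_lowercase().as_str() {
            "trin" => Ok(ClientType::Trin),
            "ultralight" => Ok(ClientType::Ultralight),
            other => Err(format!("unknown client: {other}")),
        }
    }
}

/// Roughly what `value_delimiter = ','` gives for free: split the raw
/// argument and parse each item, failing on the first bad entry.
fn parse_filter_clients(arg: &str) -> Result<Vec<ClientType>, String> {
    if arg.is_empty() {
        return Ok(Vec::new()); // mirrors default_value = ""
    }
    arg.split(',').map(str::parse).collect()
}

fn main() {
    let parsed = parse_filter_clients("trin,ultralight").unwrap();
    assert_eq!(parsed, vec![ClientType::Trin, ClientType::Ultralight]);
    assert!(parse_filter_clients("").unwrap().is_empty());
    assert!(parse_filter_clients("foo").is_err());
    println!("ok");
}
```

With `FromStr` in place, the clap field can drop both the `Arc` and `client_parser` and rely on the delimiter alone.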
Clearly, there is some discussion that needs to happen about "peer scoring" (which we can have on Monday during sync). I agree that my approach of bluntly filtering out certain client implementations is 1. not nice, 2. not smart, and 3. not sustainable. But it is immediately useful (particularly in testing trin's concurrency via the bridge), so I'm gonna go ahead with the merge, with the full understanding that this might be reversed post-conversation.
What was wrong?
Using OfferReport as a global offer report was not so good...
How was it fixed?
I introduced a new, simpler type to manage "global" offer reporting.
To-Do