
Fix broken bridge stats reporting & add filter-clients flag #1493

Merged: 3 commits merged into ethereum:master from bridge-debug on Sep 27, 2024

Conversation

@njgheorghita (Collaborator) commented on Sep 26, 2024

What was wrong?

  • The changes made in fix: update slow state bridge defaults #1490 (specifically, using OfferReport as a global offer report) didn't work well, so I introduced a new, simpler type to manage "global" offer reporting (a rough sketch of such a type follows this list).
  • When running the bridge locally, it seems to have quite a bit of trouble communicating with ultralight nodes, so I added a flag to filter out specific client types. It isn't necessarily meant to be used on all of our bridges, but maybe on a few to speed up the process.
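
A rough sketch (not the PR's exact code) of such a global offer-report type; the field names and the tracing call are illustrative assumptions, not taken from the diff:

#[derive(Default, Debug)]
struct GlobalOfferReport {
    success: usize,
    declined: usize,
    failed: usize,
}

impl GlobalOfferReport {
    fn report(&self) {
        let total = self.success + self.declined + self.failed;
        if total == 0 {
            return;
        }
        // Log aggregate percentages instead of per-content detail.
        tracing::info!(
            "Global offer report: total={} success={}% declined={}% failed={}%",
            total,
            100 * self.success / total,
            100 * self.declined / total,
            100 * self.failed / total,
        );
    }
}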

How was it fixed?

To-Do

@njgheorghita njgheorghita changed the title Bridge debug Fix broken bridge stats reporting & add filter-clients flag Sep 26, 2024
@njgheorghita njgheorghita force-pushed the bridge-debug branch 2 times, most recently from 0a09161 to 3a306ed on September 26, 2024 19:15
@njgheorghita njgheorghita marked this pull request as ready for review September 26, 2024 19:39
@njgheorghita njgheorghita requested review from morph-dev, KolbyML and carver and removed request for morph-dev and KolbyML September 26, 2024 19:39
portal-bridge/src/bridge/state.rs (review thread resolved, outdated)
portal-bridge/src/bridge/state.rs (review thread resolved)

impl From<&Enr> for ClientType {
    fn from(enr: &Enr) -> Self {
        if let Some(client_id) = UtpEnr(enr.clone()).client() {
Collaborator:

Since this doesn't have anything to do with uTP, this usage stands out as strange. Should it be renamed? Should we extract enr parsing logic into a new more-generic struct?

Member:

ENRs shouldn't hold client_id in a month or two, so whichever option is chosen here, it shouldn't be too bad.

Collaborator:

Now we're really going on a tangent, but since we started: I can see some downsides to this direction of pinging for metadata, like we would have a new funny class of peer that is connected, but we don't have enough information to interact with them yet.

Member:

but we don't have enough information to interact with them yet.

I am not sure what this means. We shouldn't need to know which implementation we are talking to in order to interop with other clients.

Collaborator:

We shouldn't need to know which implementation we are talking to interop with other clients.

We shouldn't, but clearly it is the most expedient option at times, like in this PR. I'm all for eventually moving to peer scoring and all that. It will just be more painful to use this PR's approach until then (and maybe there are other similar scenarios yet to be discovered).

Collaborator (author):

Agreed, using UtpEnr here doesn't track that well. Enr is a "foreign" type, so it's not so simple to implement client() directly on it. I just copied the logic over since this might change soon, but regardless of where we go with pings in the future, we'll be able to recognize that this logic also needs updating if we remove ENR_PORTAL_CLIENT_KEY.
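
For reference, a minimal sketch of the "extract into something more generic" idea from this thread, assuming the Enr, UtpEnr, and ClientType types from this PR and that client() returns Option<String>; the enum variants matched below are illustrative, not the PR's exact definitions:

// Extension trait so client() can hang off the foreign Enr type directly,
// instead of constructing a UtpEnr at each call site.
trait EnrClientExt {
    fn client(&self) -> Option<String>;
}

impl EnrClientExt for Enr {
    fn client(&self) -> Option<String> {
        // Delegate to the existing uTP logic for now, as the PR does.
        UtpEnr(self.clone()).client()
    }
}

impl From<&Enr> for ClientType {
    fn from(enr: &Enr) -> Self {
        match enr.client().as_deref() {
            // Illustrative mapping; the real one depends on the strings
            // clients advertise under ENR_PORTAL_CLIENT_KEY.
            Some(id) if id.contains("trin") => ClientType::Trin,
            Some(id) if id.contains("ultralight") => ClientType::Ultralight,
            _ => ClientType::Unknown,
        }
    }
}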

portal-bridge/src/census.rs (review thread resolved, outdated)
portal-bridge/src/census.rs (review thread resolved, outdated)
@@ -256,7 +278,7 @@ impl Network {
                None
            }
        })
-       .take(ENRS_RESPONSE_LIMIT)
+       .take(self.enr_offer_limit)
Collaborator:

So I guess whatever order peers is in is the order that we tend to prefer offering content to. Maybe we should shuffle (or partial_shuffle) before taking here, in order to push to a broader mix of peers

Collaborator (author):

Well, the order in this case should always be the same: sorted by closest to the content id. Each census has the same view of the network (eg 100% of peers), so each census will be gossiping the same content to the same peers. Which is not great. But eventually, the state bridges (as we "scale" horizontally) will only gossip content that is close to their own node id. In that case, we lose this redundancy but keep the desired property of gossiping content to the closest nodes on the network (which would be lost if we shuffle). I guess partial_shuffle is an option, and it would mitigate the grey area that we're currently in (aka until we're horizontally scaled). But I'm not sure it's that important overall... though I could be convinced otherwise.
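
For illustration, a minimal sketch of the partial_shuffle option using the rand crate, assuming a peers: Vec<Enr> already filtered to interested nodes and the enr_offer_limit added in this PR:

use rand::seq::SliceRandom;

fn pick_offer_targets(mut peers: Vec<Enr>, enr_offer_limit: usize) -> Vec<Enr> {
    let mut rng = rand::thread_rng();
    let limit = enr_offer_limit.min(peers.len());
    // partial_shuffle moves `limit` randomly chosen peers to the front of
    // the slice, trading strict closest-first order for a broader mix of
    // offer targets.
    let (selected, _rest) = peers.partial_shuffle(&mut rng, limit);
    selected.to_vec()
}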

@KolbyML (Member) commented on Sep 26, 2024

If ultralight can't perform, maybe they are not ready for mainnet.

When running the bridge locally, it seems to have quite a bit of trouble communicating with ultralight nodes. I added a flag to filter out specific client types. Not necessarily meant to be used on all of our bridges, but maybe on a few to speed up the process

I think a long-term solution to this problem will be a reputation system in which, instead of banning an implementation if a client sends us errors, we put them on a block list for X period of time.
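
A rough sketch of that block-list idea (not part of this PR), generic over whatever peer identifier the census uses:

use std::collections::HashMap;
use std::hash::Hash;
use std::time::{Duration, Instant};

struct TemporaryBlockList<Id: Eq + Hash> {
    cooldown: Duration,
    blocked_at: HashMap<Id, Instant>,
}

impl<Id: Eq + Hash> TemporaryBlockList<Id> {
    fn new(cooldown: Duration) -> Self {
        Self { cooldown, blocked_at: HashMap::new() }
    }

    // Record a misbehaving peer so it is skipped for the cooldown period.
    fn block(&mut self, peer: Id) {
        self.blocked_at.insert(peer, Instant::now());
    }

    // True while the peer is still inside its cooldown window; expired
    // entries are forgiven and removed.
    fn is_blocked(&mut self, peer: &Id) -> bool {
        match self.blocked_at.get(peer) {
            Some(when) if when.elapsed() < self.cooldown => true,
            Some(_) => {
                self.blocked_at.remove(peer);
                false
            }
            None => false,
        }
    }
}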

In any case, it is fairly trivial to write a script to populate all of our State node databases with the first 1 million blocks locally. So if bridging doesn't work out before Devcon, we will still be safe.

@KolbyML (Member) left a comment:

I really don't like how we need a flag to block connections with a certain implementation.

If that is the case, I think either:

  • they shouldn't run their clients on mainnet, because they aren't ready and it is in bad faith, or
  • we build our client to be resilient to bad actors, which I don't think this PR does.

I think it would be easier to build a client resistant to bad actors than to ship a short-term solution like this, especially as we aren't in a rush to meet deadlines; if we need the data seeded, we can do that fairly easily for a demo. This is a short-term solution and I have trouble understanding why we need it. I looked at the logs of the bridges recently and they are impossible to read, so there is potentially value in the logging update of this PR. I have seen how much of an issue Ultralight is being for our bridges, but I am not sure this is the solution to the problem.

1 million blocks is just for a Devcon demo. A question I ask myself is: are we building towards being able to sustain a network with the latest state as fast as possible?

Currently we are running 3 state bridges which are gossiping redundant data. Maybe to speed up bridging we can implement the Horizontal Bridging architecture next. This would also be very high value as we need it for latest state bridging, but it would help us bridge 0-1 million blocks faster as well!

:shipit: Anyways, the PR overall looks good. I brain-dumped my views a bit; sharing ideas or brainstorming is sometimes hard remotely. If anybody has ideas on what the next step for the bridge should be, feel free to contribute as well. Hopefully we can build the best bridge possible!

I think the current situation is a little unfortunate, but I definitely see it as an opportunity to build more robust software.

I think

  • a horizontally scaling bridge architecture
  • smarter offer patterns
    • a resilient bridge that takes how other nodes perform on the network into consideration when gossiping
    • content-size awareness
    • etc

is definitely a direction we want to head in; I probably missed something.

@KolbyML (Member) commented on Sep 26, 2024

Another thought for the bridge:

Instead of the bridge running a Trin executable, use Trin in the Bridge as a library. Then, we could potentially build the bridge as an extension of Trin or reuse Trin's routing table and remove the storage limit.

By doing this we would benefit from

  • reduced JSON-RPC throughput overhead: encoding big hex strings is very slow, so being able to make calls directly would be more efficient (I believe during our EF monthly team show-off day there was a request for something like this, and it would be useful for the bridge)
  • reduced overhead from maintaining 2 different routing tables
  • I would no longer need a Trin executable in my target folder when running the bridge, which can sometimes be annoying (this is more of a personal one xD)

@morph-dev (Collaborator) left a comment:

Just wanted to add something regarding client filtering.

I believe this is good only as a temporary solution. We should look into making the census more robust, which we can do somewhat easily because we keep track of the entire network anyway.

portal-bridge/src/bridge/state.rs (review thread resolved, outdated)
portal-bridge/src/cli.rs (review thread resolved, outdated)
    value_parser = client_parser,
    default_value = "",
)]
pub filter_clients: Arc<Vec<ClientType>>,
Collaborator:

nit: Clearly not related to this PR (as it is not the only place that does this) and you can ignore it, but why do we wrap config fields in Arc?

Is it so that we can cheaply copy it? I don't think there is a huge performance saving, as these arguments shouldn't be that big and they shouldn't be copied all the time (only at startup, during initialization). Alternatively, we could wrap the entire BridgeConfig in an Arc.

Asking this because without Arc, I believe we don't need custom parser logic (i.e. client_parser), and we could just use value_delimiter.

Collaborator (author):

Well, I'm not sure exactly what causes the problem, but it seems to come from how clap handles the custom parser method. The compiler doesn't complain, but when you run the program (without using Arc), you get the following error:

Mismatch between definition and access of `filter_clients`. Could not downcast to portal_bridge::cli::ClientType, need to downcast to alloc::vec::Vec<portal_bridge::cli::ClientType>

However, in this case (ClientType), since we really don't need any custom parsing logic (unlike subnetwork), we can just use value_delimiter as you pointed out.
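
A minimal sketch of the value_delimiter version, assuming ClientType can derive clap::ValueEnum; the flag and field names follow this PR, while the variants and doc comment are illustrative:

use clap::Parser;

#[derive(clap::ValueEnum, Clone, Debug, PartialEq)]
enum ClientType {
    Trin,
    Fluffy,
    Ultralight,
    Unknown,
}

#[derive(Parser, Debug)]
struct BridgeConfig {
    /// Comma-separated client types to skip when offering content,
    /// e.g. --filter-clients ultralight,fluffy
    #[arg(long, value_delimiter = ',')]
    filter_clients: Vec<ClientType>,
}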

@njgheorghita (Collaborator, Author) commented:

Clearly, there is some discussion that needs to happen about "peer-scoring" (which we can have on Monday during sync). I agree that my approach of bluntly filtering out certain client implementations is 1. not nice 2. not smart 3. not sustainable. But, it is immediately useful (particularly in testing trin's concurrency via the bridge), so I'm gonna go ahead with the merge, with full understanding that this might be reversed post-conversation

@njgheorghita njgheorghita merged commit 3142c5c into ethereum:master Sep 27, 2024
9 checks passed
@njgheorghita njgheorghita deleted the bridge-debug branch September 27, 2024 18:24