
Fix broken bridge stats reporting & add filter-clients flag #1493

Merged: 3 commits merged into ethereum:master from bridge-debug on Sep 27, 2024

Conversation

@njgheorghita (Collaborator) commented on Sep 26, 2024

What was wrong?

  • The changes made in fix: update slow state bridge defaults #1490 (specifically, using OfferReport as a global offer report) didn't work well, so I introduced a new, simpler type to manage "global" offer reporting (a rough sketch of such a type follows this list).
  • When running the bridge locally, it seems to have quite a bit of trouble communicating with ultralight nodes, so I added a flag to filter out specific client types. It isn't necessarily meant to be used on all of our bridges, but maybe on a few to speed up the process.
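
A rough sketch (not the PR's exact code) of such a global offer-report type; the field names and the tracing call are illustrative assumptions, not taken from the diff:

#[derive(Default, Debug)]
struct GlobalOfferReport {
    success: usize,
    declined: usize,
    failed: usize,
}

impl GlobalOfferReport {
    fn report(&self) {
        let total = self.success + self.declined + self.failed;
        if total == 0 {
            return;
        }
        // Log aggregate percentages instead of per-content detail.
        tracing::info!(
            "Global offer report: total={} success={}% declined={}% failed={}%",
            total,
            100 * self.success / total,
            100 * self.declined / total,
            100 * self.failed / total,
        );
    }
}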

How was it fixed?

To-Do

@njgheorghita njgheorghita changed the title Bridge debug Fix broken bridge stats reporting & add filter-clients flag Sep 26, 2024
@njgheorghita njgheorghita force-pushed the bridge-debug branch 2 times, most recently from 0a09161 to 3a306ed on September 26, 2024 19:15
@njgheorghita njgheorghita marked this pull request as ready for review September 26, 2024 19:39
@njgheorghita njgheorghita requested review from morph-dev, KolbyML and carver and removed request for morph-dev and KolbyML September 26, 2024 19:39
portal-bridge/src/bridge/state.rs (review thread resolved, outdated)
portal-bridge/src/bridge/state.rs (review thread resolved)

impl From<&Enr> for ClientType {
    fn from(enr: &Enr) -> Self {
        if let Some(client_id) = UtpEnr(enr.clone()).client() {
Collaborator:

Since this doesn't have anything to do with uTP, this usage stands out as strange. Should it be renamed? Should we extract enr parsing logic into a new more-generic struct?

Member:

ENRs shouldn't hold client_id in a month or two, so whichever option is chosen here, it shouldn't be too bad.

Collaborator:

Now we're really going on a tangent, but since we started: I can see some downsides to this direction of pinging for metadata, like we would have a new funny class of peer that is connected, but we don't have enough information to interact with them yet.

Member:

but we don't have enough information to interact with them yet.

I am not sure what this means. We shouldn't need to know which implementation we are talking to in order to interop with other clients.

Collaborator:

We shouldn't need to know which implementation we are talking to interop with other clients.

We shouldn't, but clearly it is the most expedient option at times, like in this PR. I'm all for eventually moving to peer scoring and all that. It will just be more painful to use this PR's approach until then (and maybe there are other similar scenarios yet to be discovered).

Collaborator (author):

Agreed, using UtpEnr here doesn't track that well. Enr is a "foreign" type, so it's not so simple to implement client() directly on it. I just copied the logic over since this might change soon, but regardless of where we go with pings in the future, we'll be able to recognize that this logic also needs updating if we remove ENR_PORTAL_CLIENT_KEY.
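
For reference, a minimal sketch of the "extract into something more generic" idea from this thread, assuming the Enr, UtpEnr, and ClientType types from this PR and that client() returns Option<String>; the enum variants matched below are illustrative, not the PR's exact definitions:

// Extension trait so client() can hang off the foreign Enr type directly,
// instead of constructing a UtpEnr at each call site.
trait EnrClientExt {
    fn client(&self) -> Option<String>;
}

impl EnrClientExt for Enr {
    fn client(&self) -> Option<String> {
        // Delegate to the existing uTP logic for now, as the PR does.
        UtpEnr(self.clone()).client()
    }
}

impl From<&Enr> for ClientType {
    fn from(enr: &Enr) -> Self {
        match enr.client().as_deref() {
            // Illustrative mapping; the real one depends on the strings
            // clients advertise under ENR_PORTAL_CLIENT_KEY.
            Some(id) if id.contains("trin") => ClientType::Trin,
            Some(id) if id.contains("ultralight") => ClientType::Ultralight,
            _ => ClientType::Unknown,
        }
    }
}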

portal-bridge/src/census.rs (review thread resolved, outdated)
portal-bridge/src/census.rs (review thread resolved, outdated)
@@ -256,7 +278,7 @@ impl Network {
                None
            }
        })
-       .take(ENRS_RESPONSE_LIMIT)
+       .take(self.enr_offer_limit)
Collaborator:

So I guess whatever order peers is in is the order that we tend to prefer offering content to. Maybe we should shuffle (or partial_shuffle) before taking here, in order to push to a broader mix of peers

Collaborator (author):

Well, the order in this case should always be the same: sorted by closest to the content id. Each census has the same view of the network (eg 100% of peers), so each census will be gossiping the same content to the same peers. Which is not great. But eventually, the state bridges (as we "scale" horizontally) will only gossip content that is close to their own node id. In that case, we lose this redundancy but keep the desired property of gossiping content to the closest nodes on the network (which would be lost if we shuffle). I guess partial_shuffle is an option, and it would mitigate the grey area that we're currently in (aka until we're horizontally scaled). But I'm not sure it's that important overall... though I could be convinced otherwise.
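
For illustration, a minimal sketch of the partial_shuffle option using the rand crate, assuming a peers: Vec<Enr> already filtered to interested nodes and the enr_offer_limit added in this PR:

use rand::seq::SliceRandom;

fn pick_offer_targets(mut peers: Vec<Enr>, enr_offer_limit: usize) -> Vec<Enr> {
    let mut rng = rand::thread_rng();
    let limit = enr_offer_limit.min(peers.len());
    // partial_shuffle moves `limit` randomly chosen peers to the front of
    // the slice, trading strict closest-first order for a broader mix of
    // offer targets.
    let (selected, _rest) = peers.partial_shuffle(&mut rng, limit);
    selected.to_vec()
}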

@KolbyML (Member) commented on Sep 26, 2024

If ultralight can't perform, maybe they are not ready for mainnet.

When running the bridge locally, it seems to have quite a bit of trouble communicating with ultralight nodes. I added a flag to filter out specific client types. Not necessarily meant to be used on all of our bridges, but maybe on a few to speed up the process

I think a long-term solution to this problem will be a reputation system in which, instead of banning an implementation if a client sends us errors, we put them on a block list for X period of time.
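
A rough sketch of that block-list idea (not part of this PR), generic over whatever peer identifier the census uses:

use std::collections::HashMap;
use std::hash::Hash;
use std::time::{Duration, Instant};

struct TemporaryBlockList<Id: Eq + Hash> {
    cooldown: Duration,
    blocked_at: HashMap<Id, Instant>,
}

impl<Id: Eq + Hash> TemporaryBlockList<Id> {
    fn new(cooldown: Duration) -> Self {
        Self { cooldown, blocked_at: HashMap::new() }
    }

    // Record a misbehaving peer so it is skipped for the cooldown period.
    fn block(&mut self, peer: Id) {
        self.blocked_at.insert(peer, Instant::now());
    }

    // True while the peer is still inside its cooldown window; expired
    // entries are forgiven and removed.
    fn is_blocked(&mut self, peer: &Id) -> bool {
        match self.blocked_at.get(peer) {
            Some(when) if when.elapsed() < self.cooldown => true,
            Some(_) => {
                self.blocked_at.remove(peer);
                false
            }
            None => false,
        }
    }
}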

In any case, it is fairly trivial to write a script to populate all of our State node databases with the first 1 million blocks locally. So if bridging doesn't work out before Devcon, we will still be safe.

@KolbyML (Member) left a comment:

I really don't like how we need a flag to block connections with a certain implementation.

If that is the case, I think either:

  • they shouldn't run their clients on mainnet, because they aren't ready and it is in bad faith, or
  • we build our client to be resilient to bad actors, which I don't think this PR does.

I think it would be easier to build a client resistant to bad actors than to ship a short-term solution like this, especially as we aren't in a rush to meet deadlines; if we need the data seeded, we can do that fairly easily for a demo. This is a short-term solution and I have trouble understanding why we need it. I looked at the logs of the bridges recently and they are impossible to read, so there is potentially value in the logging update of this PR. I have seen how much of an issue Ultralight is being for our bridges, but I am not sure this is the solution to the problem.

1 million blocks is just for a Devcon demo. A question I ask myself is: are we building towards being able to sustain a network with the latest state as fast as possible?

Currently we are running 3 state bridges which are gossiping redundant data. Maybe to speed up bridging we can implement the Horizontal Bridging architecture next. This would also be very high value as we need it for latest state bridging, but it would help us bridge 0-1 million blocks faster as well!

:shipit: Anyways, the PR overall looks good. I brain-dumped my views a bit; sharing ideas or brainstorming is sometimes hard remotely. If anybody has ideas on what the next step for the bridge should be, feel free to contribute as well. Hopefully we can build the best bridge possible!

I think the current situation is a little unfortunate, but I definitely see it as an opportunity to build more robust software.

I think

  • a horizontally scaling bridge architecture
  • smarter offer patterns
    • a resilient bridge that takes how other nodes perform on the network into consideration when gossiping
    • content-size awareness
    • etc

is definitely a direction we want to head in; I probably missed something.

@KolbyML (Member) commented on Sep 26, 2024

Another thought for the bridge:

Instead of the bridge running a Trin executable, use Trin in the Bridge as a library. Then, we could potentially build the bridge as an extension of Trin or reuse Trin's routing table and remove the storage limit.

By doing this we would benefit from

  • reduced JSON-RPC throughput overhead: encoding big hex strings is very slow, so being able to make calls directly would be more efficient (I believe during our EF monthly team show-off day there was a request for something like this, and it would be useful for the bridge)
  • reduced overhead from maintaining 2 different routing tables
  • I would no longer need a Trin executable in my target folder when running the bridge, which can sometimes be annoying (this is more of a personal one xD)

@morph-dev (Collaborator) left a comment:

Just wanted to add something regarding client filtering.

I believe this is good only as a temporary solution. We should look into making the census more robust, which we can do somewhat easily because we keep track of the entire network anyway.

portal-bridge/src/bridge/state.rs (review thread resolved, outdated)
portal-bridge/src/cli.rs (review thread resolved, outdated)
    value_parser = client_parser,
    default_value = "",
)]
pub filter_clients: Arc<Vec<ClientType>>,
Collaborator:

nit: Clearly not related to this PR (as it is not the only place that does this) and you can ignore it, but why do we wrap config fields in Arc?

Is it so that we can cheaply copy it? I don't think there is a huge performance saving, as these arguments shouldn't be that big and they shouldn't be copied all the time (only at startup, during initialization). Alternatively, we could wrap the entire BridgeConfig in an Arc.

Asking this because without Arc, I believe we don't need custom parser logic (i.e. client_parser), and we could just use value_delimiter.

Collaborator (author):

Well, I'm not sure exactly what causes the problem, but it seems to come from how clap handles the custom parser method. The compiler doesn't complain, but when you run the program (without using Arc), you get the following error:

Mismatch between definition and access of `filter_clients`. Could not downcast to portal_bridge::cli::ClientType, need to downcast to alloc::vec::Vec<portal_bridge::cli::ClientType>

However, in this case (ClientType), since we really don't need any custom parsing logic (unlike subnetwork), we can just use value_delimiter as you pointed out.
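
A minimal sketch of the value_delimiter version, assuming ClientType can derive clap::ValueEnum; the flag and field names follow this PR, while the variants and doc comment are illustrative:

use clap::Parser;

#[derive(clap::ValueEnum, Clone, Debug, PartialEq)]
enum ClientType {
    Trin,
    Fluffy,
    Ultralight,
    Unknown,
}

#[derive(Parser, Debug)]
struct BridgeConfig {
    /// Comma-separated client types to skip when offering content,
    /// e.g. --filter-clients ultralight,fluffy
    #[arg(long, value_delimiter = ',')]
    filter_clients: Vec<ClientType>,
}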

@njgheorghita (Collaborator, Author) commented:

Clearly, there is some discussion that needs to happen about "peer-scoring" (which we can have on Monday during sync). I agree that my approach of bluntly filtering out certain client implementations is 1. not nice 2. not smart 3. not sustainable. But, it is immediately useful (particularly in testing trin's concurrency via the bridge), so I'm gonna go ahead with the merge, with full understanding that this might be reversed post-conversation

@njgheorghita njgheorghita merged commit 3142c5c into ethereum:master Sep 27, 2024
9 checks passed
@njgheorghita njgheorghita deleted the bridge-debug branch September 27, 2024 18:24