Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: clean up p2p & implement missing peering functionality #2852

Open
wants to merge 64 commits into
base: master
Choose a base branch
from

Conversation

zivkovicmilos
Copy link
Member

@zivkovicmilos zivkovicmilos commented Sep 26, 2024

Description

Closes #2308

There is a lot happening in this PR, so don't be discouraged by the files changed.
I'll outline the way the PR should be reviewed, and give you pointers along the way.

What this PR set out to do

This PR initially set out to implement peer discovery in the TM2 p2p module -- that's it, basically.

The change was meant to be minimal, and not involve larger changes.
After spending more time than I'd like to admit in the p2p codebase, I've come to a couple of realizations:

  • the code is insanely complex for what it needs to be doing.
  • there are premature "optimizations" on every corner, with no real reason for them.
  • the unit tests in the p2p package are sketchy to say the least, and not convincing at all that we were covering actual functionality we needed to cover.
  • there are random disconnection issues with larger clusters.

All of these are temporarily fine, and not blocking us.

But the pattern was there -- to add peer discovery, it would require continuing the same pattern of code gymnastics present in the p2p module for the last 5+ years.

I took this opportunity, ahead of the mainnet launch, to fill out a few checkboxes:

  • simplify the code, trim everything that's excess. This is in line with our project PHILOSOPHY.md.
  • create a safety net in the form of integration tests, and unit tests, that run instantly, and give us the confidence stuff actually works.
  • make it easy peasy for us to debug future p2p problems, when they arise (we already keep seeing them on existing testnets, like test4 and test5.

I wanted to implement all of this, without breaking any existing TM2 functionality that relies on the p2p module, or introducing additional complexities.

What this PR actually accomplished

I'm proud to say that this PR brings more than a few bells and whistles to the table, in terms of TM2 improvements:

  • greatly simplified and faster Switch and Transport implementations. No more gazillion redundant checks, or expensive lookups, or convoluted APIs.
  • a unit testing suite we can be proud of, and have confidence in. I rewrote the entire testing suite for the module, because the old implementation had severe limitations on mock-ability.
  • peer discovery that works, and is not invasive. Goodbye random network pockets, and hanging nodes!
  • many bugs, and potential issues squashed and erased. Regressions added, and passing.

For the sake of not making groundbreaking changes ahead of mainnet, I didn't touch a few things, and this will be evident from the code:

  • I didn't touch the conn package, or how the multiplex connections are established and maintained (or the STS implementation) -- this would be too much, and require an exponential amount of time to get right.
  • the Peer abstraction is still the same, and TM2 modules interact with the peers in the same way as before (directly, as Reactors). I've outlined the issues with this in the README, so check it out.

In retrospect, I should've limited the scope of this PR by a lot. At the time I was at the mid-way point, I committed fully to leaving this module in a better state than I found it in, rather than leave additional tech debt for future cleanup.

The other primary goal of this PR is to scope out changes needed to upgrade the networking layer implementation into utilizing a stack like libp2p. I am happy to say that we have 0 limitations in terms of p2p functionality to make the switch. We're just bound by time. This upgrade is scheduled for after the mainnet MVP launch 🤞

How do I review this PR?

There is no point in looking at the older implementation, and trying to figure out the changes from there.

There are just too many, and it can get overwhelming quickly.

Instead, as the first step -- read the new p2p README. It outlines how the p2p module works on a core level, and highlights current challenges.

After the README, open the p2p package in tm2/pkg/p2p, and start looking at the implementation from there. Leave comments on things that are unclear, or can be improved. I'll try to answer and give as much context as I can.

What p2p config params are changed?

Here is a complete list of changed p2p configuration params, and the reasoning behind them:

  • UPNP - UPNP port forwarding, it was completely unused, removed.
  • PexReactor - peer exchange reactor enable-ment flag, renamed to PeerExchange, rename
  • SeedMode - enabled network crawls, was tied to PexReactor being true. Useless flag, since PeerExchange exists, removed
  • AllowDuplicateIP - useless flag to prevent same-IP dials, even outside previous dial "filters", removed
  • HandshakeTimeout - excessive config option for setting an STS timeout, sane default is 3s, removed
  • DialTimeout - excessive config option for finishing a peer dial, sane default is 3s, removed

The following config options were removed, as they related to testing, and were replaced by unit / integration tests.

  • TestDialFail
  • TestFuzz
  • TestFuzzConfig
Contributors' checklist...
  • Added new tests, or not needed, or not feasible
  • Provided an example (e.g. screenshot) to aid review or the PR is self-explanatory
  • Updated the official documentation or not needed
  • No breaking changes were made, or a BREAKING CHANGE: xxx message was included in the description
  • Added references to related issues and PRs
  • Provided any useful hints for running manual tests
  • Added new benchmarks to generated graphs, if any. More info here.

@zivkovicmilos zivkovicmilos added 🌱 feature New update to Gno 📦 🌐 tendermint v2 Issues or PRs tm2 related labels Sep 26, 2024
@zivkovicmilos zivkovicmilos self-assigned this Sep 26, 2024
@github-actions github-actions bot added the 📦 ⛰️ gno.land Issues or PRs gno.land package related label Sep 26, 2024
@zivkovicmilos zivkovicmilos changed the title feat: implement bootnode (seed peers) support feat: clean up p2p & implement bootnode (seed peers) support Oct 1, 2024
tm2/pkg/bft/blockchain/pool.go Outdated Show resolved Hide resolved
@@ -36,10 +26,10 @@ type P2PConfig struct {
UPNP bool `json:"upnp" toml:"upnp" comment:"UPNP port forwarding"`

// Maximum number of inbound peers
MaxNumInboundPeers int `json:"max_num_inbound_peers" toml:"max_num_inbound_peers" comment:"Maximum number of inbound peers"`
MaxNumInboundPeers uint64 `json:"max_num_inbound_peers" toml:"max_num_inbound_peers" comment:"Maximum number of inbound peers"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsigned may be better than signed if we don't need to use patterns like -1. However, since we're constrained by the OS limits, a size of 64 may be excessive.

@Kouteki Kouteki added the in focus Core team is prioritizing this work label Oct 18, 2024
@zivkovicmilos
Copy link
Member Author

There is a leftover issue with this PR that I'm currently investigating, which is a hanging test somewhere in TM2.

I will resolve this before opening the PR for reviews 🙏

@Kouteki Kouteki mentioned this pull request Nov 19, 2024
@zivkovicmilos zivkovicmilos marked this pull request as ready for review November 19, 2024 16:58
@zivkovicmilos zivkovicmilos requested a review from moul November 19, 2024 16:58
@zivkovicmilos
Copy link
Member Author

Opening this up for an initial review 🙏

I have no idea why the gno.land testing suite hangs on the PR -- it runs fine locally 🤷‍♂️

@zivkovicmilos
Copy link
Member Author

@sw360cab @albttx @r3v4s

Ready for you to take this for a spin 🙏

Copy link
Member

@albttx albttx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you check somewhere to not have a number of peers greater than max_num_outboud_peers?
I see some calls to NumOutbound(), but i don't think i saw this limitation.

I've made a test with a local node, and i'm failing to peer with test5.

A devnet cluster with this PR would be amazing to make some tests

// peers. If another node asks it for addresses, it responds and disconnects.
//
// Does not work if the peer-exchange reactor is disabled.
SeedMode bool `json:"seed_mode" toml:"seed_mode" comment:"Seed mode, in which node constantly crawls the network and looks for\n peers. If another node asks it for addresses, it responds and disconnects.\n\n Does not work if the peer-exchange reactor is disabled."`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there is no seed_mode anymore right ? So all pex node are seeding and constently crowling the network ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seed mode and Pex are essentially tied together in Comet -- you cannot have seed mode if pex is false.

I've dropped these concepts, and adopted a much more simpler PeerExchange that does network crawling

Copy link
Contributor

@ajnavarro ajnavarro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to review this PR as thoroughly as I could, but it is too much. I vote to merge as it is and change/fix things as needed.

// the consensus machine
SwitchToConsensus(sm.State, int)
}
type SwitchToConsensusFn func(sm.State, int)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add some comments here to understand why this is needed?

@@ -1777,7 +1777,7 @@ func TestStateOutputVoteStats(t *testing.T) {

cs, vss := randConsensusState(2)
// create dummy peer
peer := p2pmock.NewPeer(nil)
peer := &p2pmock.Peer{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why one is a pointer and the other is not?

@@ -162,7 +163,7 @@ func (memR *Reactor) Receive(chID byte, src p2p.Peer, msgBytes []byte) {

switch msg := msg.(type) {
case *TxMessage:
peerID := memR.ids.GetForPeer(src)
peerID := memR.ids.GetForPeer(src.ID())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit strange IMO. src is already a Peer with an ID, why do we have to call ids.GetForPeer to get the peerID?

Comment on lines +438 to +448
reactors := []nodeReactor{
{
"MEMPOOL", mempoolReactor,
},
{
"BLOCKCHAIN", bcReactor,
},
{
"CONSENSUS", consensusReactor,
},
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use constants for these strings?

if config.P2P.PeerExchange {
discoveryReactor = discovery.NewReactor()

discoveryReactor.SetLogger(logger.With("module", "discovery"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use constant for these strings?

}

// PeerSet has a (immutable) subset of the methods of PeerSet.
type PeerSet interface {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No error handling, so really difficult to implement that interface using other persistence than memory...


// Grab a random peer, and engage
// them for peer discovery
peers := r.Switch.Peers().List()
Copy link
Contributor

@ajnavarro ajnavarro Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many elements peers are we expecting to have here? are they always on memory? I think that this interface should return an iterator instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code of discovery.go and Switch is a bit flaky in general, there is almost no error handling. For something like network connections, it is super important to have a robust system.

Comment on lines +38 to +39
Send(byte, []byte) bool
TrySend(byte, []byte) bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No error handling, so the amount of info you can send to the callee is just yes/no. Also, what is the difference between Send and TrySend?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @ajnavarro that it is hard to understand the difference between the two methods.
I guess bool is an alternative of the classic err pattern?

Accept(context.Context, PeerBehavior) (Peer, error)

// Dial dials a peer, and returns it
Dial(context.Context, types.NetAddress, PeerBehavior) (Peer, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why Dial is returning a Peer instead of something like a PeerConnection?

@Gno2D2
Copy link
Collaborator

Gno2D2 commented Nov 29, 2024

🛠 PR Checks Summary

All Automated Checks passed. ✅

Manual Checks (for Reviewers):
  • IGNORE the bot requirements for this PR (force green CI check)
Read More

🤖 This bot helps streamline PR reviews by verifying automated checks and providing guidance for contributors and reviewers.

✅ Automated Checks (for Contributors):

No automated checks match this pull request.

☑️ Contributor Actions:
  1. Fix any issues flagged by automated checks.
  2. Follow the Contributor Checklist to ensure your PR is ready for review.
    • Add new tests, or document why they are unnecessary.
    • Provide clear examples/screenshots, if necessary.
    • Update documentation, if required.
    • Ensure no breaking changes, or include BREAKING CHANGE notes.
    • Link related issues/PRs, where applicable.
☑️ Reviewer Actions:
  1. Complete manual checks for the PR, including the guidelines and additional checks if applicable.
📚 Resources:
Debug
Manual Checks
**IGNORE** the bot requirements for this PR (force green CI check)

If

🟢 Condition met
└── 🟢 On every pull request

Can be checked by

  • Any user with comment edit permission

Copy link
Contributor

@sw360cab sw360cab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not easy to review the whole PR since it involves a lot of changes. I generally understood that it aims to make all the p2p networking longer smooth, and the PR description and the README (tm2/pkg/p2p/README.md) explains the main goals perfectly.
I tend to agree with @ajnavarro that we should somehow merge&see, since it is not easy to imagine any possible corner case or error handling required.
Finally I think that this is a great improvement and apart from maybe more testing and tighter error handling, it should be considered as the cornerstone for any other future improvement related to the p2p networking.

sw.peerBehavior.reactorsByCh[chID] = reactor
}

sw.reactors[name] = reactor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MultiplexSwitch is holding the map of reactors and each reactor is setting the switch in turn.
Is the reactor existence tied to the application running, or they their removal is handled somewhere?

This process repeats at specific intervals. It is worth nothing that if the limit of outbound peers is reached, the peer
dials have no effect.

#### Bootnodes (Seeds)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this differentiation present somehow in the code?

Because currently it is not easy to understand the purpose or the difference between a persistent peer and a boot/seed node (as long as boot and seed do mean the same thing here)

There is a final service that runs alongside the previously-mentioned `Switch` services — peer discovery.

Every blockchain node needs an adequate amount of peers to communicate with, in order to ensure smooth functioning. For
validator nodes, they need to be *loosely connected* to at least 2/3+ of the validator set in order to participate and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the node have any awareness of that?

I mean since the network is loosely connected it is not straightforward for the node to know that is current topology network is connected enough to satisfy the 2/3.

Comment on lines +38 to +39
Send(byte, []byte) bool
TrySend(byte, []byte) bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @ajnavarro that it is hard to understand the difference between the two methods.
I guess bool is an alternative of the classic err pattern?

@@ -154,10 +153,14 @@ import (
// }
//
// ```
func NetInfo(ctx *rpctypes.Context) (*ctypes.ResultNetInfo, error) {
out, in, _ := p2pPeers.NumPeers()
func NetInfo(_ *rpctypes.Context) (*ctypes.ResultNetInfo, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why exactly this method, which is supposed to give information about the network, is actually updating the network information too.
Does it mean the network status is only updated upon an explicit request?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of this methods is confusing. It is repeatedly reported testing purpose, so why they are not in an explicit _test file or package?
It seems that it is only used within test files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in focus Core team is prioritizing this work 📦 🌐 tendermint v2 Issues or PRs tm2 related 📦 ⛰️ gno.land Issues or PRs gno.land package related 🌱 feature New update to Gno
Projects
Status: In Progress
Status: In Review
7 participants