~80 peers, but config has been long set to 40 + 10 #2347

Closed
5 tasks
Tracked by #2617
gaia opened this issue Apr 1, 2023 · 15 comments
Labels
scope: comet-bft type: bug Issues that need priority attention -- something isn't working


@gaia

gaia commented Apr 1, 2023

Summary of Bug

More peers than allowed in config.

Version

9.0.1

Steps to Reproduce

How is this possible? Note the last modification date of config.toml, the last time the service was restarted, and how many peers the node should have versus how many it has.

$ cat config.toml | grep _num_ && curl -s http://localhost:26657/net_info | jq -r .result.n_peers && /home/ubuntu/bin/gaiad version && systemctl status gaiad | grep 'Active:' && ps aux | grep [g]aiad && ll config.toml
max_num_inbound_peers = 40
max_num_outbound_peers = 10
82
v9.0.1
     Active: active (running) since Sat 2023-04-01 19:39:22 UTC; 2h 58min ago
ubuntu       140  0.0 50.6 9698672 7915472 ?     Ssl  19:39 102:39 /home/ubuntu/bin/gaiad start
-rw-r--r-- 1 ubuntu ubuntu 19K Mar 20 23:38 config.toml

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
  • Is a spike necessary to map out how the issue should be approached?
@gaia gaia added type: bug Issues that need priority attention -- something isn't working status: waiting-triage This issue/PR has not yet been triaged by the team. labels Apr 1, 2023
@github-project-automation github-project-automation bot moved this to 🩹 Triage in Cosmos Hub Apr 1, 2023
@mpoke mpoke moved this from 🩹 Triage to 📥 Todo in Cosmos Hub Apr 2, 2023
@mpoke mpoke removed the status: waiting-triage This issue/PR has not yet been triaged by the team. label Apr 2, 2023
@faddat
Contributor

faddat commented Apr 10, 2023

Do we think that this is a Gaia-specific bug, or is it an issue in Tendermint or Comet?

@gaia
Author

gaia commented Apr 18, 2023

Noticed it on Juno v14.0.0 but not on Juno v14.1.0. Maybe something got fixed upstream.

@adizere

adizere commented May 4, 2023

Hi @gaia,
how many nodes do you have configured in your unconditional_peer_ids? See spec/p2p:

Unconditional Peers
These are the IDs of peers that are allowed to connect, whether inbound or outbound, regardless of whether the node has already reached max_num_inbound_peers or max_num_outbound_peers.
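
As a rough illustration of the rule quoted above, the following Python sketch models an admission check in which unconditional peers bypass the inbound limit. It is not CometBFT's implementation: the function name `admits_inbound` and the peer IDs are hypothetical, and only `max_num_inbound_peers` and `unconditional_peer_ids` correspond to real config keys.

```python
def admits_inbound(peer_id, inbound_count, max_num_inbound_peers, unconditional_peer_ids):
    """Return True if a new inbound peer would be admitted under this simplified model."""
    # Unconditional peers are admitted even when the limit has been reached.
    if peer_id in unconditional_peer_ids:
        return True
    # Everyone else is admitted only while there is room below the limit.
    return inbound_count < max_num_inbound_peers


# Example: the node is already at its inbound limit of 40.
unconditional = {"peer-a", "peer-b"}  # hypothetical IDs
print(admits_inbound("peer-a", 40, 40, unconditional))  # True: unconditional peer
print(admits_inbound("peer-z", 40, 40, unconditional))  # False: ordinary peer, no room
```

Under this model, each configured unconditional peer can add at most one connection beyond the limits.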

@adizere

adizere commented May 4, 2023

Do we think that this is a Gaia-specific bug, or is it an issue in Tendermint or Comet?

Likely Comet.

@gaia
Author

gaia commented May 4, 2023

Hi @gaia, how many nodes do you have configured in your unconditional_peer_ids? See spec/p2p:

Unconditional Peers
These are the IDs of peers that are allowed to connect, whether inbound or outbound, regardless of whether the node has already reached max_num_inbound_peers or max_num_outbound_peers.

Good point, and it would be the obvious answer, but I have at most 2 or 3 configured on any chain where I see the issue.

@adizere

adizere commented May 5, 2023

It's possible there's a race condition between ensurePeers and acceptRoutine. We still need to double-check this, and currently it's very difficult because there's no specification, but maybe some of the unit tests could help.
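
The kind of race suspected here can be pictured as a check-then-act problem. The Python model below is purely illustrative and is not taken from CometBFT: `PeerSet`, `has_room`, and `add` are hypothetical names, a single shared limit is assumed for simplicity, and the two routines are interleaved by hand so that the outcome is deterministic.

```python
class PeerSet:
    """Toy peer set whose limit is enforced only by its callers (hypothetical model)."""

    def __init__(self, max_peers):
        self.max_peers = max_peers
        self.peers = []

    def has_room(self):
        return len(self.peers) < self.max_peers

    def add(self, peer_id):
        # No limit check here: callers are expected to call has_room() first.
        self.peers.append(peer_id)


peers = PeerSet(max_peers=1)

# Both routines check for room before either one adds its peer.
dialer_sees_room = peers.has_room()    # stands in for an outbound dialing routine
acceptor_sees_room = peers.has_room()  # stands in for an inbound accept routine

if dialer_sees_room:
    peers.add("outbound-peer")
if acceptor_sees_room:
    peers.add("inbound-peer")

print(len(peers.peers), "peers with a limit of", peers.max_peers)
# prints: 2 peers with a limit of 1
```

Making the check and the insertion one atomic step would close this gap in the toy model; whether the real code has such a gap is exactly what still needs to be verified.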

Thanks @gaia for reporting this! Our team's triaging and debugging capacity is still ramping up, but we're looking into it.

@mmulji-ic shall I transfer this issue to the cometbft repo? Or would you like to handle that? I'm quite certain this is not Gaia-specific.

@adizere

adizere commented May 5, 2023

Can also tag it with the cometbft label.

@mmulji-ic
Contributor

Hi @adizere, we would still like to track this issue here. Could you open a new issue in the cometbft repo and link back to this one? I've added the comet-bft tag.

@cason

cason commented May 23, 2023

The most likely cause is the one described in cometbft/cometbft#486

@cason

cason commented May 23, 2023

In short, when a node is short of peer addresses it dials the configured seed nodes. When receiving addresses back from a seed, the node immediately starts dialing the provided addresses. This "fast dialing" execution flow disregards the maximum outbound peers configuration flag.

To confirm this hypothesis, is it the inbound or the outbound peers that exceed the configured maximum?
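
The fast-dialing flow described in this comment can be modelled roughly as follows. This Python sketch is a simplification under stated assumptions, not CometBFT code: `dial_addresses_from_seed` and its `respect_limit` switch are hypothetical, the batch of 30 addresses is made up, and every dial is assumed to succeed.

```python
def dial_addresses_from_seed(addresses, outbound_count, max_num_outbound_peers, respect_limit):
    """Return the outbound peer count after dialing addresses received from a seed."""
    for _ in addresses:
        # Only the limit-respecting variant stops once the maximum is reached.
        if respect_limit and outbound_count >= max_num_outbound_peers:
            break
        outbound_count += 1  # assumption: the dial succeeds
    return outbound_count


seed_reply = [f"addr-{i}" for i in range(30)]  # hypothetical batch of 30 addresses

print(dial_addresses_from_seed(seed_reply, 5, 10, respect_limit=False))  # 35: limit ignored
print(dial_addresses_from_seed(seed_reply, 5, 10, respect_limit=True))   # 10: limit enforced
```

If this is the cause, the excess should show up in the outbound count rather than the inbound one.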

@adizere

adizere commented May 24, 2023

To confirm this hypothesis, is it the inbound or the outbound peers that exceed the configured maximum?

@cason I'm not sure there is a way to distinguish between inbound and outbound peers from JSON-RPC /net_info calls. The n_peers result is an aggregate. Or is there another way?

@gaia Can you reproduce the issue, and do you know how to distinguish between inbound and outbound peers?

@cason

cason commented May 24, 2023

I am not sure, but this information is printed in the logs every 30 seconds at INFO level; see here: https://github.com/cometbft/cometbft/blob/main/p2p/pex/pex_reactor.go#L457

@gaia
Author

gaia commented May 24, 2023

I am not able to reproduce the issue currently. The max peers limit is being respected across several different clients, not just gaiad. Maybe it's because they've been running for a while and it's no longer dialing? I will check again later.

# configured limits
grep -E '_inbound|_outbound' config.toml
# inbound peer count (filter quoted so the shell does not expand the brackets)
curl -s http://localhost:26657/net_info | jq '.result.peers[].is_outbound' | grep -c false
# outbound peer count
curl -s http://localhost:26657/net_info | jq '.result.peers[].is_outbound' | grep -c true
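
The same counts can be computed from a saved /net_info response in Python. This is a minimal sketch that assumes the response has the shape used by the commands above (`result.peers[].is_outbound`); the helper name `count_peers` and the sample payload are made up for illustration.

```python
import json


def count_peers(net_info_json):
    """Return (inbound, outbound) peer counts from a /net_info JSON response."""
    peers = json.loads(net_info_json)["result"]["peers"]
    outbound = sum(1 for peer in peers if peer["is_outbound"])
    return len(peers) - outbound, outbound


# Made-up sample payload with two inbound peers and one outbound peer.
sample = json.dumps({
    "result": {
        "peers": [
            {"is_outbound": False},
            {"is_outbound": True},
            {"is_outbound": False},
        ],
    }
})

inbound, outbound = count_peers(sample)
print(f"inbound={inbound} outbound={outbound}")  # inbound=2 outbound=1
```

Comparing the two numbers against max_num_inbound_peers and max_num_outbound_peers shows which limit, if any, is being exceeded.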

@cason

cason commented May 25, 2023

Maybe it's because they've been running for a while and it's no longer dialing?

The situation I mentioned above only happens when the node dials a seed node. A node only dials a seed node when it is short of addresses. This can happen when the node is fresh, with no addresses at all in its address book, and did not manage to retrieve enough addresses from its initial peers (e.g. persistent peers).

@gaia
Author

gaia commented Aug 29, 2023

Happening now on Osmosis v18.0.0. Moving the discussion to cometbft/cometbft#486

@gaia gaia closed this as completed Aug 29, 2023
@github-project-automation github-project-automation bot moved this from 🛑 Blocked to ✅ Done in Cosmos Hub Aug 29, 2023