Production go-ipfs going bananas #39
Comments
We're one version behind the latest stable version of go-ipfs (we're on 0.4.19, latest is 0.4.20). Worth trying to upgrade to see if it helps, as it's supposed to include performance improvements.
According to @raulk from libp2p (discussion in Freenode #ipfs), this seems related to the following issues:
@vyzo if you need any information for tracking down the issue (a pprof dump or whatever), let me know!
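For context, go-ipfs serves the standard Go pprof endpoints on its API address, so a dump could be grabbed roughly like this (a sketch; assumes the default API at 127.0.0.1:5001):

```sh
# Goroutine stacks, useful for spotting a stuck connection manager
curl -o goroutines.txt 'http://127.0.0.1:5001/debug/pprof/goroutine?debug=2'

# 30-second CPU profile, to be inspected with `go tool pprof cpu.pprof`
curl -o cpu.pprof 'http://127.0.0.1:5001/debug/pprof/profile?seconds=30'
```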
Can you try running with libp2p/go-libp2p-connmgr#43?
@vyzo I'll give that a try and report back. Thanks!
Log: First tried disabling the Circuit Relay-related experimental options:

That didn't change anything. Then experimented a bit with the LowWater/HighWater values to see if it would ease off some of the load; it seems the connection manager isn't aggressive enough and cannot get down to LowWater because of all the new incoming connections.

We were one version behind go-ipfs (on 0.4.19 while the latest is 0.4.20), and the new version is apparently supposed to solve some performance issues. Deployed that version: no change. Tried manually running

As a last, counterintuitive effort to reduce CPU usage, I disabled the connection manager completely. Peer count jumped to around 90k (!!!) but it did indeed reduce the CPU usage. Memory usage is way higher now, but at least the node can respond to requests again.

Now that things are at least working (although with reduced performance), I will try to patch in the PR linked above and deploy that version. Once that is confirmed working (or not), we can enable the relay stuff again.
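For readers following along, the changes described in this log roughly map to go-ipfs config commands like the ones below (a sketch assuming a 0.4.x config layout; the water-mark values shown are the ones that appear later in this thread, and the daemon needs a restart after changing them):

```sh
# Turn off the Circuit Relay-related experimental options
ipfs config --json Swarm.EnableRelayHop false
ipfs config --json Swarm.EnableAutoRelay false

# Experiment with the connection manager water marks
ipfs config --json Swarm.ConnMgr.LowWater 2500
ipfs config --json Swarm.ConnMgr.HighWater 5000
ipfs config Swarm.ConnMgr.GracePeriod 20s

# Last, counterintuitive resort: disable the connection manager entirely
ipfs config Swarm.ConnMgr.Type none
```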
@victorb if your instance was acting as a relay node, it can take a while until the provider record is cleared from the network and peers stop attempting to connect to you for circuit relaying. As a way to short-circuit the process, could you try changing your public key to provoke a mismatch between the multiaddr and your actual identity, so that the connection is aborted? Alternatively, you could change your port. I'd like to figure out whether the sudden inflow of connections is triggered by your node acting as a relay.
Yeah, just did this: reset the PeerID to another one, as there was a bunch of incoming connections that I seemingly had no hope of stopping. Peer count and CPU load are now back to normal. Seems to be related to relay, as those were the options I turned off before resetting the Peer ID as well. From https://dashboard.open-registry.dev/d/7SmAxSzZz/general?orgId=1&refresh=1m&from=now-30m&to=now
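For anyone hitting the same wall, the two mitigations suggested above could look roughly like this (a sketch; the repo path and port are placeholders, and re-initializing into a fresh repo is just one way to obtain a new keypair):

```sh
# Option 1: new PeerID by initializing a fresh repo (placeholder path)
ipfs shutdown                              # stop the running daemon
IPFS_PATH=/var/lib/ipfs-fresh ipfs init    # generates a brand-new identity
IPFS_PATH=/var/lib/ipfs-fresh ipfs daemon

# Option 2: keep the identity but listen on a different swarm port (placeholder port)
ipfs config --json Addresses.Swarm '["/ip4/0.0.0.0/tcp/4002", "/ip6/::/tcp/4002"]'
```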
Most certainly. But I'm still a bit lost on why the sudden influx started ~20 hours ago; I activated the relay options about a week ago.
Thanks to https://github.com/open-services/open-registry/issues/39

License: MIT
Signed-off-by: Victor Bjelkholm <[email protected]>
Seems to be going into the same pattern again; currently connected to 20k peers and counting. Was working fine for a couple of hours, but then started getting more and more connections without the connection manager (seemingly) being able to keep up. https://dashboard.open-registry.dev/dashboard/snapshot/S41NOTTKSbudcGLRDmJd5S2nl7PM20DM Swarm config looks like this (everything relay-related disabled):
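A quick sanity check that the relay options really are off in the live config could look like this (a sketch; key names assume the 0.4.x Swarm section):

```sh
# Dump the config and look at the relay-related flags
ipfs config show | grep -i relay

# Connection manager settings can be read back individually
ipfs config Swarm.ConnMgr.Type
ipfs config Swarm.ConnMgr.HighWater
```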
Wasn't able to get a quick build of the PR linked above; I was less rushed since I thought I'd found another solution to the problem. But it seems not.
On the upside, CPU is not nearly as badly affected as before. Peer count seems to fluctuate less as well. But time will tell what happens until tomorrow; I need to rest.
I just deployed a new version of go-ipfs built with libp2p/go-libp2p-connmgr#43. Let's see how it goes. @vyzo I see you just force-pushed the branch, should I rebuild with the new changes?
I updated a small thing on the base branch (the PR is on top of another PR), which necessitated the rebase. It's a small thing, but it potentially saves allocations, so yeah, update.
Alright. Around 22:45 the initial PR code was deployed, and now, around 00:07, the newly pushed changes were deployed. Will let it run overnight and report back.
@vyzo seems to be running better. Peer count is kept between 1500 and 2000, CPU usage is much lower, memory is stable, and transferring data is now performing alright.
Excellent!
@vyzo things are much better now, but it seems the connection manager still struggles to keep up sometimes. This happened about an hour ago:
ConnMgr values in the config are currently:

```json
{
  "GracePeriod": "20s",
  "HighWater": 5000,
  "LowWater": 2500,
  "Type": "basic"
}
```

It seems that while the spike happened, memory was taken but not given back afterwards.
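Since the observation here is memory that was taken during the spike and never handed back, a heap profile from the same debug endpoint might show where it is sitting (a sketch; assumes the default API address):

```sh
# Heap profile; inspect with `go tool pprof heap.pprof`
curl -o heap.pprof 'http://127.0.0.1:5001/debug/pprof/heap'
```

Worth keeping in mind that the Go runtime returns freed memory to the OS lazily, so a high RSS after a spike is not by itself proof of a leak.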
Update after 3 days (graph is last 7 days):
@vyzo @raulk seems that go-ipfs is still not really working properly... There are now 40k peers connected, it's using 0.2 of the available CPU, and memory is growing past 10GB. Seems the connection manager still isn't disconnecting as many peers as it needs to, even after applying the patch linked by @vyzo above. Edit: CPU usage seems much better than before applying the patch, but go-ipfs is still basically taking over the server's resources, as the connection manager doesn't respect the configured values / cannot close enough peers.
This is weird. The only possible explanation is that the connection manager gets stuck, which is an issue @Stebalien has identified.
We are working on a fix with libp2p/go-libp2p-circuit#76
See ipfs/kubo#6237 -- can you try building with go-ipfs master? It has the relevant patches applied.
@vyzo Thanks a lot. Will do a deploy of go-ipfs master tomorrow morning and see if it improves the situation.
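For reference, building go-ipfs from master typically looks roughly like this (a sketch; assumes Go and make are installed):

```sh
git clone https://github.com/ipfs/go-ipfs
cd go-ipfs
make build          # produces the binary at cmd/ipfs/ipfs
# or: make install  # puts the ipfs binary into $GOPATH/bin
```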
Seemingly, in the last 24 hours, our deployed go-ipfs instance went from being connected to 1k peers to hovering around 15k. This is putting a lot of load on the server, and performance is suffering as a result.
go-ipfs is currently using all of the 8 CPUs to the max, while only doing around 1MB/s receive/transmit, probably due to the number of open connections.
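The peer count and throughput figures above can be read straight off the daemon with standard ipfs CLI commands:

```sh
# Number of currently connected peers
ipfs swarm peers | wc -l

# Current bandwidth totals and rates
ipfs stats bw
```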
ConnMgr (the Connection Manager) is set to the following config, although it seems not to be working (as we're connected to 15k peers!):

Dashboard graph:
https://dashboard.open-registry.dev/d/7SmAxSzZz/general?orgId=1&from=now-24h&to=now