Production go-ipfs going bananas #39

Open
victorb opened this issue May 16, 2019 · 23 comments
Labels
breaking-production Issues that needs to be fixed ASAP as they affect production negatively bug Something isn't working

Comments

@victorb
Member

victorb commented May 16, 2019

Seemingly, in the last 24 hours, our deployed go-ipfs instance went from being connected to 1k peers, to hovering around 15k. This is putting a lot of load on the server and performance is being affected by this.

go-ipfs is currently maxing out all 8 CPUs while only doing around 1MB/s receive/transmit, probably due to the number of open connections.

ConnMgr (The Connection Manager) is set to the following config, although it seems to not be working (as we're connected to 15k peers!)

"ConnMgr": {
      "GracePeriod": "20s",
      "HighWater": 2000,
      "LowWater": 1500,
      "Type": "basic"
    },
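
For readers unfamiliar with these settings, here is a toy model of how a watermark-based connection manager is expected to behave: once the connection count exceeds HighWater, it closes connections until LowWater is reached, leaving alone any connection younger than GracePeriod. This is a simplified sketch, not the go-libp2p-connmgr implementation; the `Conn` type, the `trim` function, and the "oldest first" closing order are assumptions made for illustration (the real manager ranks connections by its own criteria).

```python
from dataclasses import dataclass


@dataclass
class Conn:
    peer: str
    opened_at: float  # seconds since an arbitrary reference point


def trim(conns, now, high_water=2000, low_water=1500, grace_period=20.0):
    """Return the connections kept after one trim pass (toy model)."""
    if len(conns) <= high_water:
        return list(conns)  # at or below HighWater: nothing to do

    # Connections younger than the grace period are protected from trimming.
    protected = [c for c in conns if now - c.opened_at < grace_period]
    candidates = [c for c in conns if now - c.opened_at >= grace_period]

    # Close candidates (oldest first, an arbitrary choice for this sketch)
    # until LowWater is reached or no candidates are left.
    to_close = min(len(conns) - low_water, len(candidates))
    candidates.sort(key=lambda c: c.opened_at)
    return candidates[to_close:] + protected


# 2500 connections, all older than the grace period: trimmed down to LowWater.
conns = [Conn(f"peer{i}", opened_at=0.0) for i in range(2500)]
print(len(trim(conns, now=60.0)))  # 1500
```

Under this model a node configured as above should settle between 1500 and 2000 peers, which is why 15k connected peers suggests the manager is not keeping up.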

Dashboard Graph:
image
https://dashboard.open-registry.dev/d/7SmAxSzZz/general?orgId=1&from=now-24h&to=now

@victorb
Member Author

victorb commented May 16, 2019

We're one version behind the latest stable version of go-ipfs (we're on 0.4.19, latest is 0.4.20). Worth trying an upgrade to see if it helps, as it's supposed to include performance improvements.

@victorb
Member Author

victorb commented May 16, 2019

@victorb
Member Author

victorb commented May 16, 2019

@vyzo if you need any information trying to track down the issue (pprof dump or whatever), let me know!

@vyzo

vyzo commented May 16, 2019

Can you try running with libp2p/go-libp2p-connmgr#43 ?
It has fixed our connection manager woes in the test relay, except the duplicate connections issue which is being worked on.

@victorb
Member Author

victorb commented May 16, 2019

@vyzo I'll give that a try and report back. Thanks

@victorb victorb added breaking-production Issues that needs to be fixed ASAP as they affect production negatively bug Something isn't working labels May 16, 2019
@victorb
Member Author

victorb commented May 16, 2019

Log:

Tried first to disable Circuit Relay-related experimental options:

ipfs config --json Swarm.EnableAutoNATService false
ipfs config --json Swarm.EnableAutoRelay false
ipfs config --json Swarm.EnableRelayHop false

This made no difference.

Experimented a bit with the LowWater/HighWater values to see if it would ease-off a bit of load. Seems it's not aggressive enough and cannot reach the LowWater because of all the new incoming connections.
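
One way the "cannot reach LowWater" situation could arise: connections inside the grace period are not trimmed, so at a steady arrival rate there is a floor of roughly rate × GracePeriod connections that the manager cannot close. The sketch below just does that arithmetic. The arrival rates are hypothetical, not measurements from this node, and the thread does not confirm this as the cause.

```python
def untrimmable_floor(arrivals_per_second, grace_period_seconds):
    """Connections still inside the grace period at a steady arrival rate."""
    return arrivals_per_second * grace_period_seconds


def can_reach_low_water(arrivals_per_second, grace_period_seconds, low_water):
    """True if the protected connections alone fit under LowWater."""
    return untrimmable_floor(arrivals_per_second, grace_period_seconds) <= low_water


# Hypothetical rates, with GracePeriod=20s and LowWater=1500 as in the config.
for rate in (50, 750):
    floor = untrimmable_floor(rate, 20)
    print(rate, floor, can_reach_low_water(rate, 20, 1500))
```

With these made-up numbers, 50 new connections per second leaves a floor of 1000 (below LowWater), while 750 per second leaves a floor of 15000, far above anything the watermarks can enforce.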

We were one version behind go-ipfs (was on 0.4.19 while latest is 0.4.20), and the new version apparently should solve some performance issues. Deployed that version, no changes.

Tried manually running ipfs swarm peers | xargs ipfs swarm disconnect to disconnect all peers. The count went down to 4k (it seems it couldn't disconnect all of them, although ipfs swarm disconnect reported no errors), but after 1-2 minutes it jumped back up to 15k.
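
When debugging a peer count like this, it can help to check how many of the connections belong to distinct peers. The sketch below summarizes text in the shape that ipfs swarm peers prints, assuming one multiaddr per line with the peer ID as the last path component. The sample addresses and peer IDs are made up for illustration.

```python
from collections import Counter


def peer_id(multiaddr):
    """Return the last path component, assumed to hold the peer ID."""
    return multiaddr.rstrip("/").rsplit("/", 1)[-1]


def summarize(swarm_peers_output):
    """Return (connections, distinct peers, peers with more than one connection)."""
    addrs = [line.strip() for line in swarm_peers_output.splitlines() if line.strip()]
    per_peer = Counter(peer_id(a) for a in addrs)
    duplicates = {p: n for p, n in per_peer.items() if n > 1}
    return len(addrs), len(per_peer), duplicates


# Hypothetical sample: QmPeerA is connected twice.
sample = """\
/ip4/192.0.2.10/tcp/4001/ipfs/QmPeerA
/ip4/192.0.2.11/tcp/4001/ipfs/QmPeerB
/ip4/192.0.2.10/tcp/4002/ipfs/QmPeerA
"""
total, distinct, dupes = summarize(sample)
print(total, distinct, dupes)  # 3 2 {'QmPeerA': 2}
```

A large gap between the two counts would point at duplicate connections rather than at many distinct peers.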

As a last, counterintuitive effort to reduce CPU usage, I disabled the connection manager completely. Peer count jumped to around 90k (!!!) but it did indeed reduce the CPU usage. Memory usage is way higher now, but at least the node can respond to requests again.

Now that things are at least working (although with reduced performance), I will try to patch in the PR linked above and deploy that version.

Once that is confirmed working (or not), we can re-enable the relay options.

@raulk

raulk commented May 16, 2019

@victorb if your instance was acting as a relay node, it can take a while until the provider record is cleared from the network and peers stop attempting to connect to you for circuit relaying.

As a way to short-circuit the process, could you try to change your public key to provoke a mismatch between the multiaddr and your actual identity, so that the connection is aborted? Alternatively you could change your port.
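
To illustrate why changing the key would cut off these dials: a peer that learned the old address expects the peer ID embedded in it, and the connection is rejected when the remote proves a different identity. The following is a toy model of that check only, not libp2p's actual handshake; the `dial` function and the peer IDs are hypothetical.

```python
def expected_peer_id(multiaddr):
    """Peer ID the dialer expects, taken from the end of the multiaddr."""
    return multiaddr.rstrip("/").rsplit("/", 1)[-1]


def dial(multiaddr, remote_peer_id):
    """Toy handshake: succeed only if the remote has the expected peer ID."""
    expected = expected_peer_id(multiaddr)
    if remote_peer_id != expected:
        raise ConnectionError(
            f"peer id mismatch: expected {expected}, got {remote_peer_id}"
        )
    return "connected"


stale_addr = "/ip4/192.0.2.10/tcp/4001/ipfs/QmOldIdentity"
print(dial(stale_addr, "QmOldIdentity"))  # connected
try:
    dial(stale_addr, "QmNewIdentity")  # node now runs with a new key
except ConnectionError as err:
    print(err)
```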

I'd like to figure out whether the sudden inflow of connections is triggered by your node acting as a relay.

@victorb
Member Author

victorb commented May 16, 2019

> As a way to short-circuit the process, could you try to change your public key to provoke a mismatch between the multiaddr and your actual identity, so that the connection is aborted? Alternatively you could change your port.

Yeah, just did this: reset the PeerID to a new one, as there were a bunch of incoming connections that I seemingly had no hope of stopping.

Peer count and CPU load are now back to normal. It seems to be related to relay, as those were the options I turned off before resetting the Peer ID.

image

From https://dashboard.open-registry.dev/d/7SmAxSzZz/general?orgId=1&refresh=1m&from=now-30m&to=now

@victorb
Member Author

victorb commented May 16, 2019

@raulk

> I'd like to figure out whether the sudden inflow of connections is triggered by your node acting as a relay.

Most certainly. But I'm still a bit lost on why the sudden influx started ~20 hours ago. I activated the Relay options about a week ago.

victorb referenced this issue in open-services/bolivar May 16, 2019
@victorb victorb removed the breaking-production Issues that needs to be fixed ASAP as they affect production negatively label May 16, 2019
@victorb
Member Author

victorb commented May 16, 2019

It seems to be falling into the same pattern again: currently connected to 20k peers and counting. It was working fine for a couple of hours, but then started getting more and more connections without the connection manager being able to keep up (seemingly).

image

https://dashboard.open-registry.dev/dashboard/snapshot/S41NOTTKSbudcGLRDmJd5S2nl7PM20DM

Swarm config looks like this (everything relay disabled):

"Swarm": {
    "AddrFilters": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.0.0/ipcidr/29",
      "/ip4/192.0.0.8/ipcidr/32",
      "/ip4/192.0.0.170/ipcidr/32",
      "/ip4/192.0.0.171/ipcidr/32",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4"
    ],
    "ConnMgr": {
      "GracePeriod": "20s",
      "HighWater": 2000,
      "LowWater": 1500,
      "Type": "basic"
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": true,
    "DisableRelay": false,
    "EnableAutoNATService": false,
    "EnableAutoRelay": false,
    "EnableRelayHop": false
  }
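
A small script can double-check a Swarm section like this one. The sketch below reports which of the options disabled earlier in the thread are still on, whether the relay transport itself is still enabled (my understanding is that Swarm.DisableRelay controls the p2p-circuit transport in go-ipfs 0.4.x; treat that as an assumption), and whether the watermarks are ordered sensibly. The `review_swarm` helper and its messages are hypothetical.

```python
import json

# Options that were switched off earlier in the thread.
SERVICE_FLAGS = ("EnableAutoNATService", "EnableAutoRelay", "EnableRelayHop")


def review_swarm(swarm):
    """Return a list of human-readable findings for a Swarm config section."""
    findings = []
    for flag in SERVICE_FLAGS:
        if swarm.get(flag):
            findings.append(f"{flag} is enabled")
    if not swarm.get("DisableRelay", False):
        findings.append("DisableRelay is false: relay transport still enabled")
    mgr = swarm.get("ConnMgr", {})
    if mgr.get("LowWater", 0) >= mgr.get("HighWater", 0):
        findings.append("LowWater should be below HighWater")
    return findings


# Excerpt of the config shown above (AddrFilters omitted for brevity).
config = json.loads("""
{
  "ConnMgr": {"GracePeriod": "20s", "HighWater": 2000, "LowWater": 1500, "Type": "basic"},
  "DisableRelay": false,
  "EnableAutoNATService": false,
  "EnableAutoRelay": false,
  "EnableRelayHop": false
}
""")
for finding in review_swarm(config):
    print(finding)
```

For the config above, the only finding is the DisableRelay one, so the node no longer offers relay services but can apparently still use the relay transport.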

Wasn't able to get a quick build of the PR linked above; I was in less of a rush since I thought I had found another solution to the problem. Seems not, though.

@victorb victorb added the breaking-production Issues that needs to be fixed ASAP as they affect production negatively label May 16, 2019
@victorb
Member Author

victorb commented May 16, 2019

On the upside, CPU is not nearly as badly affected as previously. Peer count seems to fluctuate less as well. But time will tell what happens by tomorrow; I need to rest.

@victorb
Member Author

victorb commented May 17, 2019

I just deployed a new version of go-ipfs built with libp2p/go-libp2p-connmgr#43. Let's see how it goes.

@vyzo I see you just force-pushed the branch, should I rebuild with the new changes?

@vyzo

vyzo commented May 17, 2019

I updated a small thing on the base branch (the pr is on top of another pr), that necessitated the rebase. It's a small thing, but it potentially saves allocations, so yeah, update.

@victorb
Member Author

victorb commented May 17, 2019

Alright. Around 22:45, the initial PR code was deployed. Now, around 00:07 the newly pushed changes were deployed.

Will let it run overnight and report back.

@victorb
Member Author

victorb commented May 18, 2019

@vyzo seems to be running better

image

Peer count is kept between 1500 and 2000, CPU usage is much lower, memory is stable and data transfer is now performing alright.

@vyzo

vyzo commented May 18, 2019

Excellent!

@victorb victorb removed the breaking-production Issues that needs to be fixed ASAP as they affect production negatively label May 19, 2019
@victorb
Member Author

victorb commented May 19, 2019

@vyzo things are much better now, but seems the connection manager still struggles to keep up sometimes. This happened about an hour ago:

image
https://dashboard.open-registry.dev/d/7SmAxSzZz/general?orgId=1&from=1558240837670&to=1558262437670

ConnMgr values in the config are currently:

{
    "GracePeriod": "20s",
    "HighWater": 5000,
    "LowWater": 2500,
    "Type": "basic"
}

It seems that memory was allocated during the spike but not released afterwards.

@victorb
Member Author

victorb commented May 22, 2019

Update after 3 days (graph is last 7 days):

image
https://dashboard.open-registry.dev/d/7SmAxSzZz/general?orgId=1&from=1557920925634&to=1558523884539

@vyzo @raulk seems that go-ipfs is still not really working properly... There are now 40k peers connected, using 0.2 of the available CPU, and memory is growing past 10GB.

Seems the connection manager still isn't disconnecting as many peers as it needs to, even after applying the patch linked by @vyzo above.

Edit: CPU usage seems much better than before applying the patch above, but go-ipfs is still basically taking over the server's resources, as the connection manager doesn't respect the configured values or cannot close enough peers.

@victorb victorb added the breaking-production Issues that needs to be fixed ASAP as they affect production negatively label May 22, 2019
@vyzo

vyzo commented May 22, 2019

this is weird. the only possible explanation is that the connection manager gets stuck, which is an issue @Stebalien has identified.

@vyzo

vyzo commented May 22, 2019

We are working on a fix with libp2p/go-libp2p-circuit#76

@vyzo

vyzo commented May 22, 2019

see ipfs/kubo#6237 -- can you try building with go-ipfs master? It has the relevant patches applied.

@victorb
Member Author

victorb commented May 22, 2019

@vyzo Thanks a lot. Will do a deploy of go-ipfs master tomorrow morning and see if it improves the situation.

@victorb
Member Author

victorb commented May 24, 2019

Now ipfs/go-ipfs:v0.4.21-rc3 has been deployed. Let's see how it holds up.
