
feat: Improve ResourceManager UX #9338

Merged: 14 commits merged into master on Nov 10, 2022

Conversation

@ajnavarro (Member) commented Oct 7, 2022

This PR adds several new features to make ResourceManager easier to use:

  • Resource manager log messages emitted when resources are exceeded are now at ERROR level instead of WARN.
  • The resources-exceeded error now shows what kind of limit was reached and in which scope.
  • When limits stop being exceeded, we print a message telling the user that limits are no longer being exceeded.
  • Added a swarm limit all command to show all configured limits in the same format as swarm stats all.
  • Added a min-used-limit-perc option to swarm stats all to show only stats that are above a given percentage of their limit (example invocations right after this list).
  • Simplified many default values.
  • Enabled ResourceManager by default.
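As a quick reference, the new commands can be exercised roughly as follows. This is a minimal sketch: the command and option names come from the list above, but the exact placement of the min-used-limit-perc flag is my assumption, so check ipfs swarm stats --help for the authoritative syntax.

# List all configured limits, in the same format as ipfs swarm stats all
ipfs swarm limit all

# Show current resource usage across all scopes
ipfs swarm stats all

# Show only the scopes whose usage is above 50% of their limit
# (option name taken from this PR; exact placement is an assumption)
ipfs swarm stats all --min-used-limit-perc=50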

Example log output:

2022-11-09T10:51:40.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:51:50.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 483095 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:51:50.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:52:00.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 455294 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:00.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:52:10.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 471384 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:10.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:52:20.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 8 times with error "peer:12D3KooWKqcaBtcmZKLKCCoDPBuA6AXGJMNrLQUPPMsA5Q6D1eG6: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:20.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 192 times with error "peer:12D3KooWPjetWPGQUih9LZTGHdyAM9fKaXtUxDyBhA93E3JAWCXj: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:20.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 469746 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:20.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:52:30.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 484137 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:30.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 29 times with error "peer:12D3KooWPjetWPGQUih9LZTGHdyAM9fKaXtUxDyBhA93E3JAWCXj: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:30.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:52:40.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 468843 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:40.566+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:52:50.566+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 366638 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:50.566+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:53:00.566+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 405526 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:53:00.566+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 107 times with error "peer:12D3KooWQZQCwevTDGhkE9iGYk5sBzWRDUSX68oyrcfM9tXyrs2Q: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:53:00.566+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:53:10.566+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 336923 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:53:10.566+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:53:20.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:55      Resource limits were exceeded 71 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:53:20.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:59      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:53:30.565+0100    ERROR   resourcemanager libp2p/rcmgr_logging.go:64      Resource limits are no longer being exceeded.

Validation tests

  • The accelerated DHT client runs with no errors when ResourceManager is active. No problems were observed.
  • Ran an attack with 200 connections and 1M streams using the yamux protocol. The node remained usable during the attack. With ResourceManager deactivated, the node was killed by the OS because of the amount of memory it consumed.
    • Actions performed while the attack was active (plausible command equivalents are sketched right after this list):
      • Add files
      • Force a reprovide
      • Use the gateway to resolve an IPNS address.
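The PR does not list the exact commands used for those actions; plausible equivalents for a default local node (the reprovide subcommand and the gateway port 8080 are assumptions) would be:

# Add a file while the attack is running
ipfs add some-file.bin

# Force a reprovide of local content
ipfs bitswap reprovide

# Resolve an IPNS name through the local gateway (name is a placeholder)
curl http://127.0.0.1:8080/ipns/<ipns-name>/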

Closes #9001
Closes #9351
Closes #9322

@ajnavarro changed the title from "Improve ErrorManager UX" to "feat: Improve ErrorManager UX" on Oct 7, 2022
@ajnavarro changed the title from "feat: Improve ErrorManager UX" to "feat: Improve ResourceManager UX" on Oct 10, 2022
@ajnavarro marked this pull request as ready for review on October 11, 2022
(Review comments on core/node/libp2p/rcmgr_logging.go — resolved.)
@ajnavarro force-pushed the feat/improve-resource-manager-ux branch from d5e3765 to e362445 on October 19, 2022
@ajnavarro requested a review from @BigLep on October 19, 2022
@ajnavarro force-pushed the feat/improve-resource-manager-ux branch from e362445 to ad81a06 on October 19, 2022
@BigLep (Contributor) left a comment

It looks good to me from a log-message standpoint. I'll let @guseggert give the approval from a code standpoint.

@BigLep mentioned this pull request on Nov 8, 2022
@ajnavarro force-pushed the feat/improve-resource-manager-ux branch 4 times, most recently from 082af2c to ab2187c on November 9, 2022
@BigLep (Contributor) commented Nov 9, 2022

@ajnavarro and I had a verbal discussion on 2022-11-09. I'm now going to work on adding some comments and improvements to rcmgr_defaults.go.

@BigLep (Contributor) left a comment

We also need to update https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr

  1. Remove the "experimental" notes.
  2. Link to the updated resource manager location (now that the resource manager is in the go-libp2p monorepo).
  3. Discuss being able to set incremental limits. Although if someone is in expert mode and specifying all their limits, we won't be doing scaling based on resources; they need to do that on their own. We should call this out. (A sketch of what an incremental override could look like for a user follows this list.)
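To illustrate point 3, an incremental override by a user might look like the sketch below. It is only a sketch: the Swarm.ResourceMgr.Limits key path is an assumption on my part, and docs/config.md remains the authoritative reference for the real key names.

# Keep the computed defaults, but raise a single limit in the System scope
ipfs config --json Swarm.ResourceMgr.Limits.System.ConnsInbound 512

# Inspect the limits the node actually ends up enforcing
ipfs swarm limit all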

We also need a changelog update, but I'm fine if that's a separate PR.

I would like to review again once feedback is incorporated. I'm also good to connect verbally so we can close this out today.

(Review comments on core/node/libp2p/rcmgr_defaults.go — resolved.)
@BigLep (Contributor) left a comment

This is great, @ajnavarro! Thanks for your persistence here!

The default limits look good to me, and my approval is based on them.

I assume you'll incorporate the relevant bit of my feedback below, but don't block on me needing to see another round. If you address the comments, you should ship from my perspective.

Please make sure from a code perspective that you got a signoff from someone like @guseggert. I see he reviewed and I know you two spoke verbally, but I don't know if there's anything else he wants to see before approving.

The other thing I'd love to see is PR comments on what testing you have done. I'm hoping this passes when you use the accelerated DHT client.

I think it would also be good to specify the "attack script" configuration you used. (We obviously won't paste in the attack script itself.) It would be great to show that even with lots of peers, the node doesn't fall over.

For example, a comment like this would be great:

With default resource manager configuration, the node spun up and built a routing table when the accelerated client was used.  There were no errors in the logs.

These "attacks" were tried against the default configuration but the node stayed responsive.

I ran:
./attack-script.sh --numPeers=5 --numConnectionsPeer=100 --numStreamsPerConnection=100
./attack-script.sh --numPeers=100 --numConnectionsPeer=100 --numStreamsPerConnection=100

In both cases, I was still able to do the following while the node was "under attack":

ipfs add file
ipfs get file
curl URL_FOR_GATEWAY_RETRIEVAL

The above is just an idea. You don't have to follow it exactly.

(Review comments on core/node/libp2p/rcmgr.go, core/node/libp2p/rcmgr_defaults.go, core/node/libp2p/rcmgr_logging.go, docs/config.md, and test/sharness/t0139-swarm-rcmgr.sh — resolved.)
@guseggert (Contributor) left a comment

The comments I left are not blockers; feel free to incorporate them in this PR or in a subsequent one. We don't need to block the RC release train for this. Other than Steve's comments, LGTM!

@ajnavarro force-pushed the feat/improve-resource-manager-ux branch from 03614f4 to f9b7a79 on November 10, 2022
@ajnavarro merged commit 254d81a into master on Nov 10, 2022
@ajnavarro deleted the feat/improve-resource-manager-ux branch on November 10, 2022
@BigLep (Contributor) commented Nov 10, 2022

@ajnavarro: thanks for adding the "Validation tests" section to the PR. The dimension I think we need to call out is "number of peers". I assume right now your validation test had a small number of peers. That is good/fine, since it ensures we are protected from unintentional DoS from misbehaving nodes, but a malicious attacker will just increase the number of peer IDs it uses in order to hit the system limits. I want to make sure that the system-scope limits in place still keep the node responsive.

In the comment above, I gave an example of what I think we want to see:

./attack-script.sh --numPeers=5 --numConnectionsPeer=100 --numStreamsPerConnection=100
./attack-script.sh --numPeers=100 --numConnectionsPeer=100 --numStreamsPerConnection=100

I'm not saying those are the right numbers, but we want to do something like this, where the goal is to test the "system scope" limits rather than just the "peer scope" limits. Thanks!

@ajnavarro (Member, Author) commented Nov 11, 2022

I modified the attack script to be able to create several hosts at the same time.

I did a couple more tests:

  • 100 hosts, 100 conns per host, 100000 streams per host
  • 1000 hosts, 10 conns per host, 10000 streams per host

My computer cannot keep up with 1000 hosts simultaneously (only ~500 at any point in time).

Kubo and the attackers were running on the same PC.

With this number of hosts, we are hitting the system and transient scopes:

2022-11-11T13:32:31.918+0100	ERROR	resourcemanager	libp2p/rcmgr_logging.go:53	Resource limits were exceeded 201763 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-11T13:32:31.918+0100	ERROR	resourcemanager	libp2p/rcmgr_logging.go:53	Resource limits were exceeded 274 times with error "system: cannot reserve inbound connection: resource limit exceeded".

The Kubo node is still able to process requests and perform operations, but you can feel the choppiness.

[Screenshot: Kubo WebUI at 127.0.0.1:5001, taken 2022-11-11 13:22]

@BigLep (Contributor) commented Nov 11, 2022

@ajnavarro: thanks for the update; this is useful/great. A couple of things:

  1. Were your attacker node and your Kubo node on the same physical host, or separate? (Let's describe the setup a little more.)
  2. Please upstream your attack-script change if possible, because others should ideally be testing with a large number of peers (not just large numbers of connectionsPerPeer and streamsPerConnection).

Thanks!

@ajnavarro (Member, Author) commented:

@BigLep Kubo and the attackers were running on the same PC. I updated the comment.

I'll upstream my changes.

@BigLep (Contributor) commented Nov 11, 2022

Thanks @ajnavarro. I'm trying to think through the ramifications of the attack node and the Kubo node being on the same host. They're inevitably competing with each other for resources, and I wonder how much that affects the validity of the test. I'm not sure how to reason about it. I think it would be cleanest if we had two separate host setups (e.g., run the test on a cloud provider).

@ajnavarro (Member, Author) commented:

@BigLep even if they are competing for resources, the tests are clear: in all cases, if the resource manager is off, the node is killed by the OS due to OOM. When the resource manager is on, the node works as described before.

The objective of these manual tests was not to cover all corner cases (there was no time for that; we wanted to have it in 0.17), but to check that ResourceManager was doing something and was effectively protecting the node.

We need more in-depth tests, maybe using our CI, or maybe using Testground, to check several attack types.
