feat: Improve ResourceManager UX #9338
Conversation
It looks good to me from a log-message perspective. I'll let @guseggert give the approval from a code perspective.
@ajnavarro and I had a verbal discussion on 2022-11-09. I'm now going to work on adding some comments and improvements to rcmgr_defaults.go.
We also need to update https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
- Remove "experimental" notes
- Link to updated resource manager location (now that the resource manager is in the go-libp2p monorepo)
- Discuss being able to set incremental limits. Although I guess if someone is in expert mode and specifies all their limits, we won't be doing scaling based on available resources; they need to handle that on their own. We should call this out (a config sketch follows below).
We also need a changelog update, but I'm fine if that's a separate PR.
I would like to review again once feedback is incorporated. I'm also happy to connect verbally to close this out today.
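For illustration, here is a minimal sketch of the expert-mode tradeoff worth calling out, assuming the 0.17-era `Swarm.ResourceMgr.Limits` config shape (the key path is an assumption and should be verified against docs/config.md before copying):

```console
# Hypothetical example: once a user overrides a limit explicitly, Kubo
# stops scaling that limit based on available system resources, so the
# value must be maintained by hand. Key path assumed from the 0.17-era
# Swarm.ResourceMgr.Limits shape; verify against docs/config.md.
$ ipfs config --json Swarm.ResourceMgr.Limits.System.ConnsInbound 2048

# Restart the daemon for the change to take effect.
$ ipfs daemon
```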
This is great, @ajnavarro! Thanks for your persistence here!
The default limits look good to me, and my approval is based on them.
I assume you'll incorporate the relevant bits of my feedback below, but don't block on me needing to see another round. If you address the comments, you are good to ship from my perspective.
Please make sure from a code perspective that you get a signoff from someone like @guseggert. I see he reviewed, and I know you two spoke verbally, but I don't know if there's anything else he wants to see before approving.
The other thing I'd love to see is PR comments on what testing you have done. I'm hoping this passes when you use the accelerated DHT client.
I think it would also be good to specify the "attack script" configuration you used. (We obviously won't paste in the attack script itself.) It would be great to show that even with lots of peers, the node doesn't fall over.
For example, a comment like this would be great:
With default resource manager configuration, the node spun up and built a routing table when the accelerated client was used. There were no errors in the logs.
These "attacks" were tried against the default configuration but the node stayed responsive.
I ran:
```
./attack-script.sh --numPeers=5 --numConnectionsPeer=100 --numStreamsPerConnection=100
./attack-script.sh --numPeers=100 --numConnectionsPeer=100 --numStreamsPerConnection=100
```
In both cases, I was still able to do the following while the node was "under attack":
```
ipfs add file
ipfs get file
curl URL_FOR_GATEWAY_RETRIEVAL
```
The above is just an idea. You don't have to follow it exactly.
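As one concrete starting point for the accelerated-DHT validation requested above, the client can be enabled via config before running the attacks (0.17-era key; newer Kubo versions moved it under `Routing`, so treat the exact path as an assumption):

```console
# Enable the accelerated DHT client (Experimental.* key per the 0.17 era;
# verify the exact location for your Kubo version).
$ ipfs config --json Experimental.AcceleratedDHTClient true
$ ipfs daemon

# While under attack, confirm the routing table keeps building and the
# logs stay free of resource manager errors.
$ ipfs stats dht
```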
The comments I left are not blockers; feel free to incorporate them in this PR or in a subsequent PR. We don't need to block the RC release train for this. Other than Steve's comments, LGTM!
This PR adds several new functionalities to make the ResourceManager easier to use:
- Resource manager logs emitted when resources are exceeded are now at ERROR level instead of WARNING.
- The resources-exceeded error now shows what kind of limit was reached.
- When limits are no longer exceeded, we print a message telling the user that limits are back to normal.
- Added `swarm limit all` command to show all set limits with the same format as `swarm stats all`.
- Added `min-used-limit-perc` option to `swarm stats all` to only show stats that are above a specific percentage.
Signed-off-by: Antonio Navarro Perez <[email protected]>
Signed-off-by: Antonio Navarro <[email protected]>
@ajnavarro: thanks for adding the "Validation tests" section to the PR. The dimension I think we need to call out is "number of peers". I assume right now your validation test had a small number of peers. That is good/fine, since it ensures we are protected from unintentional DoS by misbehaving nodes. A malicious attacker will just increase the number of peer IDs it uses in order to hit system limits. I want to make sure that the system-scope limits in place still keep the node responsive. In the comment above, I gave an example of what I think we want to see:
I'm not saying those are the right numbers, but we want to do something like this, where the goal is to test the "system scope" limits rather than just the "peer scope" limits. Thanks!
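For reference, a sketch of how one might watch the system scope while such a test runs, using the commands this PR touches (the per-scope arguments are assumed; check `ipfs swarm limit --help`):

```console
# The system scope is the real backstop: per-peer limits only cap a
# single peer ID, and an attacker can rotate peer IDs at will.
$ ipfs swarm limit system    # the limits the attacker must exhaust
$ ipfs swarm stats system    # current usage against those limits

# Flag added by this PR: show only stats at >=90% of their limit
# (flag spelling assumed from the PR description).
$ ipfs swarm stats all --min-used-limit-perc=90
```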
I modified the attack script to be able to create several hosts at the same time. I did a couple more tests:
My computer cannot keep up with 1000 hosts simultaneously (only ~500 alive at any point in time). Kubo and the attackers were running on the same PC. With this number of hosts, we are hitting the system and transient scopes:
The Kubo node is still able to process and do operations, but you can feel the choppiness.
@ajnavarro: thanks for the update - this is useful/great. A couple of things:
Thanks!
@BigLep Kubo and the attackers were running on the same PC. I updated the comment. I'll upstream my changes.
Thanks @ajnavarro. I'm trying to think through the ramifications of the attack node and the Kubo node being on the same host. They're inevitably competing with each other for resources, and I wonder how much that affects the validity of the test. I'm not sure how to reason about it. I think it would be cleanest if we had two separate host setups (e.g., run the test on a cloud provider).
@BigLep even if they are competing for resources, the tests are clear: in all cases, if the resource manager is off, the node is killed by the OS due to OOM. When the resource manager is on, the node works as described before. The objective of these manual tests was not to cover all corner cases (there was no time for that; we wanted to have this in 0.17) but to check that ResourceManager was doing something and was effectively protecting the node. We need more in-depth tests, maybe using our CI, or maybe using testground to check several attack types.
This PR adds several new functionalities to make the ResourceManager easier to use:
- Resource manager logs emitted when resources are exceeded are now at ERROR level instead of WARNING.
- The resources-exceeded error now shows what kind of limit was reached.
- When limits are no longer exceeded, we print a message telling the user that limits are back to normal.
- Added `swarm limit all` command to show all set limits with the same format as `swarm stats all`.
- Added `min-used-limit-perc` option to `swarm stats all` to only show stats that are above a specific percentage.

Output example:
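(The output attached to the PR is not reproduced here; assuming the flag spelling from the description above, the new commands are invoked like this:)

```console
# Show every configured limit, mirroring the `swarm stats all` format.
$ ipfs swarm limit all

# Show only the stats that have used at least 90% of their limit.
$ ipfs swarm stats all --min-used-limit-perc=90
```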
Validation tests
It closes #9001
It closes #9351
It closes #9322