Update index.md #940

Merged
merged 2 commits into main from Clay-Mysten-patch-10 on Mar 18, 2022
Conversation

Clay-Mysten (Contributor)

Fix error in About Sui link
Point out Source Code link for direct contributions
@AllyMedina1 (Contributor) left a comment:


generate* a pull request

@Clay-Mysten merged commit 2c3156d into main on Mar 18, 2022
@Clay-Mysten deleted the Clay-Mysten-patch-10 branch on March 18, 2022 at 23:07
mwtian pushed a commit that referenced this pull request Sep 12, 2022
* Do not block workers during unreliable_send

`UnreliableNetwork::unreliable_send` invokes a blocking RPC on the remote peer to send the message. To make the RPC call, `unreliable_send` spawns a new task on the bounded executor, so it is normally not blocked while the remote peer processes the RPC.

However, if the remote peer is slow or unresponsive, the bounded executor can fill up. In this case `UnreliableNetwork::unreliable_send` blocks the caller until the executor has permits available.

This leads to a situation where a single slow or unresponsive peer can block an entire critical task by not replying to the send-message RPC, if too many such calls are in flight.
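
To make the failure mode concrete, here is a minimal sketch of a bounded executor whose `spawn` waits for a semaphore permit. The names and structure are illustrative (a plausible tokio-based design), not the actual narwhal `BoundedExecutor` API:

```rust
use std::future::Future;
use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::task::JoinHandle;

// Illustrative bounded executor: capacity is modeled as semaphore permits.
pub struct BoundedExecutor {
    semaphore: Arc<Semaphore>,
}

impl BoundedExecutor {
    pub fn new(capacity: usize) -> Self {
        Self {
            semaphore: Arc::new(Semaphore::new(capacity)),
        }
    }

    // Waits for a permit before spawning. If every permit is held by an
    // in-flight RPC to a slow peer, this await blocks the caller, which is
    // exactly the failure mode described above.
    pub async fn spawn<F>(&self, fut: F) -> JoinHandle<F::Output>
    where
        F: Future + Send + 'static,
        F::Output: Send + 'static,
    {
        let permit = self.semaphore.clone().acquire_owned().await.unwrap();
        tokio::spawn(async move {
            let out = fut.await;
            drop(permit); // capacity is freed only once the RPC completes
            out
        })
    }
}
```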

We [see](http://grafana.shared.internal.sui.io:3000/explore?left=%7B%22datasource%22:%22zIREmpMVz%22,%22queries%22:%5B%7B%22expr%22:%22%7B__name__%3D~%5C%22tx_.%2A%5C%22,job%3D%5C%22mh1-suival-2%5C%22%7D%3E200%22,%22refId%22:%22A%22,%22interval%22:%22%22,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D,%7B%22refId%22:%22B%22,%22editorMode%22:%22code%22,%22expr%22:%22primary_network_available_tasks%7Bjob%3D%5C%22mh1-suival-2%5C%22%7D%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%221662553513874%22,%22to%22:%221662586403070%22%7D%7D&orgId=1) this happening in the stuck cluster: the single slow peer `ty6-suival-1` completely blocks the `helper` task on `mh1-suival-2`, so this validator cannot reply to requests from other (healthy) peers.

The solution is to use `try_spawn`: if the bounded executor for the worker is full, we simply drop the message.
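
Under the same illustrative assumptions, extending the sketch above, the fix amounts to a `try_spawn` that fails fast instead of waiting, so the caller can drop the message:

```rust
use tokio::sync::TryAcquireError;

impl BoundedExecutor {
    // Fails immediately when no permit is available, so a slow peer can no
    // longer block the caller; the caller drops the message instead.
    pub fn try_spawn<F>(&self, fut: F) -> Result<JoinHandle<F::Output>, TryAcquireError>
    where
        F: Future + Send + 'static,
        F::Output: Send + 'static,
    {
        let permit = self.semaphore.clone().try_acquire_owned()?;
        Ok(tokio::spawn(async move {
            let out = fut.await;
            drop(permit);
            out
        }))
    }
}
```

With this shape, a full executor surfaces as an immediate error at the send site, which fits unreliable-delivery semantics: dropping a message is acceptable, blocking a critical task is not.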

It is hard to predict whether this particular fix will completely solve the issue we are seeing, but the behavior seems more in line with how a BFT system should handle unresponsive peers.

Some implementation details:

Ideally we would simply remove the return value from `UnreliableNetwork::unreliable_send`, but it is currently used by some unit tests and by reconfigure.rs. We can figure out later how to fix this.

* fix lint

* fix: make the UnreliableNetwork not return join handles, switch reconfig messages Primary -> Worker to the Reliable network

* feat: add a test to check that implementations of unreliable send don't block (see the sketch below)

Co-authored-by: Andrey Chursin <[email protected]>
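
A hypothetical shape for that test, built on the sketch above rather than taken from the actual commit: saturate the executor, then assert that a further send attempt returns immediately instead of blocking.

```rust
#[tokio::test]
async fn unreliable_send_does_not_block() {
    let executor = BoundedExecutor::new(1);
    // Occupy the only permit with a task that never completes.
    let _busy = executor
        .try_spawn(std::future::pending::<()>())
        .expect("first spawn should succeed");
    // A second attempt must fail fast instead of blocking the caller.
    assert!(executor.try_spawn(async {}).is_err());
}
```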
mwtian pushed a commit to mwtian/sui that referenced this pull request Sep 29, 2022
When converting to anemo, we accidentally reverted MystenLabs#940, which changed unreliable_send to not block when the executor limit had been reached. This PR addresses that mistake by re-introducing the fix.