-
Notifications
You must be signed in to change notification settings - Fork 11.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update index.md #940
Merged
Merged
Update index.md #940
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Fix error in About Sui link
Point out Source Code link for direct contributions
AllyMedina1
reviewed
Mar 18, 2022
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generate* a pull request
AllyMedina1
approved these changes
Mar 18, 2022
mwtian
pushed a commit
that referenced
this pull request
Sep 12, 2022
* Do not block workers during unreliable_send `UnreliableNetwork::unreliable_send` invokes blocking RPC on remote peer in order to send the message. In order to make the RPC call `unreliable_send` spawns a new task on the bounded executor, so that `unreliable_send` normally is not blocked on remote peer processing the RPC. However, if remote peer is slow or unresponsive, the bounded executor can get full. In this case `UnreliableNetwork::unreliable_send` will block the caller until executor has permits available. This leads to a situation where single slow/unresponsive peer can block entire critical task by not replying on send message RPC, if too many of them are in-flight. We [see](http://grafana.shared.internal.sui.io:3000/explore?left=%7B%22datasource%22:%22zIREmpMVz%22,%22queries%22:%5B%7B%22expr%22:%22%7B__name__%3D~%5C%22tx_.%2A%5C%22,job%3D%5C%22mh1-suival-2%5C%22%7D%3E200%22,%22refId%22:%22A%22,%22interval%22:%22%22,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D,%7B%22refId%22:%22B%22,%22editorMode%22:%22code%22,%22expr%22:%22primary_network_available_tasks%7Bjob%3D%5C%22mh1-suival-2%5C%22%7D%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%221662553513874%22,%22to%22:%221662586403070%22%7D%7D&orgId=1) this happening in the stuck cluster. The single slow peer `ty6-suival-1` here completely blocks `helper` task on `mh1-suival-2`, so that this validator can not reply to requests from other (healthy) peer. The solution is to use `try_spawn` and if the bounded executor for worker is full we simply drop the message. It is hard to predict whether this particular fix will completely solve the issue we are seeing, but the behavior seem to be more in line with how BFT system should handle unresponsive peers. Some implementation details: Ideally we want to just remove return value from `UnreliableNetwork::unreliable_send`, but it is currently used by some unit tests and reconfigure.rs. We can figure out later on how to fix this. * fix lint * fix: make the UnreliableNetwork not return join handles, switch reconfig messages Primary -> Worker to the Reliable network * feat: add a test to check implementations of unreliable send don't block Co-authored-by: Andrey Chursin <[email protected]>
mwtian
pushed a commit
to mwtian/sui
that referenced
this pull request
Sep 29, 2022
* Do not block workers during unreliable_send `UnreliableNetwork::unreliable_send` invokes blocking RPC on remote peer in order to send the message. In order to make the RPC call `unreliable_send` spawns a new task on the bounded executor, so that `unreliable_send` normally is not blocked on remote peer processing the RPC. However, if remote peer is slow or unresponsive, the bounded executor can get full. In this case `UnreliableNetwork::unreliable_send` will block the caller until executor has permits available. This leads to a situation where single slow/unresponsive peer can block entire critical task by not replying on send message RPC, if too many of them are in-flight. We [see](http://grafana.shared.internal.sui.io:3000/explore?left=%7B%22datasource%22:%22zIREmpMVz%22,%22queries%22:%5B%7B%22expr%22:%22%7B__name__%3D~%5C%22tx_.%2A%5C%22,job%3D%5C%22mh1-suival-2%5C%22%7D%3E200%22,%22refId%22:%22A%22,%22interval%22:%22%22,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D,%7B%22refId%22:%22B%22,%22editorMode%22:%22code%22,%22expr%22:%22primary_network_available_tasks%7Bjob%3D%5C%22mh1-suival-2%5C%22%7D%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%221662553513874%22,%22to%22:%221662586403070%22%7D%7D&orgId=1) this happening in the stuck cluster. The single slow peer `ty6-suival-1` here completely blocks `helper` task on `mh1-suival-2`, so that this validator can not reply to requests from other (healthy) peer. The solution is to use `try_spawn` and if the bounded executor for worker is full we simply drop the message. It is hard to predict whether this particular fix will completely solve the issue we are seeing, but the behavior seem to be more in line with how BFT system should handle unresponsive peers. Some implementation details: Ideally we want to just remove return value from `UnreliableNetwork::unreliable_send`, but it is currently used by some unit tests and reconfigure.rs. We can figure out later on how to fix this. * fix lint * fix: make the UnreliableNetwork not return join handles, switch reconfig messages Primary -> Worker to the Reliable network * feat: add a test to check implementations of unreliable send don't block Co-authored-by: Andrey Chursin <[email protected]>
mwtian
pushed a commit
to mwtian/sui
that referenced
this pull request
Sep 29, 2022
When converting to using anemo, we accidentally reverted MystenLabs#940 which changed unreliable_send send to not block when the executor limit had been reached. This PR addresses that mistake by re-introducing that fix.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fix error in About Sui link