Update index.md #940

Merged
merged 2 commits into main from Clay-Mysten-patch-10 on Mar 18, 2022
Conversation

Clay-Mysten (Contributor)

Fix error in About Sui link
Point out Source Code link for direct contributions
@AllyMedina1 (Contributor) left a comment:


generate* a pull request

@Clay-Mysten merged commit 2c3156d into main on Mar 18, 2022
@Clay-Mysten deleted the Clay-Mysten-patch-10 branch on March 18, 2022 at 23:07
mwtian pushed a commit that referenced this pull request Sep 12, 2022
* Do not block workers during unreliable_send

`UnreliableNetwork::unreliable_send` invokes a blocking RPC on the remote peer to send the message. To make the RPC call, `unreliable_send` spawns a new task on the bounded executor, so it is normally not blocked while the remote peer processes the RPC.

However, if the remote peer is slow or unresponsive, the bounded executor can fill up. In this case `UnreliableNetwork::unreliable_send` blocks the caller until the executor has permits available.

This leads to a situation where a single slow or unresponsive peer can block an entire critical task by not replying to the send-message RPC, if too many such calls are in flight.
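
To make the failure mode concrete, here is a minimal sketch of a bounded executor whose `spawn` waits for a semaphore permit. The names and structure are illustrative (a plausible tokio-based design), not the actual narwhal `BoundedExecutor` API:

```rust
use std::future::Future;
use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::task::JoinHandle;

// Illustrative bounded executor: capacity is modeled as semaphore permits.
pub struct BoundedExecutor {
    semaphore: Arc<Semaphore>,
}

impl BoundedExecutor {
    pub fn new(capacity: usize) -> Self {
        Self {
            semaphore: Arc::new(Semaphore::new(capacity)),
        }
    }

    // Waits for a permit before spawning. If every permit is held by an
    // in-flight RPC to a slow peer, this await blocks the caller, which is
    // exactly the failure mode described above.
    pub async fn spawn<F>(&self, fut: F) -> JoinHandle<F::Output>
    where
        F: Future + Send + 'static,
        F::Output: Send + 'static,
    {
        let permit = self.semaphore.clone().acquire_owned().await.unwrap();
        tokio::spawn(async move {
            let out = fut.await;
            drop(permit); // capacity is freed only once the RPC completes
            out
        })
    }
}
```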

We [see](http://grafana.shared.internal.sui.io:3000/explore?left=%7B%22datasource%22:%22zIREmpMVz%22,%22queries%22:%5B%7B%22expr%22:%22%7B__name__%3D~%5C%22tx_.%2A%5C%22,job%3D%5C%22mh1-suival-2%5C%22%7D%3E200%22,%22refId%22:%22A%22,%22interval%22:%22%22,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D,%7B%22refId%22:%22B%22,%22editorMode%22:%22code%22,%22expr%22:%22primary_network_available_tasks%7Bjob%3D%5C%22mh1-suival-2%5C%22%7D%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%221662553513874%22,%22to%22:%221662586403070%22%7D%7D&orgId=1) this happening in the stuck cluster: the single slow peer `ty6-suival-1` completely blocks the `helper` task on `mh1-suival-2`, so this validator cannot reply to requests from other (healthy) peers.

The solution is to use `try_spawn`: if the bounded executor for the worker is full, we simply drop the message.
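
Under the same illustrative assumptions, extending the sketch above, the fix amounts to a `try_spawn` that fails fast instead of waiting, so the caller can drop the message:

```rust
use tokio::sync::TryAcquireError;

impl BoundedExecutor {
    // Fails immediately when no permit is available, so a slow peer can no
    // longer block the caller; the caller drops the message instead.
    pub fn try_spawn<F>(&self, fut: F) -> Result<JoinHandle<F::Output>, TryAcquireError>
    where
        F: Future + Send + 'static,
        F::Output: Send + 'static,
    {
        let permit = self.semaphore.clone().try_acquire_owned()?;
        Ok(tokio::spawn(async move {
            let out = fut.await;
            drop(permit);
            out
        }))
    }
}
```

With this shape, a full executor surfaces as an immediate error at the send site, which fits unreliable-delivery semantics: dropping a message is acceptable, blocking a critical task is not.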

It is hard to predict whether this particular fix will completely solve the issue we are seeing, but the behavior seems more in line with how a BFT system should handle unresponsive peers.

Some implementation details:

Ideally we would simply remove the return value from `UnreliableNetwork::unreliable_send`, but it is currently used by some unit tests and by reconfigure.rs. We can figure out later how to fix this.

* fix lint

* fix: make the UnreliableNetwork not return join handles, switch reconfig messages Primary -> Worker to the Reliable network

* feat: add a test to check that implementations of unreliable send don't block (see the sketch below)

Co-authored-by: Andrey Chursin <[email protected]>
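
A hypothetical shape for that test, built on the sketch above rather than taken from the actual commit: saturate the executor, then assert that a further send attempt returns immediately instead of blocking.

```rust
#[tokio::test]
async fn unreliable_send_does_not_block() {
    let executor = BoundedExecutor::new(1);
    // Occupy the only permit with a task that never completes.
    let _busy = executor
        .try_spawn(std::future::pending::<()>())
        .expect("first spawn should succeed");
    // A second attempt must fail fast instead of blocking the caller.
    assert!(executor.try_spawn(async {}).is_err());
}
```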
mwtian pushed a commit to mwtian/sui that referenced this pull request Sep 29, 2022
When converting to anemo, we accidentally reverted MystenLabs#940, which changed unreliable_send to not block when the executor limit had been reached. This PR addresses that mistake by re-introducing the fix.