Sending large quantities of transactions slows the network down #3110
Labels: A-chain (Area: Chain, client & related)

We observed on the mocknet that sending more than 500 transactions per second slows the network down: it starts processing fewer transactions than when it is not under such heavy load. Network performance must not be negatively affected by the number of transactions submitted.

Comments
nearprotocol-bulldozer bot pushed a commit that referenced this issue on Sep 16, 2020:
…received block message (#3317)

**Under heavy load (>10k tps sent by clients network-wide), the current Near node has not performed well (<0.1 bps, ~100 tps processed network-wide).** (See plot [here](https://drive.google.com/file/d/1xiaxvHSTJNKHjTYIztbwhFuOaFMbu1Mr/view?usp=sharing).)

It turns out the largest contributor to this slowness was actually the `Peer` actor. All messages received from a peer are funneled through the same actor (there is one such actor per peer), and peers forward transactions to the next chunk producer in order to minimize the time until each transaction is first included in a block. When 10k transactions per second are received across the network, all of them are pushed onto a single node and funneled through a small number of peer actors. As a result, a long queue of messages can build up waiting to be deserialized and dispatched to the appropriate handler, and long runs of transaction messages in that queue add a lot of latency to the processing of more "important" messages like blocks and their approvals.

In this PR we fix the issue by identifying which messages are transactions _before_ deserializing them, and limiting the number of such messages accepted per block message received (across all peers). Messages in excess of this limit are simply dropped: never deserialized, never dispatched.

**With this change, the network is able to maintain ~0.3 bps and ~600 tps throughput, even under heavy load.** (See the plot attached to the PR.)

Fixes #3110 (sort of -- at least the network no longer stalls).

Notes:

* This could lead to a bad UX when the network receives more transactions than it can process, because transactions will be dropped at random. One solution could be a dedicated actor for deserializing and dispatching transaction messages specifically, though it should still have a finite queue to prevent DoS attacks in which many large transactions are used to run the node out of memory.
* The current way of detecting that a message is a transaction without fully deserializing it is a little hacky and quite brittle (see the sketch after this message). It might be worth making this sort of type detection a first-class part of Borsh, so that the logic is generated automatically by the existing derive macros.
* ~~In several places throughout the code base we use `std::sync::RwLock`, whereas here I chose `parking_lot::RwLock` for its [fairness guarantee](https://docs.rs/parking_lot/0.11.0/parking_lot/type.RwLock.html#fairness): the `std` version makes no guarantee about which thread acquires the lock under contention, so the thread that most recently held it may obtain it repeatedly. That is undesirable here because each peer actor runs in a separate thread and all of them should have equal access to the lock.~~ Replaced with an atomic primitive.
* Presently I only detect and limit `ForwardTx` messages, because `PeerMessage::Transaction` is never constructed or sent as far as I can tell. Should that message type be deprecated somehow?
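For concreteness, here is a minimal sketch of the mechanism the PR describes. This is not the PR's actual code: the variant tags, the byte offset, and the per-block limit are all placeholder values, and the real `PeerMessage`/`RoutedMessageBody` definitions would determine the true ones. What makes the peek possible is that Borsh encodes an enum as a one-byte variant index followed by the variant's payload; what makes it brittle is that any change to the fields preceding the inner body tag shifts the assumed offset.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// All tags, offsets, and the limit below are illustrative assumptions,
// not the values used in the actual PR.
const PEER_MESSAGE_ROUTED_TAG: u8 = 13; // hypothetical `PeerMessage::Routed` variant index
const ROUTED_BODY_TAG_OFFSET: usize = 1 + 97; // hypothetical fixed-size prefix before the body tag
const FORWARD_TX_TAG: u8 = 1; // hypothetical `RoutedMessageBody::ForwardTx` variant index
const MAX_TXS_PER_BLOCK_MESSAGE: usize = 1_000; // hypothetical limit

/// Shared across all peer actors. An atomic counter sidesteps the
/// lock-fairness concerns discussed in the PR notes.
static TXS_SINCE_LAST_BLOCK: AtomicUsize = AtomicUsize::new(0);

/// Brittle by design (mirroring the "hacky" caveat above): peek at Borsh's
/// leading one-byte enum tags at assumed offsets instead of deserializing.
fn looks_like_forward_tx(bytes: &[u8]) -> bool {
    bytes.first() == Some(&PEER_MESSAGE_ROUTED_TAG)
        && bytes.get(ROUTED_BODY_TAG_OFFSET) == Some(&FORWARD_TX_TAG)
}

/// Decide whether to discard a raw message *before* paying the cost of
/// deserializing and dispatching it. Returns true if it should be dropped.
fn should_drop(bytes: &[u8]) -> bool {
    looks_like_forward_tx(bytes)
        // `fetch_add` returns the previous value, so the first
        // MAX_TXS_PER_BLOCK_MESSAGE transactions still get through.
        && TXS_SINCE_LAST_BLOCK.fetch_add(1, Ordering::Relaxed) >= MAX_TXS_PER_BLOCK_MESSAGE
}

/// Called whenever a block message is received, opening the window for the
/// next batch of forwarded transactions.
fn on_block_message() {
    TXS_SINCE_LAST_BLOCK.store(0, Ordering::Relaxed);
}
```

The second note above suggests making this kind of peeking a first-class Borsh feature. A derive macro could, hypothetically, generate something along these lines (this trait is not part of the `borsh` crate; it is only a shape the idea might take):

```rust
/// Hypothetical trait that a Borsh derive macro could implement for enums,
/// replacing the hand-rolled offset math above.
trait BorshPeekVariant {
    /// Return the top-level variant index encoded in `bytes`, if any,
    /// without deserializing the payload.
    fn peek_variant(bytes: &[u8]) -> Option<u8> {
        bytes.first().copied()
    }
}
```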
chefsale pushed a commit that referenced this issue on Sep 21, 2020 (same commit message as above).