Limit number of transaction messages deserialized by peer actors per received block message #3317

Merged
merged 7 commits into near:master from limit-tx-over-network on Sep 16, 2020

Conversation

Contributor

@birchmd birchmd commented Sep 14, 2020

**Under heavy load (>10k tps sent by clients network-wide), the current Near node has not performed well (< 0.1 bps, ~100 tps processed, network-wide).** (See plot [here](https://drive.google.com/file/d/1xiaxvHSTJNKHjTYIztbwhFuOaFMbu1Mr/view?usp=sharing))

It turns out the largest contributor to this slowness was actually in the `Peer` actor. All messages received from a peer are funneled through the same actor (there is one such actor per peer), and peers will forward transactions to the next chunk producer in order to minimize the time until each transaction is first included in a block. However, when there are 10k transactions received per second across the network, this means all of them are pushed onto a single node and funneled through a small number of peer actors. As a result, there can be a long queue of messages waiting to be deserialized and dispatched to the appropriate handler. And there can be long runs of transaction messages in this queue, introducing a lot of latency in processing more "important" messages like blocks and their approvals.

In this PR we fix this issue by identifying which messages are transactions _before_ deserializing and limiting the number of these messages per block message received (across all peers). Messages in excess of this limit are simply dropped (never deserialized, never dispatched). **With this change, the network is able to maintain ~0.3 bps and ~600 tps throughput, even under heavy load.** (See plot below)

![tps_bps_with_network_throttling](https://user-images.githubusercontent.com/13788876/93106250-0a756780-f67e-11ea-9b15-708d1719d943.PNG)

Fixes #3110 (sort of -- at least the network does not stall now)

Notes:

  • This could lead to a bad UX in situations where the network is receiving more transactions than it can process, because transactions will be randomly dropped. One solution could be to have a dedicated actor for deserializing and dispatching transaction messages specifically, though it should still have a finite queue to prevent DoS attacks where many large transactions are used to run the node out of memory.

  • The current solution to detect when a message is a transaction without fully deserializing it is a little hacky and very brittle. It might be worth introducing this sort of type detection as a first-class part of Borsh so that the logic would be automatically generated with the existing derive macros.

  • ~~In several places throughout the code base we are using `std::sync::RwLock`, whereas here I am making the choice to use `parking_lot::RwLock`. The reason for this choice is because of the [fairness guarantee](https://docs.rs/parking_lot/0.11.0/parking_lot/type.RwLock.html#fairness) on the `parking_lot` variant. Essentially, the `std` version does not make any guarantee about what thread will access the lock in the case of contention, and it could be that the thread which most recently held the lock will obtain it repeatedly. This is undesirable here because each peer actor runs in a separate thread and all of them should have equal access to the lock.~~ Replaced with atomic primitive.

  • Presently I am only detecting and limiting `ForwardTx` messages because `PeerMessage::Transaction` is never constructed or sent as far as I can tell. Should that message type be deprecated somehow?
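
To make the mechanism easier to follow, here is a small self-contained sketch of the idea. The names `is_forward_tx` and `txns_since_last_block` mirror the diff; everything else (the mock message enum, the limit value, the handler shape) is illustrative only and not the actual nearcore code:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Assumed limit, purely for illustration; the real value lives in the PR diff.
const MAX_TXNS_PER_BLOCK: usize = 1_000;

enum MockPeerMessage {
    Block,
    ForwardTx,
    Other,
}

// Stand-in for the real `is_forward_tx`, which inspects Borsh enum-variant
// bytes without deserializing the whole message.
fn looks_like_forward_tx(bytes: &[u8]) -> bool {
    bytes.first() == Some(&1)
}

fn handle_bytes(bytes: &[u8], txns_since_last_block: &AtomicUsize) {
    // Drop excess transaction messages before paying the deserialization cost.
    if looks_like_forward_tx(bytes)
        && txns_since_last_block.load(Ordering::Relaxed) >= MAX_TXNS_PER_BLOCK
    {
        return;
    }

    // Mock "deserialization" step; the real code uses Borsh here.
    let msg = match bytes.first() {
        Some(&0) => MockPeerMessage::Block,
        Some(&1) => MockPeerMessage::ForwardTx,
        _ => MockPeerMessage::Other,
    };

    match msg {
        // Only count transactions that actually deserialized successfully.
        MockPeerMessage::ForwardTx => {
            txns_since_last_block.fetch_add(1, Ordering::Relaxed);
        }
        // A new block resets the per-block budget.
        MockPeerMessage::Block => {
            txns_since_last_block.store(0, Ordering::Relaxed);
        }
        MockPeerMessage::Other => {}
    }
    // ...then dispatch the message to its handler as before.
}

fn main() {
    let counter = AtomicUsize::new(0);
    handle_bytes(&[1], &counter); // a transaction message increments the counter
    handle_bytes(&[0], &counter); // a block message resets it
    assert_eq!(counter.load(Ordering::Relaxed), 0);
}
```

The key property is that the expensive deserialization is skipped entirely for over-budget transaction messages, while blocks and their approvals are always processed.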

…s is important because the deserialization time introduces enough latency on other messages to slow down block production when under heavy transaction load

body: RoutedMessageBody::ForwardTx(tx),
});
let bytes = msg.try_to_vec().unwrap();
assert!(is_forward_tx(&bytes));
Collaborator

We need to cover other situations (SECP key, hash instead of peer id, etc).

Contributor Author

Done in 9674927

let ttl_idx = signature_variant_idx + signature_field_len;
let message_body_idx = ttl_idx + 1;

bytes[message_body_idx] == 1
Collaborator

What field does this one correspond to?

Contributor Author

This byte is the enum variant of RoutedMessageBody.
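
(For readers tracing these offsets, a rough sketch of the assumed wire layout follows; it is inferred from the surrounding diff and field names rather than restated from the struct definitions, so treat it as an aid, not a spec.)

```rust
// Assumed Borsh layout of a routed PeerMessage, as read by is_forward_tx:
//
//   byte 0                  PeerMessage enum variant        (13 == Routed)
//   byte 1                  RoutedMessage target variant    (PeerIdOrHash: 0 == PeerId, 1 == Hash)
//   bytes 2..               target payload (a PeerId, or a 32-byte hash)
//   ...                     author PeerId and signature fields (lengths depend on key type)
//   byte ttl_idx            ttl, a single byte right after the signature
//   byte message_body_idx   RoutedMessageBody enum variant  (ForwardTx)
//                           where message_body_idx == ttl_idx + 1
```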

@@ -165,6 +173,8 @@ pub struct Peer {
last_time_received_message_update: Instant,
/// Dynamic Prometheus metrics
network_metrics: NetworkMetrics,
/// How many transactions we have received since the last block message
txns_since_last_block: Arc<RwLock<usize>>,
Collaborator

Use `Arc<AtomicUsize>` instead?

Contributor Author

Done in 7ee90fb
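
(For illustration, a minimal standalone sketch, not the actual diff, of why an `Arc<AtomicUsize>` suffices here: every peer actor clones the `Arc` and uses `fetch_add`/`store`, so no lock, fair or otherwise, is needed.)

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    // One counter shared by every peer actor (each peer runs on its own thread).
    let txns_since_last_block = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&txns_since_last_block);
            thread::spawn(move || {
                // A peer actor counting a successfully deserialized transaction.
                counter.fetch_add(1, Ordering::Relaxed);
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    assert_eq!(txns_since_last_block.load(Ordering::Relaxed), 4);

    // A block message resets the budget for the next block.
    txns_since_last_block.store(0, Ordering::Relaxed);
}
```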

}
}

pub fn is_forward_tx(bytes: &[u8]) -> bool {
Collaborator

For this particular function, what prevents a malicious actor from constructing bytes so that it returns true but the bytes do not actually deserialize to a transaction?

Contributor Author

An attacker could do that, but it would not accomplish anything.

The counter tracking the number of transactions is only incremented after successful deserialization, so an attacker could not artificially cause the node to reject transactions early. And if the node were already under load, then any message for which `is_forward_tx` returns true would be dropped, so again the attacker accomplishes nothing.

I think the worst thing an attacker could do would be to send many well-formed transaction messages (so they deserialize correctly) that are nonetheless not valid transactions. Then the node would stop accepting any more transaction messages until the next block, but no new transactions would have been added to the mempool. Maybe we need more logic to ban peers which attempt such attacks?

@@ -641,6 +659,13 @@ impl StreamHandler<Result<Vec<u8>, ReasonForBan>> for Peer {
return;
}
};
if let PeerMessage::Routed(rm) = &peer_msg {
Collaborator

Why not `if let PeerMessage::Routed(RoutedMessageBody::ForwardTx(_)) = &peer_msg`?

Contributor Author

Done in ddacef7


pub fn is_forward_tx(bytes: &[u8]) -> bool {
// PeerMessage::Routed variant == 13
if bytes[0] == 13 {
Collaborator

Where do we check bytes length?

Collaborator

good point.

Contributor Author

Good catch! Addressed in 1af9bf0

Comment on lines +88 to +105
let peer_message_variant = *bytes.get(0)?;

// PeerMessage::Routed variant == 13
if peer_message_variant != 13 {
return Some(false);
}

let target_field_variant = *bytes.get(1)?;
let target_field_len = if target_field_variant == 0 {
// PeerIdOrHash::PeerId
let peer_id_variant = *bytes.get(2)?;
peer_id_type_field_len(peer_id_variant)?
} else if target_field_variant == 1 {
// PeerIdOrHash::Hash is always 32 bytes
32
} else {
return None;
};
Collaborator

This is error-prone. It implicitly relies on enum variant order, Borsh serialization details, and the layout of the structures involved. There are so many things that can go wrong if we change anything in any of the underlying structures that I am not sure it is worth the trouble. (As to the proposal to extend Borsh, I wonder what that could look like, but at least that would be maintainable and reliable.)

Can we decouple deserialization into a separate actor, so we can scale it and not block this actor on deserialization while it does its job?

Contributor Author

Yes, as I said in the PR description, this solution is a little hacky and very brittle. However, I don't think there is another solution at this time. Putting deserialization into another actor does not help unless we have different actors for different message types (which would still require some way to know the message type before deserializing). The issue is not that the peer actor becomes blocked; it is that if we deserialize messages in the order they are received, then we introduce a lot of latency on important messages (e.g. block approvals) due to long runs of many transaction messages in a row.

That said, as a temporary solution I think this is ok, because it is unlikely that the various things this code depends on (enum order, Borsh details, etc.) will change, since any such change would be a protocol change. All nodes must agree on how to communicate with each other, so changing a message format is a protocol change, and I think that should happen pretty rarely. I still agree that this code is a nightmare to maintain, so we should not keep it around in the long term, but as a band-aid fix for a DoS vector, I think it does the job.

Collaborator
@SkidanovAlex SkidanovAlex Sep 16, 2020

@frol to partially address your concern, this function is covered by a test that is part of CI, so we should spot the issue if the message format changes.
(Also, changing the message format at this time is extremely unpleasant, because nodes running future protocol versions must be able to talk to nodes running past versions.)

@nearprotocol-bulldozer nearprotocol-bulldozer bot merged commit 8d2c66a into near:master Sep 16, 2020
@birchmd birchmd deleted the limit-tx-over-network branch September 16, 2020 18:09
chefsale pushed a commit that referenced this pull request Sep 21, 2020
…received block message (#3317)

Development

Successfully merging this pull request may close these issues.

Sending large quantities of transactions slows the network down