Limit number of transaction messages deserialized by peer actors per received block message #3317
Conversation
…s is important because the deserialization time introduces enough latency on other messages to slow down block production when under heavy transaction load
chain/network/src/codec.rs
Outdated
    body: RoutedMessageBody::ForwardTx(tx),
});
let bytes = msg.try_to_vec().unwrap();
assert!(is_forward_tx(&bytes));
We need to cover other situations (SECP key, hash instead of peer id, etc).
Done in 9674927
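As a rough illustration of why those extra cases matter: the byte offsets that is_forward_tx relies on shift depending on which public-key variant (and which target variant) precedes the body, so each layout needs its own test case. The toy enum below is a stand-in for illustration only (it assumes borsh 0.x with try_to_vec; in nearcore the SECP key is a fixed 64-byte field with no length prefix), not the actual test code added in that commit.

```rust
// Toy sketch: the serialized length of a peer-id-like enum depends on the
// key variant, so any hand-rolled offset arithmetic must be tested per variant.
use borsh::BorshSerialize;

#[derive(BorshSerialize)]
enum ToyPeerId {
    Ed25519([u8; 32]),  // variant tag 0, then 32 raw key bytes
    Secp256k1(Vec<u8>), // variant tag 1; real key is 64 fixed bytes, Vec used here for simplicity
}

fn main() {
    let ed = ToyPeerId::Ed25519([0u8; 32]).try_to_vec().unwrap();
    let secp = ToyPeerId::Secp256k1(vec![0u8; 64]).try_to_vec().unwrap();
    assert_eq!(ed.len(), 1 + 32);       // tag + fixed-size key
    assert_eq!(secp.len(), 1 + 4 + 64); // tag + Vec length prefix + key bytes
}
```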
chain/network/src/codec.rs
Outdated
let ttl_idx = signature_variant_idx + signature_field_len;
let message_body_idx = ttl_idx + 1;

bytes[message_body_idx] == 1
What field does this one correspond to?
This byte is the enum variant of RoutedMessageBody.
chain/network/src/peer.rs
Outdated
@@ -165,6 +173,8 @@ pub struct Peer {
    last_time_received_message_update: Instant,
    /// Dynamic Prometheus metrics
    network_metrics: NetworkMetrics,
    /// How many transactions we have received since the last block message
    txns_since_last_block: Arc<RwLock<usize>>,
Use Arc<AtomicUsize> instead?
Done in 7ee90fb
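For reference, a minimal sketch of the suggested swap; the counter name matches the diff above, but everything else here is illustrative rather than the actual nearcore code:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

fn main() {
    // Before: Arc<RwLock<usize>>, which takes a write lock on every increment.
    // After: Arc<AtomicUsize>, lock-free and cheap to clone into each peer actor.
    let txns_since_last_block = Arc::new(AtomicUsize::new(0));

    let for_peer_actor = Arc::clone(&txns_since_last_block);
    for_peer_actor.fetch_add(1, Ordering::SeqCst);    // count a received transaction
    txns_since_last_block.store(0, Ordering::SeqCst); // reset when a block arrives

    assert_eq!(txns_since_last_block.load(Ordering::SeqCst), 0);
}
```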
chain/network/src/codec.rs
Outdated
    }
}

pub fn is_forward_tx(bytes: &[u8]) -> bool {
For this particular function, what prevents a malicious actor from constructing bytes according to this function so that it returns true, but the bytes do not actually deserialize to a transaction?
An attacker could do that, but it would not accomplish anything.
The counter tracking the number of transactions is only incremented after successful deserialization, so an attacker could not artificially cause the node to reject transactions early. And if the node were already under load, then any message which returns true from is_forward_tx will be dropped, so again the attacker is not accomplishing anything.
I think the worst thing an attacker could do would be to send many well-formed transaction messages (so they deserialize correctly) that are nonetheless not valid transactions. Then the node would stop accepting any more transaction messages until the next block, but no new transactions would have been added to the mempool. Maybe we need more logic to ban peers which attempt such attacks?
chain/network/src/peer.rs
Outdated
@@ -641,6 +659,13 @@ impl StreamHandler<Result<Vec<u8>, ReasonForBan>> for Peer {
                return;
            }
        };
        if let PeerMessage::Routed(rm) = &peer_msg {
why not if let PeerMessage::Routed(RoutedMessageBody::ForwardTx(_)) = &peer_msg?
Done in ddacef7
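A small self-contained sketch of the nested-pattern idea; the enums here are simplified stand-ins (nearcore's real PeerMessage and RoutedMessage carry many more fields), so this only illustrates the shape of the match, not the actual change in that commit:

```rust
// Simplified stand-ins for the real network message types.
#[allow(dead_code)]
enum RoutedMessageBody {
    ForwardTx(u64),
    Other,
}

struct RoutedMessage {
    body: RoutedMessageBody,
}

enum PeerMessage {
    Routed(RoutedMessage),
    Other,
}

// Destructuring down to the body variant in one pattern avoids a second
// match on rm.body inside the branch.
fn is_forward_tx_msg(peer_msg: &PeerMessage) -> bool {
    matches!(
        peer_msg,
        PeerMessage::Routed(RoutedMessage {
            body: RoutedMessageBody::ForwardTx(_),
        })
    )
}

fn main() {
    let msg = PeerMessage::Routed(RoutedMessage {
        body: RoutedMessageBody::ForwardTx(42),
    });
    assert!(is_forward_tx_msg(&msg));
    assert!(!is_forward_tx_msg(&PeerMessage::Other));
}
```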
chain/network/src/codec.rs
Outdated
pub fn is_forward_tx(bytes: &[u8]) -> bool {
    // PeerMessage::Routed variant == 13
    if bytes[0] == 13 {
Where do we check bytes length?
good point.
Good catch! Addressed in 1af9bf0
    let peer_message_variant = *bytes.get(0)?;

    // PeerMessage::Routed variant == 13
    if peer_message_variant != 13 {
        return Some(false);
    }

    let target_field_variant = *bytes.get(1)?;
    let target_field_len = if target_field_variant == 0 {
        // PeerIdOrHash::PeerId
        let peer_id_variant = *bytes.get(2)?;
        peer_id_type_field_len(peer_id_variant)?
    } else if target_field_variant == 1 {
        // PeerIdOrHash::Hash is always 32 bytes
        32
    } else {
        return None;
    };
This is error-prone. It implicitly relies on the enum variant order, on borsh serialization details, and on the layout of the structures involved. There are so many things that can go wrong if we change anything in any of the underlying structures that I am not sure it is worth the trouble. (As to the proposal to extend Borsh, I wonder what that could look like, but that would at least be maintainable and reliable.)
Can we decouple deserialization into a separate actor, so we can scale it and not block this actor on deserialization while it does its job?
Yes, as I said in the PR description, this solution is a little hacky and very brittle. However, I don't think there is another solution at this time. Putting deserialization into another actor does not help unless we can have a different actor for different message types (which would then still require some way to know the message type before deserializing). The issue is not that the peer actor becomes blocked, it is that if we deserialize messages in the order they are received then we introduce a lot of latency on important messages (e.g. block approval) due to long runs of many transaction messages in a row.
That said, as a temporary solution, I think this is ok because it is unlikely the various things this code depends on (enum order, borsh details, etc) will change, because any such change would be a protocol change. All nodes must agree on how to communicate with each other, so changing a message format is a protocol change, and I think that should happen pretty rarely. I still agree that this code is a nightmare to maintain, so we should not keep it around in the long term, but as a band-aid fix for a DoS vector, I think it does the job.
@frol to partially address your concern, this function is covered in the test that is part of the CI. So we should spot the issue if the message format has changed
(Also, changing the message format at this time is extremely unpleasant, because nodes of the future protocol versions must be able to talk to the nodes of the past).
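To make that CI-test point concrete, here is a self-contained sketch of the underlying trick: borsh writes an enum's variant index as the leading byte, so the message kind can be checked without deserializing the payload. The toy enum and variant numbers below are illustrative only (and assume borsh 0.x, where BorshSerialize provides try_to_vec); the real is_forward_tx also has to walk past the routed-message target and signature fields, which is where the brittleness comes from.

```rust
// Sketch of "peek at the borsh variant tag" with toy types.
use borsh::BorshSerialize;

#[derive(BorshSerialize)]
enum ToyPeerMessage {
    Block(Vec<u8>),     // serialized with leading variant tag 0
    ForwardTx(Vec<u8>), // serialized with leading variant tag 1
}

// The first byte is the enum variant index, so a cheap check suffices.
fn looks_like_forward_tx(bytes: &[u8]) -> bool {
    bytes.first() == Some(&1)
}

fn main() {
    let tx = ToyPeerMessage::ForwardTx(vec![0xaa; 32]).try_to_vec().unwrap();
    let block = ToyPeerMessage::Block(vec![0xbb; 64]).try_to_vec().unwrap();
    assert!(looks_like_forward_tx(&tx));
    assert!(!looks_like_forward_tx(&block));
}
```

A unit test like the one quoted earlier then pins this layout down, so a change to the enum order or to the serialization would fail CI rather than silently break the detection.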
Under heavy load (>10k tps sent by clients network-wide), the current Near node has not performed well (< 0.1 bps, ~100 tps processed, network-wide). (See the plot here: https://drive.google.com/file/d/1xiaxvHSTJNKHjTYIztbwhFuOaFMbu1Mr/view?usp=sharing)

It turns out the largest contributor to this slowness was actually in the Peer actor. All messages received from a peer are funneled through the same actor (there is one such actor per peer), and peers will forward transactions to the next chunk producer in order to minimize the time until each transaction is first included in a block. However, when there are 10k transactions received per second across the network, this means all of them are pushed onto a single node and funneled through a small number of peer actors. As a result, there can be a long queue of messages waiting to be deserialized and dispatched to the appropriate handler. And there can be long runs of transaction messages in this queue, introducing a lot of latency in processing more "important" messages like blocks and their approvals.

In this PR we fix this issue by identifying which messages are transactions before deserializing them and limiting the number of these messages per block message received (across all peers). Messages in excess of this limit are simply dropped (never deserialized, never dispatched). With this change, the network is able to maintain ~0.3 bps and ~600 tps throughput, even under heavy load. (See the plot here: https://user-images.githubusercontent.com/13788876/93106250-0a756780-f67e-11ea-9b15-708d1719d943.PNG; a sketch of the overall throttling flow follows the notes below.)

Fixes #3110 (sort of -- at least the network does not stall now)

Notes:

* This could lead to a bad UX in situations where the network is receiving more transactions than it can process, because transactions will be randomly dropped. One solution to this could be to have a dedicated actor for deserializing and dispatching specifically transaction messages, though it should still have a finite queue to prevent DoS attacks where many large transactions are used to run the node out of memory.
* The current solution to detect when a message is a transaction without fully deserializing it is a little hacky and very brittle. It might be worth introducing this sort of type detection as a first-class part of Borsh so that the logic would be automatically generated with the existing derive macros.
* ~~In several places throughout the code base we are using std::sync::RwLock, whereas here I am making the choice to use parking_lot::RwLock. The reason for this choice is the fairness guarantee (https://docs.rs/parking_lot/0.11.0/parking_lot/type.RwLock.html#fairness) on the parking_lot variant. Essentially, the std version does not make any guarantee about which thread will acquire the lock under contention, and it could be that the thread which most recently held the lock obtains it repeatedly. This is undesirable here because each peer actor runs in a separate thread and all of them should have equal access to the lock.~~ Replaced with an atomic primitive.
* Presently I am only detecting and limiting the ForwardTx type of message because PeerMessage::Transaction is never constructed or sent as far as I can tell. Should that message type be deprecated somehow?
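A minimal end-to-end sketch of the throttling flow described above, with toy stand-ins for the real codec and actor plumbing; the one-byte framing, the limit value, and the static counter are all illustrative assumptions rather than nearcore's actual message format or configuration:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative budget; the real limit used by the PR may differ.
const MAX_TXNS_PER_BLOCK_MESSAGE: usize = 4;

static TXNS_SINCE_LAST_BLOCK: AtomicUsize = AtomicUsize::new(0);

// Toy framing assumption: first byte 0 == block message, 1 == transaction message.
fn handle_raw_message(bytes: &[u8]) -> Option<&'static str> {
    match *bytes.first()? {
        0 => {
            // A block arrived: reset the per-block transaction budget.
            TXNS_SINCE_LAST_BLOCK.store(0, Ordering::SeqCst);
            Some("deserialize and process block")
        }
        1 => {
            // Cheap check on the raw bytes before paying for deserialization.
            if TXNS_SINCE_LAST_BLOCK.fetch_add(1, Ordering::SeqCst) >= MAX_TXNS_PER_BLOCK_MESSAGE {
                None // over budget: drop without deserializing or dispatching
            } else {
                Some("deserialize and forward transaction")
            }
        }
        _ => Some("deserialize and dispatch other message"),
    }
}

fn main() {
    let tx = [1u8, 0xaa];
    let block = [0u8, 0xbb];
    // Six transactions between blocks: with a budget of 4, the last two are dropped.
    for i in 0..6 {
        println!("tx {}: {:?}", i, handle_raw_message(&tx));
    }
    println!("block: {:?}", handle_raw_message(&block));
    // The budget resets after the block, so transactions are accepted again.
    println!("tx after block: {:?}", handle_raw_message(&tx));
}
```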