Limit number of transaction messages deserialized by peer actors per received block message #3317

Merged
merged 7 commits into near:master from limit-tx-over-network on Sep 16, 2020

Conversation

Contributor

@birchmd birchmd commented Sep 14, 2020

**Under heavy load (>10k tps sent by clients network-wide), the current Near node has not performed well (< 0.1 bps, ~100 tps processed, network-wide).** (See plot [here](https://drive.google.com/file/d/1xiaxvHSTJNKHjTYIztbwhFuOaFMbu1Mr/view?usp=sharing))

It turns out the largest contributor to this slowness was actually in the `Peer` actor. All messages received from a peer are funneled through the same actor (there is one such actor per peer), and peers will forward transactions to the next chunk producer in order to minimize the time until each transaction is first included in a block. However, when there are 10k transactions received per second across the network, this means all of them are pushed onto a single node and funneled through a small number of peer actors. As a result, there can be a long queue of messages waiting to be deserialized and dispatched to the appropriate handler. And there can be long runs of transaction messages in this queue, introducing a lot of latency in processing more "important" messages like blocks and their approvals.

In this PR we fix this issue by identifying which messages are transactions _before_ deserializing and limiting the number of these messages per block message received (across all peers). Messages in excess of this limit are simply dropped (never deserialized, never dispatched). **With this change, the network is able to maintain ~0.3 bps and ~600 tps throughput, even under heavy load.** (See plot below)

![tps_bps_with_network_throttling](https://user-images.githubusercontent.com/13788876/93106250-0a756780-f67e-11ea-9b15-708d1719d943.PNG)

Fixes #3110 (sort of -- at least the network does not stall now)

Notes:

  • This could lead to a bad UX in situations where the network is receiving more transactions than it can process, because transactions will be randomly dropped. One solution could be to have a dedicated actor for deserializing and dispatching transaction messages specifically, though it should still have a finite queue to prevent DoS attacks where many large transactions are used to run the node out of memory.

  • The current solution to detect when a message is a transaction without fully deserializing it is a little hacky and very brittle. It might be worth introducing this sort of type detection as a first-class part of Borsh so that the logic would be automatically generated with the existing derive macros.

  • ~~In several places throughout the code base we are using `std::sync::RwLock`, whereas here I am making the choice to use `parking_lot::RwLock`. The reason for this choice is because of the [fairness guarantee](https://docs.rs/parking_lot/0.11.0/parking_lot/type.RwLock.html#fairness) on the `parking_lot` variant. Essentially, the `std` version does not make any guarantee about what thread will access the lock in the case of contention, and it could be that the thread which most recently held the lock will obtain it repeatedly. This is undesirable here because each peer actor runs in a separate thread and all of them should have equal access to the lock.~~ Replaced with atomic primitive.

  • Presently I am only detecting and limiting `ForwardTx` messages because `PeerMessage::Transaction` is never constructed or sent as far as I can tell. Should that message type be deprecated somehow?
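
To make the mechanism easier to follow, here is a small self-contained sketch of the idea. The names `is_forward_tx` and `txns_since_last_block` mirror the diff; everything else (the mock message enum, the limit value, the handler shape) is illustrative only and not the actual nearcore code:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Assumed limit, purely for illustration; the real value lives in the PR diff.
const MAX_TXNS_PER_BLOCK: usize = 1_000;

enum MockPeerMessage {
    Block,
    ForwardTx,
    Other,
}

// Stand-in for the real `is_forward_tx`, which inspects Borsh enum-variant
// bytes without deserializing the whole message.
fn looks_like_forward_tx(bytes: &[u8]) -> bool {
    bytes.first() == Some(&1)
}

fn handle_bytes(bytes: &[u8], txns_since_last_block: &AtomicUsize) {
    // Drop excess transaction messages before paying the deserialization cost.
    if looks_like_forward_tx(bytes)
        && txns_since_last_block.load(Ordering::Relaxed) >= MAX_TXNS_PER_BLOCK
    {
        return;
    }

    // Mock "deserialization" step; the real code uses Borsh here.
    let msg = match bytes.first() {
        Some(&0) => MockPeerMessage::Block,
        Some(&1) => MockPeerMessage::ForwardTx,
        _ => MockPeerMessage::Other,
    };

    match msg {
        // Only count transactions that actually deserialized successfully.
        MockPeerMessage::ForwardTx => {
            txns_since_last_block.fetch_add(1, Ordering::Relaxed);
        }
        // A new block resets the per-block budget.
        MockPeerMessage::Block => {
            txns_since_last_block.store(0, Ordering::Relaxed);
        }
        MockPeerMessage::Other => {}
    }
    // ...then dispatch the message to its handler as before.
}

fn main() {
    let counter = AtomicUsize::new(0);
    handle_bytes(&[1], &counter); // a transaction message increments the counter
    handle_bytes(&[0], &counter); // a block message resets it
    assert_eq!(counter.load(Ordering::Relaxed), 0);
}
```

The key property is that the expensive deserialization is skipped entirely for over-budget transaction messages, while blocks and their approvals are always processed.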

…s is important because the deserialization time introduces enough latency on other messages to slow down block production when under heavy transaction load

body: RoutedMessageBody::ForwardTx(tx),
});
let bytes = msg.try_to_vec().unwrap();
assert!(is_forward_tx(&bytes));
Collaborator

We need to cover other situations (SECP key, hash instead of peer id, etc).

Contributor Author

Done in 9674927

let ttl_idx = signature_variant_idx + signature_field_len;
let message_body_idx = ttl_idx + 1;

bytes[message_body_idx] == 1
Collaborator

What field does this one correspond to?

Contributor Author

This byte is the enum variant of RoutedMessageBody.
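
(For readers tracing these offsets, a rough sketch of the assumed wire layout follows; it is inferred from the surrounding diff and field names rather than restated from the struct definitions, so treat it as an aid, not a spec.)

```rust
// Assumed Borsh layout of a routed PeerMessage, as read by is_forward_tx:
//
//   byte 0                  PeerMessage enum variant        (13 == Routed)
//   byte 1                  RoutedMessage target variant    (PeerIdOrHash: 0 == PeerId, 1 == Hash)
//   bytes 2..               target payload (a PeerId, or a 32-byte hash)
//   ...                     author PeerId and signature fields (lengths depend on key type)
//   byte ttl_idx            ttl, a single byte right after the signature
//   byte message_body_idx   RoutedMessageBody enum variant  (ForwardTx)
//                           where message_body_idx == ttl_idx + 1
```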

@@ -165,6 +173,8 @@ pub struct Peer {
last_time_received_message_update: Instant,
/// Dynamic Prometheus metrics
network_metrics: NetworkMetrics,
/// How many transactions we have received since the last block message
txns_since_last_block: Arc<RwLock<usize>>,
Collaborator

Use `Arc<AtomicUsize>` instead?

Contributor Author

Done in 7ee90fb
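
(For illustration, a minimal standalone sketch, not the actual diff, of why an `Arc<AtomicUsize>` suffices here: every peer actor clones the `Arc` and uses `fetch_add`/`store`, so no lock, fair or otherwise, is needed.)

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    // One counter shared by every peer actor (each peer runs on its own thread).
    let txns_since_last_block = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&txns_since_last_block);
            thread::spawn(move || {
                // A peer actor counting a successfully deserialized transaction.
                counter.fetch_add(1, Ordering::Relaxed);
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    assert_eq!(txns_since_last_block.load(Ordering::Relaxed), 4);

    // A block message resets the budget for the next block.
    txns_since_last_block.store(0, Ordering::Relaxed);
}
```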

}
}

pub fn is_forward_tx(bytes: &[u8]) -> bool {
Collaborator

For this particular function, what prevents a malicious actor from constructing bytes so that it returns true but the bytes do not actually deserialize to a transaction?

Contributor Author

An attacker could do that, but it would not accomplish anything.

The counter tracking the number of transactions is only incremented after successful deserialization, so an attacker could not artificially cause the node to reject transactions early. And if the node were already under load, then any message for which `is_forward_tx` returns true would be dropped, so again the attacker accomplishes nothing.

I think the worst thing an attacker could do would be to send many well-formed transaction messages (so they deserialize correctly) that are nonetheless not valid transactions. Then the node would stop accepting any more transaction messages until the next block, but no new transactions would have been added to the mempool. Maybe we need more logic to ban peers which attempt such attacks?

@@ -641,6 +659,13 @@ impl StreamHandler<Result<Vec<u8>, ReasonForBan>> for Peer {
return;
}
};
if let PeerMessage::Routed(rm) = &peer_msg {
Collaborator

Why not `if let PeerMessage::Routed(RoutedMessageBody::ForwardTx(_)) = &peer_msg`?

Contributor Author

Done in ddacef7


pub fn is_forward_tx(bytes: &[u8]) -> bool {
// PeerMessage::Routed variant == 13
if bytes[0] == 13 {
Collaborator

Where do we check bytes length?

Collaborator

good point.

Contributor Author

Good catch! Addressed in 1af9bf0

Comment on lines +88 to +105
let peer_message_variant = *bytes.get(0)?;

// PeerMessage::Routed variant == 13
if peer_message_variant != 13 {
return Some(false);
}

let target_field_variant = *bytes.get(1)?;
let target_field_len = if target_field_variant == 0 {
// PeerIdOrHash::PeerId
let peer_id_variant = *bytes.get(2)?;
peer_id_type_field_len(peer_id_variant)?
} else if target_field_variant == 1 {
// PeerIdOrHash::Hash is always 32 bytes
32
} else {
return None;
};
Collaborator

This is error-prone. It implicitly relies on enum variant order, Borsh serialization details, and the layout of the structures involved. There are so many things that can go wrong if we change anything in any of the underlying structures that I am not sure it is worth the trouble. (As to the proposal to extend Borsh, I wonder what that could look like, but at least that would be maintainable and reliable.)

Can we decouple deserialization into a separate actor, so we can scale it and not block this actor on deserialization while it does its job?

Contributor Author

Yes, as I said in the PR description, this solution is a little hacky and very brittle. However, I don't think there is another solution at this time. Putting deserialization into another actor does not help unless we have different actors for different message types (which would still require some way to know the message type before deserializing). The issue is not that the peer actor becomes blocked; it is that if we deserialize messages in the order they are received, then we introduce a lot of latency on important messages (e.g. block approvals) due to long runs of many transaction messages in a row.

That said, as a temporary solution I think this is ok, because it is unlikely that the various things this code depends on (enum order, Borsh details, etc.) will change, since any such change would be a protocol change. All nodes must agree on how to communicate with each other, so changing a message format is a protocol change, and I think that should happen pretty rarely. I still agree that this code is a nightmare to maintain, so we should not keep it around in the long term, but as a band-aid fix for a DoS vector, I think it does the job.

Collaborator
@SkidanovAlex SkidanovAlex Sep 16, 2020

@frol to partially address your concern, this function is covered by a test that is part of CI, so we should spot the issue if the message format changes.
(Also, changing the message format at this time is extremely unpleasant, because nodes running future protocol versions must be able to talk to nodes running past versions.)

@nearprotocol-bulldozer nearprotocol-bulldozer bot merged commit 8d2c66a into near:master Sep 16, 2020
@birchmd birchmd deleted the limit-tx-over-network branch September 16, 2020 18:09
chefsale pushed a commit that referenced this pull request Sep 21, 2020
…received block message (#3317)

Development

Successfully merging this pull request may close these issues.

Sending large quantities of transactions slows the network down