
rfc: Banning peers based on ResponseCheckTx #78

Merged: 28 commits into main from jasmina/p2p-bad-peers on Mar 29, 2023

Conversation

@jmalicevic (Contributor) commented Jan 5, 2023

Original PR: tendermint/tendermint#9675. Please look at the comments from the original PR as there was a lot of discussion on this already.


This RFC captures the changes required to implement the proposals in tendermint/tendermint#7918, tendermint/tendermint#2185, and tendermint/tendermint#6523. The gist of the document is that the mempool should be able to trigger banning of peers if they either send us transactions that fail too often, or send transactions that fail with a particular code indicating that the application has determined the transaction could never be valid, in any state.

It closes #66 and, if accepted, will be followed by implementing the proposed changes.

@jmalicevic self-assigned this on Jan 5, 2023
@jmalicevic requested a review from a team as a code owner on January 5, 2023 11:58
@github-actions (bot) commented:
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions bot added the `stale` label on Jan 16, 2023
@lasarojc removed the `stale` label on Jan 16, 2023
@sergio-mena (Contributor) commented Jan 27, 2023

Some updates on disconnecting from a peer that repeatedly sends TXs that fail CheckTx.
There was an issue on a Gaia testnet recently where the app.toml config was inconsistent across nodes:

  • This inconsistency made some TXs pass CheckTx on some nodes and fail on others (all validators were rejecting them). It looks like the "right" config was to reject them
  • As a result we had a functional but idle blockchain, producing mostly empty blocks
  • Sometimes a user would submit a transaction not affected by the inconsistent config, and the TX would make it quite quickly into a block -- no problem there
  • However, nodes accepting some TXs (their CheckTx returning OK) because of their misconfigured app.toml had thousands of those TXs sitting in their mempool; they were constantly broadcasting them to other nodes and re-running CheckTx upon delivery of every (mostly empty) block, causing those nodes to lag behind

After discussions with @jmalicevic and @ancazamfir, here are a couple of conclusions:

  • Consider the idea in this RFC of a node disconnecting from a peer if the latter sends too many TXs failing CheckTx
    • In the situation described above, all misconfigured nodes would eventually have found themselves disconnected from the rest of the network, thus
      • potentially partitioning the network, and thus
      • halting the chain
    • Anca convinced me that this would have been a better outcome than the one we actually had (some nodes stuck with thousands of transactions in their mempools)
  • It would be a good idea to define a "WARNING" p2p message to be sent to the sender of a TX if the TX fails CheckTx. Something like: "you sent me a TX that failed my CheckTx, how come it passed yours?" (see the sketch after this list)
    • A node receiving this message would log a warning aimed at its operator. A Prometheus metric would also be desirable
    • This message/metric could be introduced even if no further action was taken (no disconnection)
    • In the Gaia testnet situation described above, this message would have saved us roughly one day of troubleshooting
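For concreteness, here is a minimal Go sketch of what such a message and its handling could look like. Everything here is hypothetical: the message type, handler, and metric do not exist in CometBFT; they only illustrate the proposal.

```go
package main

import "log"

// TxFailedWarning is a hypothetical mempool p2p message: it tells the peer
// that sent us a transaction that the transaction failed our CheckTx.
type TxFailedWarning struct {
	TxHash []byte // hash of the offending transaction
	Code   uint32 // CheckTx response code on the receiving node
}

// On receipt, the node only logs a warning aimed at its operator (and would
// bump a Prometheus counter). Crucially, no disconnection or ban happens.
func handleTxFailedWarning(peerID string, msg TxFailedWarning) {
	log.Printf("WARN: peer %s reports our tx %X failed its CheckTx (code=%d); "+
		"check app config consistency across nodes", peerID, msg.TxHash, msg.Code)
}

func main() {
	handleTxFailedWarning("node42", TxFailedWarning{TxHash: []byte{0xAB, 0xCD}, Code: 12})
}
```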

@thanethomson added the `documentation` label and removed the `docs` label on Feb 9, 2023
@thanethomson (Contributor) commented Mar 7, 2023

@jmalicevic and I were talking about this synchronously today and I was wondering about the case where an attacker discovers a particular transaction that they know would be accepted, and therefore propagated, by >1/3 of the voting power on the network, but rejected by the rest. This, as you mentioned before @sergio-mena, would result in halting the network for the period for which we blacklist "misconfigured" peers, because >1/3 of the voting power would be blacklisted by the remaining peers, right?

I can easily imagine a situation where >1/3 of the voting power on a network has, for example, a minimum transaction fee requirement much lower than the remaining nodes. If application developers return a NEVER COULD HAVE BEEN VALID value from CheckTx here, they could halt their network.

Overall, this approach seems to me to effectively give application developers more control, but simultaneously more ways in which to potentially halt their networks.

cc @ValarDragon

@cason (Collaborator) commented Mar 8, 2023

I find the sentence, or concept, of a transaction being "accepted, and therefore propagated, by >1/3 of the voting power on the network" a little debatable. The reason is that a validator node is not really expected to propagate transactions. In fact, the goal of the mempool is to allow transactions to be submitted to nodes that are not validators, with some guarantees that those transactions will eventually reach a correct validator.

Notice that I understand that the sentence, expression, or concept above could translate in practice to nodes that are controlled (because they are used as sentry nodes) by validators with 1/3+ of the voting power of the network. But even in this case, this doesn't mean that transactions will not eventually reach a correct validator, as the propagation of transactions is not, or at least has never been formalized as, a process that requires 2/3+ of the (mempool) participants to be correct...

@jmalicevic (Contributor, Author) commented:

But validators still propagate transactions to their peers, even though tx propagation was not designed with them as the main gossipers? What Thane is saying, I think, is that if a validator propagates a tx to a validator that will not accept it, the latter will disconnect from the sender and thus potentially partition the network, if this happens often enough.

Although, at a high level, is this a setting that we should even account for? This is a very messy configuration and it is debatable whether it should be allowed to exist.

@cason (Collaborator) commented Mar 8, 2023

> if a validator propagates the tx to a validator

Validators are rarely connected to each other. Usually they are connected only to their sentry nodes, which should run the same code (version of the application) as the validator itself. So a validator should never receive a transaction that would be rejected by its application, as the transaction will almost always come from a node under its control, running the same code.

The discussion here, in my opinion, should be restricted to public nodes, not including the validators.

@cason (Collaborator) commented Mar 8, 2023

And the consequences of this restriction to non-validators and public nodes are not minor.

To reach consensus we need a majority of the correct nodes to participate (2/3+ in Byzantine setups, 1/2+ in benign setups). To reliably broadcast a message in a network, it is enough to involve one correct node (1/3+ of the nodes in Byzantine setups, a single node in benign setups). Consensus is harder than reliably disseminating messages.

@hvanz (Member) commented Mar 8, 2023

> their sentry nodes, which should run the same code (version of the application) as the validator itself. So a validator should never receive a transaction that would be rejected by its application, as the transaction will almost always come from a node under its control, running the same code.

Apart from the same app code, sentry nodes and validators should also have the same app configuration. The problem in the Gaia testnet issue mentioned above was having different values for minimum-gas-prices. Because of this misconfiguration, it could have happened that sentry nodes were accepting txs that their validator was rejecting.

@thanethomson (Contributor) left a comment:

Is there still anything outstanding to do here? If not, please renumber with an RFC number >= 100 and merge.

Remember, RFCs are useful for explorations of the problem space, which I think this RFC does really well. We don't need to make any decisions yet as to implementation - that can come later as our priorities allow, and this will serve as a valuable reference then if similar discussions come up again.

@cason (Collaborator) left a comment:

Left some suggestions, in particular renaming to CometBFT.

Review threads on docs/rfc/rfc-026-p2p-bad-peers-checktx.md:
- [Unknown message type](https://github.com/tendermint/tendermint/blob/ff0f98892f24aac11e46aeff2b6d2c0ad816701a/mempool/v0/reactor.go#L184)

However, disconnecting from a peer is not the same as banning the peer. The p2p layer will close the connection but
the peer can reconnect without any penalty, and if it is a persistent peer, a reconnect will be initiated.
A Collaborator commented:
Not clear here who is persistent peer of whom.

@jmalicevic (Author) replied:
Clarified, will push new version shortly.
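To illustrate the distinction the excerpt draws between disconnecting and banning, here is a self-contained sketch; the types are deliberate stand-ins and do not mirror the actual CometBFT p2p API:

```go
package main

import (
	"fmt"
	"time"
)

// peerBook is a stand-in for whatever component would track banned peers.
type peerBook struct {
	bannedUntil map[string]time.Time // peer ID -> end of ban
}

// disconnect only closes the connection. A persistent peer will simply be
// redialed, so on its own this carries no lasting penalty.
func (pb *peerBook) disconnect(peerID string) {
	fmt.Println("closing connection to", peerID)
}

// ban disconnects *and* refuses reconnects until the ban expires, which is
// the behaviour the RFC is actually after.
func (pb *peerBook) ban(peerID string, d time.Duration) {
	pb.disconnect(peerID)
	pb.bannedUntil[peerID] = time.Now().Add(d)
}

// allowInbound is consulted before accepting a connection (or redialing).
func (pb *peerBook) allowInbound(peerID string) bool {
	return time.Now().After(pb.bannedUntil[peerID])
}

func main() {
	pb := &peerBook{bannedUntil: map[string]time.Time{}}
	pb.ban("peer1", 10*time.Minute)
	fmt.Println("peer1 allowed back?", pb.allowInbound("peer1")) // false
}
```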

explicit control on when a reconnect attempt will be triggered.

The application can blacklist peers via ABCI if the
[`filterPeers`](../../spec/abci/abci%2B%2B_app_requirements.md#peer-filtering)
A Collaborator commented:
Suggested change
[`filterPeers`](../../spec/abci/abci%2B%2B_app_requirements.md#peer-filtering)
[`filterPeers`](https://github.com/cometbft/cometbft/blob/main/spec/abci/abci%2B%2B_app_requirements.md#peer-filtering)

A Collaborator commented:
Are you sure that a flag is needed?

@jmalicevic (Author) replied:
Are you sure we should refer to main rather than to the relative path, in case the spec changes on main at some point? Although I guess the RFCs will always be on main?

We propose two ways to implement peer banning based on the result of `CheckTx`:

1. Introduce banning when transactions are received
2. Adapt the recheck logic to support this
A Collaborator commented:
The problem with 2. is that we don't know whom to punish/ban in case the transaction has been received from multiple peers.

@jmalicevic (Author) replied:
You can ban all of them, as you store all the peers you received a transaction from... But you ban only when CheckTx fails, and from there on you have two possibilities:

  • The transaction is still in the cache as a failed tx and you ban peers who keep sending it. (I have to check whether this is mentioned in the RFC, and whether we store the failure code in the cache... for this to work we would need to.)
  • The transaction is not in the cache, so you rerun CheckTx and ban the peer.

- We do keep it in the cache as long as possible to avoid running `CheckTx` on it, because we know, for sure, that it will never be valid. As it is rare enough, it
might not take that much space. In this case though, as we ban the sending peer immediately, we can save space by not storing peer information for this transaction.

The question is which one is more costly: running `CheckTx` more than once, or keeping an extra entry in the cache?
A Collaborator commented:
The space used in the cache is not really a problem. The cache has a fixed size, and should store hashes. The point here is that we are using one of the limited entries of the cache for an invalid transaction. Notice that as time passes, all entries in the cache are replaced by more recently seen ones.
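A minimal sketch of what "storing the failure code in the cache" could look like: a fixed-size LRU keyed by tx hash that remembers the first CheckTx code, so a resubmitted invalid tx can be judged, and its sender punished, without rerunning CheckTx. This is illustrative only and does not mirror the actual mempool cache:

```go
package main

import (
	"container/list"
	"fmt"
)

// txCache is a fixed-size LRU mapping tx hash -> first CheckTx response code.
type txCache struct {
	size  int
	order *list.List               // LRU order; front = most recently seen
	elems map[string]*list.Element // tx hash -> element holding an *entry
}

type entry struct {
	hash string
	code uint32 // CheckTx code from the first evaluation (0 = OK)
}

func newTxCache(size int) *txCache {
	return &txCache{size: size, order: list.New(), elems: map[string]*list.Element{}}
}

// Push records a tx and its code, evicting the least recently seen entry
// when the cache is full (as in the real cache, old entries get replaced).
func (c *txCache) Push(hash string, code uint32) {
	if c.order.Len() >= c.size {
		oldest := c.order.Back()
		delete(c.elems, oldest.Value.(*entry).hash)
		c.order.Remove(oldest)
	}
	c.elems[hash] = c.order.PushFront(&entry{hash: hash, code: code})
}

// Get reports whether we saw this tx before and with which code.
func (c *txCache) Get(hash string) (code uint32, seen bool) {
	el, ok := c.elems[hash]
	if !ok {
		return 0, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).code, true
}

func main() {
	c := newTxCache(2)
	c.Push("aa", 1) // tx "aa" failed CheckTx with code 1
	if code, seen := c.Get("aa"); seen && code != 0 {
		fmt.Println("resubmitted known-bad tx; punish the sender, code =", code)
	}
}
```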

@jmalicevic (Contributor, Author) commented:

@thanethomson, @cason, I will document our discussion and current take on things and merge this RFC. If we decide to implement it, I will edit it with the corresponding info. But I have one more point I would like to raise before closing it.

From the current discussion, it seems that implementing this raises too many concerns, and we did not get enough concrete examples where it would be beneficial.

However, based on the comment made by @sergio-mena above, I was thinking that at least logging a warning when a transaction fails CheckTx might help. But it is absolutely possible for a node to have gossiped a tx that became invalid due to a state change, no? This would create a lot of noise again. So we distinguish two cases:
a) a tx fails CheckTx because of a state change, and
b) a tx fails due to a misconfiguration.

To facilitate debugging in case b), we could introduce a special code for a transaction that could never have been valid (say code 999), but not ban peers as a result; rather, output this log message.
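A sketch of how the receive path could distinguish the two cases; `CodeNeverValidTx = 999` is the hypothetical value from the comment above, not a code CometBFT reserves:

```go
package main

import "log"

// Hypothetical reserved code from the comment above; CometBFT does not
// reserve any such value today.
const CodeNeverValidTx uint32 = 999

// onCheckTxResult sketches the receive path distinguishing case a), a tx
// invalidated by a state change, from case b), a never-valid tx.
func onCheckTxResult(peerID string, code uint32) {
	switch code {
	case 0:
		// valid: add to the mempool and gossip as usual
	case CodeNeverValidTx:
		// case b): almost certainly a misconfiguration or a bad peer;
		// log loudly for the operator instead of banning.
		log.Printf("WARN: peer %s sent a never-valid tx (code %d); "+
			"check for inconsistent app configuration", peerID, code)
	default:
		// case a): ordinary failure, possibly due to a state change; stay quiet.
	}
}

func main() {
	onCheckTxResult("peer7", CodeNeverValidTx)
}
```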


**Discussion**

The problem with supporting this scenario is the definition of the above mentioned parameters. It is very hard to estimate, at the CometBFT level, what these parameters should be. A possible solution is
A Collaborator commented:
I agree. I'm generally concerned about how we want to reason about this. I think it's probably useful to iterate towards this, but also to expose configs so node operators can adjust.

Also, this type of ban should likely always be time-bound.

Comment on lines +172 to +174
The question is whether this transaction should be kept track of in the cache? We can still store it in
the cache so that we don't run `CheckTx` on it again, but if this peer is immediately banned, maybe there is no need
to store its information.
A Collaborator commented:
My impression is it doesn't need to be in the cache, since anyone who sends it gets a notable punishment, and must keep establishing new peer connections to DOS.

If they were making a significant effort to establish new peer connections, I believe they could easily create new malicious payloads with costly compute that get banned (rather than reuse the same payload).

@ValarDragon (Collaborator) left a comment:

Awesome! Glad peer banning for never-valid txs seems like it's happening! This can start mitigating a lot of bad-signature DOS concerns!

@jmalicevic (Contributor, Author) commented:

Thanks for the reviews @ValarDragon! Can you tell us more about the DOS you mentioned? Have you seen this often, nodes sending TXs with bad signatures?


If a node sends transactions that fail `CheckTx` but could be valid at some point, a peer should not be banned the first time this happens.
Only if this happens frequently enough should it be considered spam. To define this behaviour we keep track of how many times (`numFailures`) a peer
sent us invalid transactions within a time interval, along with the time of the last failure (`lastFailure`). This counter should be reset every `failureResetInterval`.
A Contributor commented:

A smart attacker could tailor their attacks to stay just under the limit within `failureResetInterval`.
Maybe a counter with some form of decay would be better. For example: increment and double the counter on each error, decrement it on each `failureResetInterval`, and ban when the counter reaches `maxAllowedFailures`, for an aggressive-banning, slow-to-pardon approach.
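A minimal sketch of one possible reading of this scheme ("increment and double" taken as `counter = counter*2 + 1`); all names are illustrative:

```go
package main

import "fmt"

// failureScore implements an aggressive-banning, slow-to-pardon counter.
type failureScore struct {
	counter            int
	maxAllowedFailures int
}

// onFailure is called whenever a peer's tx fails CheckTx. Fast growth means
// an attacker pacing failures just under a fixed rate still accumulates score.
func (s *failureScore) onFailure() (ban bool) {
	s.counter = s.counter*2 + 1
	return s.counter >= s.maxAllowedFailures
}

// onResetTick is called once per failureResetInterval: slow pardon.
func (s *failureScore) onResetTick() {
	if s.counter > 0 {
		s.counter--
	}
}

func main() {
	s := &failureScore{maxAllowedFailures: 15}
	for i := 1; i <= 4; i++ {
		if s.onFailure() { // counter: 1, 3, 7, 15
			fmt.Println("ban after failure", i)
		}
	}
}
```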


Should we keep transactions that could never have been valid in the cache? Assuming that receiving such transactions is rare, and the peer that sent them is banned, do we need to occupy space in the mempool cache with these transactions?

- What if nodes run different versions of CometBFT and banning is not supported in one of the versions?
A Contributor commented:

Being able to ban bad actors is a desirable feature and serves as incentive to upgrade.

The initial proposal is to reserve a special response code to indicate that the transaction could never have been valid.
Due to concerns of this being a breaking change for applications that have already reserved this code for internal
purposes, there is an alternative implementation: expanding `ResponseCheckTx` with an additional field.
This field `neverValidTx` would be `false` by default. If a transaction could never have been valid,
A Contributor commented:

Instead of a boolean, maybe use an extended error code, which could be used to indicate other scenarios that may show up later.
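For comparison, a sketch of the two alternatives as Go structs: the RFC's boolean flag versus the extended code suggested here. Neither field exists in ABCI's `ResponseCheckTx` today:

```go
package main

import "fmt"

// Alternative A (the RFC's proposal): a dedicated flag, false by default.
type ResponseCheckTxWithFlag struct {
	Code       uint32
	NeverValid bool
}

// Alternative B (this comment): an extensible secondary code, leaving room
// for other scenarios that could show up later.
type MempoolError uint32

const (
	MempoolErrorNone       MempoolError = 0
	MempoolErrorNeverValid MempoolError = 1
	// further values could cover future scenarios
)

type ResponseCheckTxWithMempoolError struct {
	Code         uint32
	MempoolError MempoolError
}

func main() {
	r := ResponseCheckTxWithMempoolError{Code: 5, MempoolError: MempoolErrorNeverValid}
	fmt.Println("never valid?", r.MempoolError == MempoolErrorNeverValid)
}
```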


As said, this code will [never be executed](https://github.com/tendermint/tendermint/blob/ff0f98892f24aac11e46aeff2b6d2c0ad816701a/mempool/v0/clist_mempool.go#L239) for transactions whose signature is found
in the cache.
Instead of remembering the cached transactions, we could have had a valid/invalid bit per transaction within the cache. As transactions themselves do not
A Contributor commented:

Can we keep the error code from the first check? This way, if the transaction shows up again, we can avoid rechecking it and still return the same error code as before.

However, this approach has several downsides:
- It is not timely enough. Recheck is executed after a block is committed, leaving room for a bad peer to send
us transactions the entire time between two blocks.
- If we want to keep track of when peers sent us a traansaction and punish them only if the misbehaviour happens
A Contributor commented:

Suggested change
- If we want to keep track of when peers sent us a traansaction and punish them only if the misbehaviour happens
- If we want to keep track of when peers sent us a transaction and punish them only if the misbehaviour happens

us transactions the entire time between two blocks.
- If we want to keep track of when peers sent us a traansaction and punish them only if the misbehaviour happens
frequently enough, this approach makes it hard to keep track of when exactly was a transaction submitted.
- Rechecking if optional and node operators can disable it.
A Contributor commented:

Suggested change
- Rechecking if optional and node operators can disable it.
- Rechecking is optional and node operators can disable it.

On the plus side this would avoid adding new logic to the mempool caching mechanism and keeping additional information about
transaction validity. But we would still have to keep the information on peers and the frequency at which they send us bad transactions.

Transactions that became invalid on recheck should not be cause for peer banning as they have not been gossiped as invalid transactions.
A Contributor commented:

This is not necessarily true. With prepare and process proposal, bad actors could send transactions that are valid but not ripe for proposal, which would always become invalid later. I think only the app can be the judge of it.

@ValarDragon (Collaborator) commented:

@jmalicevic I haven't seen it in production, but I am consistently paranoid about it happening any day.

Basically: spamming txs over p2p with hundreds of signatures, one of them invalid (e.g. as offered by IBC), and consuming all my nodes' compute resources to verify them. We don't want these to get on-chain, so as not to consume chain resources; thus it's a costless attack vector today. (Traditionally this would be protected against via gas/fees, but IBC is a case where we made a standard not to do this.)

@jmalicevic merged commit 38f4756 into main on Mar 29, 2023
@jmalicevic deleted the jasmina/p2p-bad-peers branch on March 29, 2023
@thanethomson added the `rfc` label and removed the `wip` label on Jul 5, 2023
cometcrafter pushed a commit to graphprotocol/cometbft that referenced this pull request Jun 1, 2024
Labels: documentation, mempool, p2p, rfc

Successfully merging this pull request may close the following issue: mempool/p2p: Research implications of peer disconnect based on ResponseCheckTx