rfc: Banning peers based on ResponseCheckTx #78
Conversation
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Some updates on disconnecting from a peer that repeatedly sends TXs that fail
After discussions with @jmalicevic and @ancazamfir, here are a couple of conclusions:
@jmalicevic and I were talking about this synchronously today, and I was wondering about the case where an attacker discovers a particular transaction that they know would be accepted, and therefore propagated, by >1/3 of the voting power on the network, but rejected by the rest. This, as you mentioned before @sergio-mena, would result in halting the network for the period for which we blacklist "misconfigured" peers, because >1/3 of the voting power would be blacklisted by the remaining peers, right? I can easily imagine a situation where >1/3 of the voting power on a network has, for example, a minimum transaction fee requirement much lower than that of the remaining nodes. If application developers return a

Overall, this approach seems to me to effectively give application developers more control, but simultaneously more ways in which to potentially halt their networks. cc @ValarDragon
I find the sentence, or concept, of a transaction being "accepted, and therefore propagated, by >1/3 of the voting power on the network" a little debatable. The reason is that a validator node is not really expected to propagate transactions. In fact, the goal of the mempool is to allow transactions to be submitted to nodes that are not validators, with some guarantees that those transactions will eventually reach a correct validator. Notice that I understand that the sentence, expression, or concept above could in practice be translated to nodes that are controlled (because used as sentry nodes) by validators with 1/3+ of the voting power of the network. But also in this case, this doesn't mean that transactions will not eventually reach a correct validator, as the propagation of transactions is not, or at least has never been formalized as, a process that requires 2/3+ of the (mempool) participants to be correct...
But validators still propagate the transactions to their peers, even though tx propagation was not designed with them as the main gossipers? I think what Thane is saying is that, if a validator propagates the tx to a validator who will not accept it, the one that will not accept it will disconnect from the sender and thus potentially partition the network if this happens often enough. Although, at a high level, is this a setting that we should even account for? This is a very messy configuration and it is debatable whether it should be allowed to exist.
Validators are rarely connected to each other. Usually they are connected only to their sentry nodes, which should run the same code (version of the application) as the validator itself. So a validator should never receive a transaction that would be rejected by its application, as the transaction will almost always come from a node under its control, running the same code. The discussion here, in my opinion, should be restricted to public nodes, not including the validators.
And the consequences of this restriction to non-validator, public nodes should not be minimized. To reach consensus we need a large enough majority of correct nodes to participate (2/3+ in a Byzantine setup, 1/2+ in benign setups). To reliably broadcast a message in a network, it is enough to involve one correct node (1/3+ of the nodes in a Byzantine setup, a single node in benign setups). Consensus is harder than reliably disseminating messages.
Apart from the same app code, sentry nodes and validators should also have the same app configuration. The problem in the issue mentioned above on the Gaia testnet was about having different values for
Is there still anything outstanding to do here? If not, please renumber with an RFC number >= 100 and merge.
Remember, RFCs are useful for explorations of the problem space, which I think this RFC does really well. We don't need to make any decisions yet as to implementation - that can come later as our priorities allow, and this will serve as a valuable reference then if similar discussions come up again.
Left some suggestions, in particular renaming to CometBFT.
- [Unknown message type](https://github.com/tendermint/tendermint/blob/ff0f98892f24aac11e46aeff2b6d2c0ad816701a/mempool/v0/reactor.go#L184)

However, disconnecting from a peer is not the same as banning the peer. The p2p layer will close the connection but
the peer can reconnect without any penalty, and if it is a persistent peer, a reconnect will be initiated
Not clear here who is persistent peer of whom.
Clarified, will push new version shortly.
explicit control on when a reconnect attempt will be triggered.

The application can blacklist peers via ABCI if the
[`filterPeers`](../../spec/abci/abci%2B%2B_app_requirements.md#peer-filtering)
Suggested change:
[`filterPeers`](https://github.com/cometbft/cometbft/blob/main/spec/abci/abci%2B%2B_app_requirements.md#peer-filtering)
Are you sure that a flag is needed?
Are you sure we should refer to main rather than to the relative path? In case the spec changes on main at some point... although I guess the RFCs will always be on main?
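As background on the mechanism being linked: a minimal sketch of an application-side peer-filtering hook. The query paths follow the ABCI app-requirements spec, but the `bannedIDs` blacklist is an assumption, and the exact `Query` signature varies across ABCI versions (this uses the pre-ABCI-2.0 style):

```go
package app

import (
	"strings"

	abcitypes "github.com/cometbft/cometbft/abci/types"
)

// Application is a hypothetical ABCI app keeping a blacklist of node IDs.
type Application struct {
	abcitypes.BaseApplication
	bannedIDs map[string]bool // node IDs the app refuses to peer with
}

// Query implements the peer-filtering hook. With filter_peers enabled in the
// node config, CometBFT queries the app with paths /p2p/filter/addr/<IP:PORT>
// and /p2p/filter/id/<node-ID>; a non-zero response code rejects the peer.
func (app *Application) Query(req abcitypes.RequestQuery) abcitypes.ResponseQuery {
	if strings.HasPrefix(req.Path, "/p2p/filter/id/") {
		id := strings.TrimPrefix(req.Path, "/p2p/filter/id/")
		if app.bannedIDs[id] {
			return abcitypes.ResponseQuery{Code: 1, Log: "peer is blacklisted"}
		}
	}
	return abcitypes.ResponseQuery{Code: abcitypes.CodeTypeOK}
}
```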
We propose two ways to implement peer banning based on the result of `CheckTx`:

1. Introduce banning when transactions are received
2. Adapt the recheck logic to support this
The problem with 2. is that we don't know whom to punish/ban in case the transaction has been received from multiple peers.
You can ban all of them, as you store all the peers you received a transaction from... But you only ban when `CheckTx` fails, and from there on you have two possibilities (a rough sketch of the first follows below):
- The transaction is still in the cache as a failed tx and you ban peers who keep sending it. (I have to check whether this is mentioned in the RFC, and whether we store the failure code in the cache... for this to work we would need to.)
- The transaction is not in the cache, so you rerun `CheckTx` and ban the peer.
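A rough Go sketch of the first possibility (all names here are illustrative, not the actual mempool API): remember the failure code and the set of senders, and ban every recorded sender when `CheckTx` reports a never-valid transaction.

```go
package mempoolsketch

// Illustrative types; the real mempool has its own tx keys and peer IDs.
type (
	TxKey  [32]byte
	PeerID string
)

// codeNeverValid is a hypothetical reserved CheckTx response code.
const codeNeverValid uint32 = 999

// Mempool holds the minimal state needed for this sketch; maps are assumed
// to be initialized by a constructor.
type Mempool struct {
	senders map[TxKey]map[PeerID]struct{} // peers each tx was received from
	failed  map[TxKey]uint32              // cached CheckTx failure codes
	banned  map[PeerID]bool
}

// onCheckTxFailed bans every recorded sender of a transaction the application
// declared never valid, and remembers the code so a later resend can be
// punished without rerunning CheckTx.
func (mem *Mempool) onCheckTxFailed(key TxKey, code uint32) {
	if code != codeNeverValid {
		return
	}
	for p := range mem.senders[key] {
		mem.banned[p] = true
	}
	mem.failed[key] = code
}
```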
- We do keep it in the cache as long as possible to avoid running `CheckTx` on it, because we know, for sure, that it will never be valid. As it is rare enough, it
might not take that much space. In this case though, as we ban the sending peer immediately, we can save space by not storing peer information for this transaction.

The question is which one is more costly: doing `CheckTx` more than once, or keeping an extra entry in the cache?
The space used in the cache is not really a problem. The cache has a fixed size, and should store hashes. The point here is that we are using one of the limited entries of the cache for an invalid transaction. Notice that, as time passes, all entries in the cache are replaced by more recently seen ones.
@thanethomson, @cason, I will document our discussion and current take on things and merge this RFC. If we decide to implement it, I will edit it with the corresponding info. But I have one more point I would like to raise before closing it. From the current discussion, it seems that implementing it raises too many concerns and we did not get sufficiently concrete examples where it would benefit. However, based on the comment made by @sergio-mena above, I was thinking that adding at least a comment on a transaction failing

To facilitate the debugging in a case of b), we could introduce a special code for a transaction that could never have been valid (say code 999) but not ban peers as a result, rather output this log message.
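A minimal sketch of that log-only behaviour, assuming the hypothetical code 999 and an illustrative message; nothing here is an agreed-upon interface:

```go
package mempoolsketch

import "log"

// codeNeverValidTx is the hypothetical reserved CheckTx code mentioned above.
const codeNeverValidTx uint32 = 999

// logNeverValidTx outputs a diagnostic line instead of banning the sender when
// the application reports that a transaction could never have been valid.
func logNeverValidTx(logger *log.Logger, txHash []byte, sender string, code uint32) {
	if code == codeNeverValidTx {
		logger.Printf("received never-valid tx %X from peer %s (possible app version/config mismatch)", txHash, sender)
	}
}
```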
**Discussion**
The problem with supporting this scenario is the definition of the above-mentioned parameters. It is very hard to estimate, at the CometBFT level, what these parameters should be. A possible solution is
I agree, I'm generally concerned about how we want to reason about this. I think it's probably useful to iterate towards this, but also expose configs so node operators can adjust.
Also, this type of ban should likely always be time-bound.
The question is whether this transaction should be kept track of in the cache. We can still store it in
the cache so that we don't run `CheckTx` on it again, but if this peer is immediately banned, maybe there is no need
to store its information.
My impression is it doesn't need to be in the cache, since anyone who sends it gets a notable punishment, and must keep establishing new peer connections to DOS.
If they were making a significant effort to establish new peer connections, I believe that they could, with ease, create new malicious payloads with costly compute that get banned (rather than reuse the same payload).
Awesome! Glad peer banning for never-valid txs seems like it's happening! Can start mitigating a lot of bad-signature DOS concerns!
Thanks for the reviews @ValarDragon! Can you tell us more about the DOS you mentioned? Have you seen this often, nodes sending TXs with bad signatures?
If a node sends transactions that fail `CheckTx` but could be valid at some point, a peer should not be banned the first time this happens.
Only if this happens frequently enough should this be considered as spam. To define this behaviour we keep track of how many times (`numFailures`) a peer
sent us invalid transactions within a time interval (`lastFailure`). This time interval should be reset every `failureResetInterval`.
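For concreteness, a small sketch of this per-peer bookkeeping; `numFailures`, `lastFailure`, and `failureResetInterval` come from the text above, while the threshold name and the concrete values are assumptions:

```go
package mempoolsketch

import "time"

const (
	failureResetInterval = 10 * time.Minute // hypothetical value
	maxFailuresPerWindow = 100              // hypothetical threshold
)

// peerStats tracks how often a peer sent us transactions that failed CheckTx.
type peerStats struct {
	numFailures int       // failures seen in the current window
	lastFailure time.Time // start of the current observation window
}

// recordFailure registers one failed CheckTx from this peer and reports
// whether the peer crossed the spam threshold within the current window.
func (s *peerStats) recordFailure(now time.Time) bool {
	if now.Sub(s.lastFailure) > failureResetInterval {
		s.numFailures = 0 // window elapsed: reset the counter
		s.lastFailure = now
	}
	s.numFailures++
	return s.numFailures > maxFailuresPerWindow
}
```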
A smart attacker could tailor its attacks to not go over the limit within `failureResetInterval`.
Maybe a counter with some form of decay would be better. For example, increment and double the counter on each error, decrement it on each `failureResetInterval`, and ban when the counter reaches `maxAllowedFailures`, for an aggressive-banning and slow-to-pardon approach.
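A minimal Go sketch of that decaying counter, assuming hypothetical names (`failureScore`, `maxAllowedFailures`) and that some component calls `OnFailure` for each failed `CheckTx` and `OnIntervalTick` once per `failureResetInterval`:

```go
package mempoolsketch

// maxAllowedFailures is a hypothetical ban threshold for the decayed score.
const maxAllowedFailures = 64

// failureScore implements the "aggressive banning, slow to pardon" idea:
// the score grows multiplicatively on each failure and decays linearly once
// per failureResetInterval.
type failureScore struct {
	score int
}

// OnFailure is called each time the peer sends a tx that fails CheckTx.
// It returns true when the peer should be banned.
func (f *failureScore) OnFailure() bool {
	f.score = f.score*2 + 1 // "increment and double" on each error
	return f.score >= maxAllowedFailures
}

// OnIntervalTick is called once per failureResetInterval and slowly pardons.
func (f *failureScore) OnIntervalTick() {
	if f.score > 0 {
		f.score--
	}
}
```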
Should we keep transactions that could never have been valid in the cache? Assuming that receiving such transactions is rare, and the peer that sent them is banned, do we need to occupy space in the mempool cache with these transactions?

- What if nodes run different versions of CometBFT and banning is not supported in one of the versions?
Being able to ban bad actors is a desirable feature and serves as incentive to upgrade.
The initial proposal is to reserve a special response code to indicate that the transaction could never have been valid.
Due to concerns of this being a breaking change for applications that have already reserved this code for internal
purposes, there is an alternative implementation: expanding `ResponseCheckTx` with an additional field.
This field, `neverValidTx`, would be `false` by default. If a transaction could never have been valid,
Instead of a boolean, maybe use an extended error code, which may be used for indicating other scenarios that could show up later.
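To illustrate the two options on the table, here is a rough Go sketch; neither the `NeverValidTx` field placement nor the `TxValidity` enum is an agreed-upon API — they just mirror the RFC text and this comment:

```go
package abcisketch

// Variant 1 (RFC text above): a dedicated boolean on the CheckTx response.
type ResponseCheckTx struct {
	Code uint32
	// ... existing ResponseCheckTx fields elided ...
	NeverValidTx bool // true if the tx could never have been valid in any state
}

// Variant 2 (this comment): an extended validity code, leaving room for other
// "meta" verdicts that could show up later.
type TxValidity int32

const (
	TxValidityUnknown    TxValidity = 0 // default: no extra information
	TxValidityNeverValid TxValidity = 1 // the tx could never have been valid
)
```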
As said, this code will [never be executed](https://github.com/tendermint/tendermint/blob/ff0f98892f24aac11e46aeff2b6d2c0ad816701a/mempool/v0/clist_mempool.go#L239) for transactions whose signature is found
in the cache.
Instead of remembering the cached transactions, we could have had a valid/invalid bit per transaction within the cache. As transactions themselves do not
Can we keep the error code from the first check? This way, if the transaction shows up again we can avoid rechecking it and still return the same error code as before.
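A sketch of what that could look like in the mempool cache, under hypothetical names; the idea is simply to store the first `CheckTx` error code next to the cached hash so a repeated submission can be answered without rerunning `CheckTx`:

```go
package mempoolsketch

// cacheEntry remembers the outcome of the first CheckTx evaluation of a tx.
type cacheEntry struct {
	code uint32 // 0 (OK) or the error code returned by the first CheckTx
}

// txCache maps tx hashes to their first CheckTx result; bounded/LRU in practice.
type txCache struct {
	entries map[[32]byte]cacheEntry
}

// Lookup returns the stored code and whether the tx has been seen before,
// letting the mempool skip CheckTx and return the same error code as before.
func (c *txCache) Lookup(hash [32]byte) (uint32, bool) {
	e, ok := c.entries[hash]
	return e.code, ok
}
```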
However, this approach has several downsides:
- It is not timely enough. Recheck is executed after a block is committed, leaving room for a bad peer to send
us transactions the entire time between two blocks.
- If we want to keep track of when peers sent us a traansaction and punish them only if the misbehaviour happens
Suggested change:
- If we want to keep track of when peers sent us a transaction and punish them only if the misbehaviour happens
frequently enough, this approach makes it hard to keep track of when exactly a transaction was submitted.
- Rechecking if optional and node operators can disable it.
Suggested change:
- Rechecking is optional and node operators can disable it.
On the plus side, this would avoid adding new logic to the mempool caching mechanism and keeping additional information about
transaction validity. But we would still have to keep the information on peers and the frequency at which they send us bad transactions.

Transactions that became invalid on recheck should not be cause for peer banning, as they have not been gossiped as invalid transactions.
This is not necessarily true. With prepare and process proposal, bad actors could send transactions that are valid but not ripe for proposal, which would always become invalid later. I think only the app can be the judge of it.
@jmalicevic I haven't seen it in production, but I am consistently paranoid about this happening any day. Basically spamming txs over p2p with 100's of signatures with one being invalid (e.g. as offered by IBC), and taking all my nodes' compute resources to verify these. We don't want these to get on-chain, so as not to consume all chain resources, thus it's a costless attack vector today. (Traditionally this would be protected against via gas / fees, but IBC is a case where we made a standard not to do this.)
Original PR: tendermint/tendermint#9675. Please look at the comments from the original PR as there was a lot of discussion on this already.
This RFC captures the changes required to implement the proposals in tendermint/tendermint#7918, tendermint/tendermint#2185, and tendermint/tendermint#6523. The gist of the document is that the mempool should be able to trigger banning of peers if they either send us transactions that fail too often, or transactions that fail with a particular code indicating that the application has determined the transaction to never be valid, in any state.
It closes #66 and, if accepted, will be followed by implementing the proposed changes.