
rfc: Banning peers based on ResponseCheckTx #78

Merged: 28 commits into main from jasmina/p2p-bad-peers on Mar 29, 2023

Conversation

@jmalicevic (Contributor) commented Jan 5, 2023

Original PR: tendermint/tendermint#9675. Please look at the comments from the original PR as there was a lot of discussion on this already.


This RFC captures the changes required to implement the proposals in tendermint/tendermint#7918, tendermint/tendermint#2185, and tendermint/tendermint#6523. The gist of the document is that the mempool should be able to trigger banning of peers if they either send us transactions that fail too often, or send transactions that fail with a particular code indicating that the application has determined the transaction could never be valid, in any state.

It closes #66 and, if accepted, will be followed by implementing the proposed changes.

@jmalicevic self-assigned this on Jan 5, 2023
@jmalicevic requested a review from a team as a code owner on January 5, 2023 11:58
@github-actions (bot) commented:
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions bot added the `stale` label on Jan 16, 2023
@lasarojc removed the `stale` label on Jan 16, 2023
@sergio-mena (Contributor) commented Jan 27, 2023

Some updates on disconnecting from a peer that repeatedly sends TXs that fail CheckTx.
There was an issue on a Gaia testnet recently where the app.toml config was inconsistent across nodes:

  • This inconsistency made some TXs pass CheckTx on some nodes and fail on others (all validators were rejecting them). It looks like the "right" config was to reject them
  • As a result we had a functional but idle blockchain, producing mostly empty blocks
  • Sometimes a user would submit a transaction not affected by the inconsistent config, and the TX would make it quite quickly into a block -- no problem there
  • However, nodes accepting some TXs (their CheckTx returning OK) because of their misconfigured app.toml had thousands of those TXs sitting in their mempool; they were constantly broadcasting them to other nodes and re-running CheckTx upon delivery of every (mostly empty) block, causing those nodes to lag behind

After discussions with @jmalicevic and @ancazamfir, here are a couple of conclusions:

  • Consider the idea in this RFC of a node disconnecting from a peer if the latter sends too many TXs failing CheckTx
    • In the situation described above, all misconfigured nodes would eventually have found themselves disconnected from the rest of the network, thus
      • potentially partitioning the network, and thus
      • halting the chain
    • Anca convinced me that this would have been a better outcome than the one we actually had (some nodes stuck with thousands of transactions in their mempools)
  • It would be a good idea to define a "WARNING" p2p message to be sent to the sender of a TX if the TX fails CheckTx. Something like: "you sent me a TX that failed my CheckTx, how come it passed yours?" (see the sketch after this list)
    • A node receiving this message would log a warning aimed at its operator. A Prometheus metric would also be desirable
    • This message/metric could be introduced even if no further action was taken (no disconnection)
    • In the Gaia testnet situation described above, this message would have saved us roughly one day of troubleshooting
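For concreteness, here is a minimal Go sketch of what such a message and its handling could look like. Everything here is hypothetical: the message type, handler, and metric do not exist in CometBFT; they only illustrate the proposal.

```go
package main

import "log"

// TxFailedWarning is a hypothetical mempool p2p message: it tells the peer
// that sent us a transaction that the transaction failed our CheckTx.
type TxFailedWarning struct {
	TxHash []byte // hash of the offending transaction
	Code   uint32 // CheckTx response code on the receiving node
}

// On receipt, the node only logs a warning aimed at its operator (and would
// bump a Prometheus counter). Crucially, no disconnection or ban happens.
func handleTxFailedWarning(peerID string, msg TxFailedWarning) {
	log.Printf("WARN: peer %s reports our tx %X failed its CheckTx (code=%d); "+
		"check app config consistency across nodes", peerID, msg.TxHash, msg.Code)
}

func main() {
	handleTxFailedWarning("node42", TxFailedWarning{TxHash: []byte{0xAB, 0xCD}, Code: 12})
}
```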

@thanethomson added the `documentation` label and removed the `docs` label on Feb 9, 2023
@thanethomson (Contributor) commented Mar 7, 2023

@jmalicevic and I were talking about this synchronously today and I was wondering about the case where an attacker discovers a particular transaction that they know would be accepted, and therefore propagated, by >1/3 of the voting power on the network, but rejected by the rest. This, as you mentioned before @sergio-mena, would result in halting the network for the period for which we blacklist "misconfigured" peers, because >1/3 of the voting power would be blacklisted by the remaining peers, right?

I can easily imagine a situation where >1/3 of the voting power on a network has, for example, a minimum transaction fee requirement much lower than the remaining nodes. If application developers return a NEVER COULD HAVE BEEN VALID value from CheckTx here, they could halt their network.

Overall, this approach seems to me to effectively give application developers more control, but simultaneously more ways in which to potentially halt their networks.

cc @ValarDragon

@cason (Collaborator) commented Mar 8, 2023

I find the sentence, or concept, of a transaction being "accepted, and therefore propagated, by >1/3 of the voting power on the network" a little debatable. The reason is that a validator node is not really expected to propagate transactions. In fact, the goal of the mempool is to allow transactions to be submitted to nodes that are not validators, with some guarantees that those transactions will eventually reach a correct validator.

Notice that I understand that the sentence, expression, or concept above could translate in practice to nodes that are controlled (because they are used as sentry nodes) by validators with 1/3+ of the voting power of the network. But even in this case, this doesn't mean that transactions will not eventually reach a correct validator, as the propagation of transactions is not, or at least has never been formalized as, a process that requires 2/3+ of the (mempool) participants to be correct...

@jmalicevic (Contributor, Author) commented:

But validators still propagate transactions to their peers, even though tx propagation was not designed with them as the main gossipers? What Thane is saying, I think, is that if a validator propagates a tx to a validator that will not accept it, the latter will disconnect from the sender and thus potentially partition the network, if this happens often enough.

Although, at a high level, is this a setting that we should even account for? This is a very messy configuration and it is debatable whether it should be allowed to exist.

@cason (Collaborator) commented Mar 8, 2023

> if a validator propagates the tx to a validator

Validators are rarely connected to each other. Usually they are connected only to their sentry nodes, which should run the same code (version of the application) as the validator itself. So a validator should never receive a transaction that would be rejected by its application, as the transaction will almost always come from a node under its control, running the same code.

The discussion here, in my opinion, should be restricted to public nodes, not including the validators.

@cason (Collaborator) commented Mar 8, 2023

And the consequences of this restriction to non-validators and public nodes are not minor.

To reach consensus we need a majority of the correct nodes to participate (2/3+ in Byzantine setups, 1/2+ in benign setups). To reliably broadcast a message in a network, it is enough to involve one correct node (1/3+ of the nodes in Byzantine setups, a single node in benign setups). Consensus is harder than reliably disseminating messages.

@hvanz (Member) commented Mar 8, 2023

> their sentry nodes, which should run the same code (version of the application) as the validator itself. So a validator should never receive a transaction that would be rejected by its application, as the transaction will almost always come from a node under its control, running the same code.

Apart from the same app code, sentry nodes and validators should also have the same app configuration. The problem in the Gaia testnet issue mentioned above was having different values for minimum-gas-prices. Because of this misconfiguration, it could have happened that sentry nodes were accepting txs that their validator was rejecting.

@thanethomson (Contributor) left a comment:

Is there still anything outstanding to do here? If not, please renumber with an RFC number >= 100 and merge.

Remember, RFCs are useful for explorations of the problem space, which I think this RFC does really well. We don't need to make any decisions yet as to implementation - that can come later as our priorities allow, and this will serve as a valuable reference then if similar discussions come up again.

@cason (Collaborator) left a comment:

Left some suggestions, in particular renaming to CometBFT.

Review threads on docs/rfc/rfc-026-p2p-bad-peers-checktx.md:
- [Unknown message type](https://github.com/tendermint/tendermint/blob/ff0f98892f24aac11e46aeff2b6d2c0ad816701a/mempool/v0/reactor.go#L184)

However, disconnecting from a peer is not the same as banning the peer. The p2p layer will close the connection but
the peer can reconnect without any penalty, and if it is a persistent peer, a reconnect will be initiated.
A Collaborator commented:
Not clear here who is persistent peer of whom.

@jmalicevic (Author) replied:
Clarified, will push new version shortly.
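To illustrate the distinction the excerpt draws between disconnecting and banning, here is a self-contained sketch; the types are deliberate stand-ins and do not mirror the actual CometBFT p2p API:

```go
package main

import (
	"fmt"
	"time"
)

// peerBook is a stand-in for whatever component would track banned peers.
type peerBook struct {
	bannedUntil map[string]time.Time // peer ID -> end of ban
}

// disconnect only closes the connection. A persistent peer will simply be
// redialed, so on its own this carries no lasting penalty.
func (pb *peerBook) disconnect(peerID string) {
	fmt.Println("closing connection to", peerID)
}

// ban disconnects *and* refuses reconnects until the ban expires, which is
// the behaviour the RFC is actually after.
func (pb *peerBook) ban(peerID string, d time.Duration) {
	pb.disconnect(peerID)
	pb.bannedUntil[peerID] = time.Now().Add(d)
}

// allowInbound is consulted before accepting a connection (or redialing).
func (pb *peerBook) allowInbound(peerID string) bool {
	return time.Now().After(pb.bannedUntil[peerID])
}

func main() {
	pb := &peerBook{bannedUntil: map[string]time.Time{}}
	pb.ban("peer1", 10*time.Minute)
	fmt.Println("peer1 allowed back?", pb.allowInbound("peer1")) // false
}
```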

explicit control on when a reconnect attempt will be triggered.

The application can blacklist peers via ABCI if the
[`filterPeers`](../../spec/abci/abci%2B%2B_app_requirements.md#peer-filtering)
A Collaborator commented:
Suggested change
[`filterPeers`](../../spec/abci/abci%2B%2B_app_requirements.md#peer-filtering)
[`filterPeers`](https://github.com/cometbft/cometbft/blob/main/spec/abci/abci%2B%2B_app_requirements.md#peer-filtering)

A Collaborator commented:
Are you sure that a flag is needed?

@jmalicevic (Author) replied:
Are you sure we should refer to main rather than to the relative path, in case the spec changes on main at some point? Although I guess the RFCs will always be on main?

We propose two ways to implement peer banning based on the result of `CheckTx`:

1. Introduce banning when transactions are received
2. Adapt the recheck logic to support this
A Collaborator commented:
The problem with 2. is that we don't know whom to punish/ban in case the transaction has been received from multiple peers.

@jmalicevic (Author) replied:
You can ban all of them, as you store all the peers you received a transaction from... But you ban only when CheckTx fails, and from there on you have two possibilities:

  • The transaction is still in the cache as a failed tx and you ban peers who keep sending it. (I have to check whether this is mentioned in the RFC, and whether we store the failure code in the cache... for this to work we would need to.)
  • The transaction is not in the cache, so you rerun CheckTx and ban the peer.

- We do keep it in the cache as long as possible to avoid running `CheckTx` on it, because we know, for sure, that it will never be valid. As it is rare enough, it
might not take that much space. In this case though, as we ban the sending peer immediately, we can save space by not storing peer information for this transaction.

The question is which one is more costly: running `CheckTx` more than once, or keeping an extra entry in the cache?
A Collaborator commented:
The space used in the cache is not really a problem. The cache has a fixed size, and should store hashes. The point here is that we are using one of the limited entries of the cache for an invalid transaction. Notice that as time passes, all entries in the cache are replaced by more recently seen ones.
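A minimal sketch of what "storing the failure code in the cache" could look like: a fixed-size LRU keyed by tx hash that remembers the first CheckTx code, so a resubmitted invalid tx can be judged, and its sender punished, without rerunning CheckTx. This is illustrative only and does not mirror the actual mempool cache:

```go
package main

import (
	"container/list"
	"fmt"
)

// txCache is a fixed-size LRU mapping tx hash -> first CheckTx response code.
type txCache struct {
	size  int
	order *list.List               // LRU order; front = most recently seen
	elems map[string]*list.Element // tx hash -> element holding an *entry
}

type entry struct {
	hash string
	code uint32 // CheckTx code from the first evaluation (0 = OK)
}

func newTxCache(size int) *txCache {
	return &txCache{size: size, order: list.New(), elems: map[string]*list.Element{}}
}

// Push records a tx and its code, evicting the least recently seen entry
// when the cache is full (as in the real cache, old entries get replaced).
func (c *txCache) Push(hash string, code uint32) {
	if c.order.Len() >= c.size {
		oldest := c.order.Back()
		delete(c.elems, oldest.Value.(*entry).hash)
		c.order.Remove(oldest)
	}
	c.elems[hash] = c.order.PushFront(&entry{hash: hash, code: code})
}

// Get reports whether we saw this tx before and with which code.
func (c *txCache) Get(hash string) (code uint32, seen bool) {
	el, ok := c.elems[hash]
	if !ok {
		return 0, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).code, true
}

func main() {
	c := newTxCache(2)
	c.Push("aa", 1) // tx "aa" failed CheckTx with code 1
	if code, seen := c.Get("aa"); seen && code != 0 {
		fmt.Println("resubmitted known-bad tx; punish the sender, code =", code)
	}
}
```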

@jmalicevic (Contributor, Author) commented:

@thanethomson, @cason, I will document our discussion and current take on things and merge this RFC. If we decide to implement it, I will edit it with the corresponding info. But I have one more point I would like to raise before closing it.

From the current discussion, it seems that implementing this raises too many concerns, and we did not get enough concrete examples where it would be beneficial.

However, based on the comment made by @sergio-mena above, I was thinking that at least logging a warning when a transaction fails CheckTx might help. But it is absolutely possible for a node to have gossiped a tx that became invalid due to a state change, no? This would create a lot of noise again. So we distinguish two cases:
a) a tx fails CheckTx because of a state change, and
b) a tx fails due to a misconfiguration.

To facilitate debugging in case b), we could introduce a special code for a transaction that could never have been valid (say code 999), but not ban peers as a result; rather, output this log message.
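A sketch of how the receive path could distinguish the two cases; `CodeNeverValidTx = 999` is the hypothetical value from the comment above, not a code CometBFT reserves:

```go
package main

import "log"

// Hypothetical reserved code from the comment above; CometBFT does not
// reserve any such value today.
const CodeNeverValidTx uint32 = 999

// onCheckTxResult sketches the receive path distinguishing case a), a tx
// invalidated by a state change, from case b), a never-valid tx.
func onCheckTxResult(peerID string, code uint32) {
	switch code {
	case 0:
		// valid: add to the mempool and gossip as usual
	case CodeNeverValidTx:
		// case b): almost certainly a misconfiguration or a bad peer;
		// log loudly for the operator instead of banning.
		log.Printf("WARN: peer %s sent a never-valid tx (code %d); "+
			"check for inconsistent app configuration", peerID, code)
	default:
		// case a): ordinary failure, possibly due to a state change; stay quiet.
	}
}

func main() {
	onCheckTxResult("peer7", CodeNeverValidTx)
}
```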


**Discussion**

The problem with supporting this scenario is the definition of the above mentioned parameters. It is very hard to estimate, at the CometBFT level, what these parameters should be. A possible solution is
A Collaborator commented:
I agree. I'm generally concerned about how we want to reason about this. I think it's probably useful to iterate towards this, but also to expose configs so node operators can adjust.

Also, this type of ban should likely always be time-bound.

Comment on lines +172 to +174
The question is whether this transaction should be kept track of in the cache? We can still store it in
the cache so that we don't run `CheckTx` on it again, but if this peer is immediately banned, maybe there is no need
to store its information.
A Collaborator commented:
My impression is it doesn't need to be in the cache, since anyone who sends it gets a notable punishment, and must keep establishing new peer connections to DOS.

If they were making a significant effort to establish new peer connections, I believe they could easily create new malicious payloads with costly compute that get banned (rather than reuse the same payload).

@ValarDragon (Collaborator) left a comment:

Awesome! Glad peer banning for never-valid txs seems like it's happening! This can start mitigating a lot of bad-signature DOS concerns!

@jmalicevic (Contributor, Author) commented:

Thanks for the reviews @ValarDragon! Can you tell us more about the DOS you mentioned? Have you seen this often, nodes sending TXs with bad signatures?


If a node sends transactions that fail `CheckTx` but could be valid at some point, a peer should not be banned the first time this happens.
Only if this happens frequently enough should it be considered spam. To define this behaviour we keep track of how many times (`numFailures`) a peer
sent us invalid transactions within a time interval, along with the time of the last failure (`lastFailure`). This counter should be reset every `failureResetInterval`.
A Contributor commented:

A smart attacker could tailor their attacks to stay just under the limit within `failureResetInterval`.
Maybe a counter with some form of decay would be better. For example: increment and double the counter on each error, decrement it on each `failureResetInterval`, and ban when the counter reaches `maxAllowedFailures`, for an aggressive-banning, slow-to-pardon approach.
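A minimal sketch of one possible reading of this scheme ("increment and double" taken as `counter = counter*2 + 1`); all names are illustrative:

```go
package main

import "fmt"

// failureScore implements an aggressive-banning, slow-to-pardon counter.
type failureScore struct {
	counter            int
	maxAllowedFailures int
}

// onFailure is called whenever a peer's tx fails CheckTx. Fast growth means
// an attacker pacing failures just under a fixed rate still accumulates score.
func (s *failureScore) onFailure() (ban bool) {
	s.counter = s.counter*2 + 1
	return s.counter >= s.maxAllowedFailures
}

// onResetTick is called once per failureResetInterval: slow pardon.
func (s *failureScore) onResetTick() {
	if s.counter > 0 {
		s.counter--
	}
}

func main() {
	s := &failureScore{maxAllowedFailures: 15}
	for i := 1; i <= 4; i++ {
		if s.onFailure() { // counter: 1, 3, 7, 15
			fmt.Println("ban after failure", i)
		}
	}
}
```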


Should we keep transactions that could never have been valid in the cache? Assuming that receiving such transactions is rare, and the peer that sent them is banned, do we need to occupy space in the mempool cache with these transactions?

- What if nodes run different versions of CometBFT and banning is not supported in one of the versions?
A Contributor commented:

Being able to ban bad actors is a desirable feature and serves as incentive to upgrade.

The initial proposal is to reserve a special response code to indicate that the transaction could never have been valid.
Due to concerns of this being a breaking change for applications that have already reserved this code for internal
purposes, there is an alternative implementation: expanding `ResponseCheckTx` with an additional field.
This field `neverValidTx` would be `false` by default. If a transaction could never have been valid,
A Contributor commented:

Instead of a boolean, maybe use an extended error code, which could be used to indicate other scenarios that may show up later.
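For comparison, a sketch of the two alternatives as Go structs: the RFC's boolean flag versus the extended code suggested here. Neither field exists in ABCI's `ResponseCheckTx` today:

```go
package main

import "fmt"

// Alternative A (the RFC's proposal): a dedicated flag, false by default.
type ResponseCheckTxWithFlag struct {
	Code       uint32
	NeverValid bool
}

// Alternative B (this comment): an extensible secondary code, leaving room
// for other scenarios that could show up later.
type MempoolError uint32

const (
	MempoolErrorNone       MempoolError = 0
	MempoolErrorNeverValid MempoolError = 1
	// further values could cover future scenarios
)

type ResponseCheckTxWithMempoolError struct {
	Code         uint32
	MempoolError MempoolError
}

func main() {
	r := ResponseCheckTxWithMempoolError{Code: 5, MempoolError: MempoolErrorNeverValid}
	fmt.Println("never valid?", r.MempoolError == MempoolErrorNeverValid)
}
```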


As said, this code will [never be executed](https://github.com/tendermint/tendermint/blob/ff0f98892f24aac11e46aeff2b6d2c0ad816701a/mempool/v0/clist_mempool.go#L239) for transactions whose signature is found
in the cache.
Instead of remembering the cached transactions, we could have had a valid/invalid bit per transaction within the cache. As transactions themselves do not
A Contributor commented:

Can we keep the error code from the first check? This way, if the transaction shows up again, we can avoid rechecking it and still return the same error code as before.

However, this approach has several downsides:
- It is not timely enough. Recheck is executed after a block is committed, leaving room for a bad peer to send
us transactions the entire time between two blocks.
- If we want to keep track of when peers sent us a traansaction and punish them only if the misbehaviour happens
A Contributor commented:

Suggested change
- If we want to keep track of when peers sent us a traansaction and punish them only if the misbehaviour happens
- If we want to keep track of when peers sent us a transaction and punish them only if the misbehaviour happens

us transactions the entire time between two blocks.
- If we want to keep track of when peers sent us a traansaction and punish them only if the misbehaviour happens
frequently enough, this approach makes it hard to keep track of when exactly was a transaction submitted.
- Rechecking if optional and node operators can disable it.
A Contributor commented:

Suggested change
- Rechecking if optional and node operators can disable it.
- Rechecking is optional and node operators can disable it.

On the plus side this would avoid adding new logic to the mempool caching mechanism and keeping additional information about
transaction validity. But we would still have to keep the information on peers and the frequency at which they send us bad transactions.

Transactions that became invalid on recheck should not be cause for peer banning as they have not been gossiped as invalid transactions.
A Contributor commented:

This is not necessarily true. With prepare and process proposal, bad actors could send transactions that are valid but not ripe for proposal, which would always become invalid later. I think only the app can be the judge of it.

@ValarDragon (Collaborator) commented:

@jmalicevic I haven't seen it in production, but I am consistently paranoid about it happening any day.

Basically: spamming txs over p2p with hundreds of signatures, one of them invalid (e.g. as offered by IBC), and consuming all my nodes' compute resources to verify them. We don't want these to get on-chain, so as not to consume chain resources; thus it's a costless attack vector today. (Traditionally this would be protected against via gas/fees, but IBC is a case where we made a standard not to do this.)

@jmalicevic merged commit 38f4756 into main on Mar 29, 2023
@jmalicevic deleted the jasmina/p2p-bad-peers branch on March 29, 2023
@thanethomson added the `rfc` label and removed the `wip` label on Jul 5, 2023
cometcrafter pushed a commit to graphprotocol/cometbft that referenced this pull request Jun 1, 2024
Labels: documentation, mempool, p2p, rfc

Successfully merging this pull request may close the following issue: mempool/p2p: Research implications of peer disconnect based on ResponseCheckTx