Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible race condition between Commit and CheckTx from "to-be-updated" mempool #1091

Closed
pdobacz opened this issue Jan 10, 2018 · 6 comments
Closed
Labels
T:bug Type Bug (Confirmed)

Comments

@pdobacz
Copy link

pdobacz commented Jan 10, 2018

I'm not sure whether the cause is in Tendermint, but maybe someone can look into the explanation and say, whether the race condition is possible.

BUG REPORT (?)

Tendermint version
0.14.0, from source 88f5f21

Environment:

  • OS (e.g. from /etc/os-release):
    Ubuntu 16.04

  • Install tools:
    glide install/go install

What happened:
During heavy load of transactions (sent via broadcast_tx_sync) on a single Tendermint node w/ ABCI (communication with the ABCI via tcp socket).

After an ABCI.Commit the first ABCI.CheckTx I get is one of the transactions that should be applied after the transactions from the mempool update (i.e. a "fresh" transaction from broadcast_tx_sync)

What you expected to happen:
After an ABCI.Commit the first ABCI.CheckTx I get is the first one that didn't get into the committed block, i.e. the first one from the mempool update (i.e. an old transaction).

How to reproduce it (as minimally and precisely as possible):
Can't reproduce easily without the ABCI app, but can elaborate on what I think might be the cause.

Commit is called with a lock on the mempool, meaning no calls to CheckTx can start. However, since CheckTx is called async in the mempool connection, some CheckTx might have already "sailed", when the lock is released in the mempool and Commit proceeds.

Then, that spurious CheckTx has not yet "begun" in the ABCI app (stuck in transport?). Instead, ABCI app manages to start to process the Commit. Next, the spurious, "sailed" CheckTx happens in the wrong place.

I have inserted a FlushSync call after the mempool.Lock() and just before a Commit call in https://github.com/tendermint/tendermint/blob/v0.14.0/state/execution.go#L253 and the issue went away.

Anything else do we need to know:
I can provide a patch against v0.14.0 that fixes that (with the FlushSync) but not a PR against develop since I can't work with v0.15.0 just yet.

@zramsay
Copy link
Contributor

zramsay commented Jan 10, 2018

I'll let the pros chime in on the bug but sure, push a patch for 0.14.0 - we can cherry-pick the commit for later versions

@pdobacz
Copy link
Author

pdobacz commented Jan 10, 2018

this is as far as github allows me to go: v0.14.0...omisego:36714eb1997930171a75e77189580c6ca8978e88

@pdobacz
Copy link
Author

pdobacz commented Jan 10, 2018

I found out that the above fix didn't exactly remove the condition, just made it much less frequent, so I missed it in the first run.

I think this is a proper move, where it is the mempool connection that get's flushed, not the consensus one:

v0.14.0...omisego:b394bc73a71d2684e06a400e65fb4cf9ff207500

@ebuchman ebuchman added the T:bug Type Bug (Confirmed) label Jan 21, 2018
@melekes
Copy link
Contributor

melekes commented Jan 23, 2018

After an ABCI.Commit the first ABCI.CheckTx I get is one of the transactions that should be applied after the transactions from the mempool update (i.e. a "fresh" transaction from broadcast_tx_sync)

Note we're calling FlushSync here https://github.com/tendermint/tendermint/blob/develop/state/execution.go#L147, so maybe we should just move this call up right after Lock

melekes added a commit that referenced this issue Jan 23, 2018
if we call it after, we might receive a "fresh" transaction from
  `broadcast_tx_sync` before old transactions (which were not
  committed).

Refs #1091

```
Commit is called with a lock on the mempool, meaning no calls to CheckTx
can start. However, since CheckTx is called async in the mempool
connection, some CheckTx might have already "sailed", when the lock is
released in the mempool and Commit proceeds.

Then, that spurious CheckTx has not yet "begun" in the ABCI app (stuck
in transport?). Instead, ABCI app manages to start to process the
Commit. Next, the spurious, "sailed" CheckTx happens in the wrong place.
```
@ebuchman
Copy link
Contributor

Bah. Seems right, thanks for catching this!

To summarize the sequence of events:

  • mempool.CheckTx(newTx)
  • mempool.Lock
  • mempool.Commit
  • abci.CheckTx(newTx)
  • ...

The issue being that the tx was sent to the socket, but we didnt wait for the app to see it before we move forward with the commit. So then we commit, the app updates its mempool state, but then it sees a tx that should actually now be the last tx in the mempool, not the first!

Neat.

@ebuchman
Copy link
Contributor

Merged to develop

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T:bug Type Bug (Confirmed)
Projects
None yet
Development

No branches or pull requests

4 participants