Possible race condition between Commit and CheckTx from "to-be-updated" mempool #1091

pdobacz · 2018-01-10T12:31:42Z

I'm not sure whether the cause is in Tendermint, but maybe someone can look at the explanation below and say whether the race condition is possible.

BUG REPORT (?)

Tendermint version
0.14.0, from source 88f5f21

Environment:

OS (e.g. from /etc/os-release):
Ubuntu 16.04
Install tools:
glide install/go install

What happened:
Under a heavy load of transactions (sent via broadcast_tx_sync) on a single Tendermint node with an ABCI app (communicating over a TCP socket), the first ABCI.CheckTx I get after an ABCI.Commit is one of the transactions that should be applied after the transactions from the mempool update (i.e. a "fresh" transaction from broadcast_tx_sync).

What you expected to happen:
After an ABCI.Commit the first ABCI.CheckTx I get is the first one that didn't get into the committed block, i.e. the first one from the mempool update (i.e. an old transaction).

How to reproduce it (as minimally and precisely as possible):
Can't reproduce easily without the ABCI app, but can elaborate on what I think might be the cause.

Commit is called with a lock on the mempool, meaning no calls to CheckTx can start. However, since CheckTx is called async in the mempool connection, some CheckTx might have already "sailed", when the lock is released in the mempool and Commit proceeds.

Then, that spurious CheckTx has not yet "begun" in the ABCI app (stuck in transport?). Instead, ABCI app manages to start to process the Commit. Next, the spurious, "sailed" CheckTx happens in the wrong place.

I have inserted a FlushSync call after the mempool.Lock() and just before a Commit call in https://github.com/tendermint/tendermint/blob/v0.14.0/state/execution.go#L253 and the issue went away.
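The suspected interleaving can be modelled deterministically. The following is a minimal sketch, not Tendermint code: `app`, `asyncConn`, and `simulate` are hypothetical names, and the "connection" is just a slice that queues requests until it is flushed. It only illustrates why a CheckTx that was already sent can reach the app after Commit unless the mempool connection is flushed first.

```go
package main

import "fmt"

// app records the order in which it observes calls; it stands in for an ABCI application.
type app struct {
	log []string
}

// asyncConn is a hypothetical model of the mempool connection: requests are
// queued on send and reach the app only when the queue is drained.
type asyncConn struct {
	app     *app
	pending []string
}

// CheckTxAsync queues a CheckTx request without waiting for the app to see it.
func (c *asyncConn) CheckTxAsync(tx string) {
	c.pending = append(c.pending, "CheckTx("+tx+")")
}

// FlushSync delivers every queued request to the app, in order.
func (c *asyncConn) FlushSync() {
	c.app.log = append(c.app.log, c.pending...)
	c.pending = nil
}

// simulate returns the order of calls seen by the app around one commit.
func simulate(flushBeforeCommit bool) []string {
	a := &app{}
	mem := &asyncConn{app: a}

	// A fresh tx is sent just before the mempool is locked; it is still in flight.
	mem.CheckTxAsync("newTx")

	if flushBeforeCommit {
		mem.FlushSync()
	}
	// Commit travels on a separate connection and reaches the app directly.
	a.log = append(a.log, "Commit")

	// The mempool update re-checks the tx that did not make it into the block.
	mem.CheckTxAsync("oldTx")
	mem.FlushSync()
	return a.log
}

func main() {
	fmt.Println("without flush:", simulate(false))
	fmt.Println("with flush:   ", simulate(true))
}
```

Without the flush, the model's app sees Commit, then the fresh transaction, then the old one, which is the ordering reported above. With the flush, the fresh transaction is checked before Commit and the old transaction is the first CheckTx afterwards.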

Anything else we need to know:
I can provide a patch against v0.14.0 that fixes that (with the FlushSync) but not a PR against develop since I can't work with v0.15.0 just yet.


zramsay · 2018-01-10T13:34:50Z

I'll let the pros chime in on the bug but sure, push a patch for 0.14.0 - we can cherry-pick the commit for later versions

pdobacz · 2018-01-10T14:08:52Z

this is as far as github allows me to go: v0.14.0...omisego:36714eb1997930171a75e77189580c6ca8978e88

pdobacz · 2018-01-10T16:19:43Z

I found out that the above fix didn't exactly remove the condition, just made it much less frequent, so I missed it in the first run.

I think this is the proper fix, where it is the mempool connection that gets flushed, not the consensus one:

v0.14.0...omisego:b394bc73a71d2684e06a400e65fb4cf9ff207500

melekes · 2018-01-23T12:43:51Z

> After an ABCI.Commit the first ABCI.CheckTx I get is one of the transactions that should be applied after the transactions from the mempool update (i.e. a "fresh" transaction from broadcast_tx_sync)

Note we're calling FlushSync here https://github.com/tendermint/tendermint/blob/develop/state/execution.go#L147, so maybe we should just move this call up right after Lock

if we call it after, we might receive a "fresh" transaction from `broadcast_tx_sync` before old transactions (which were not committed).

Refs #1091

```
Commit is called with a lock on the mempool, meaning no calls to CheckTx can start. However, since CheckTx is called async in the mempool connection, some CheckTx might have already "sailed", when the lock is released in the mempool and Commit proceeds.

Then, that spurious CheckTx has not yet "begun" in the ABCI app (stuck in transport?). Instead, ABCI app manages to start to process the Commit. Next, the spurious, "sailed" CheckTx happens in the wrong place.
```
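The ordering proposed here (lock the mempool, flush its connection, then commit) might look roughly like the sketch below. This is an illustration under assumptions, not the actual change: `fakeMempool`, `fakeConsensusConn`, `FlushAppConn`, `Update`, and `commit` are hypothetical names, and the fakes only record the order of calls.

```go
package main

import (
	"fmt"
	"sync"
)

// recorder collects the names of calls in the order they happen.
type recorder struct {
	calls []string
}

func (r *recorder) add(name string) {
	r.calls = append(r.calls, name)
}

// fakeMempool is a hypothetical stand-in for the mempool and its app connection.
type fakeMempool struct {
	mtx sync.Mutex
	rec *recorder
}

func (m *fakeMempool) Lock() {
	m.mtx.Lock()
	m.rec.add("mempool.Lock")
}

func (m *fakeMempool) Unlock() {
	m.rec.add("mempool.Unlock")
	m.mtx.Unlock()
}

// FlushAppConn would block until the app has seen every CheckTx already sent.
func (m *fakeMempool) FlushAppConn() error {
	m.rec.add("mempool.FlushAppConn")
	return nil
}

// Update would drop committed txs and re-check the remaining ones.
func (m *fakeMempool) Update() error {
	m.rec.add("mempool.Update")
	return nil
}

// fakeConsensusConn is a hypothetical stand-in for the consensus connection.
type fakeConsensusConn struct {
	rec *recorder
}

func (c *fakeConsensusConn) CommitSync() error {
	c.rec.add("consensus.CommitSync")
	return nil
}

// commit flushes the mempool connection while the lock is held, so no
// in-flight CheckTx can reach the app after Commit.
func commit(mem *fakeMempool, cons *fakeConsensusConn) error {
	mem.Lock()
	defer mem.Unlock()

	if err := mem.FlushAppConn(); err != nil {
		return err
	}
	if err := cons.CommitSync(); err != nil {
		return err
	}
	return mem.Update()
}

func main() {
	rec := &recorder{}
	if err := commit(&fakeMempool{rec: rec}, &fakeConsensusConn{rec: rec}); err != nil {
		fmt.Println("commit failed:", err)
		return
	}
	fmt.Println(rec.calls)
}
```

The design point is that the flush happens after the lock is taken: the lock stops new CheckTx calls from starting, and the flush waits for those already sent.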

ebuchman · 2018-01-24T04:57:41Z

Bah. Seems right, thanks for catching this!

To summarize the sequence of events:

mempool.CheckTx(newTx)
mempool.Lock
mempool.Commit
abci.CheckTx(newTx)
...

The issue being that the tx was sent to the socket, but we didn't wait for the app to see it before we moved forward with the commit. So then we commit, the app updates its mempool state, but then it sees a tx that should actually now be the last tx in the mempool, not the first!

Neat.

ebuchman · 2018-01-24T19:23:00Z

Merged to develop

ebuchman added the T:bug Type Bug (Confirmed) label Jan 21, 2018

melekes mentioned this issue Jan 23, 2018

call FlushSync before calling CommitSync #1143

Merged

ebuchman closed this as completed Jan 24, 2018

IlyaKarpuk pushed a commit to IlyaKarpuk/tendermint that referenced this issue Feb 21, 2018

merged: tendermint#1091

4134918

ratranqu mentioned this issue Feb 23, 2018

Chain falls over when a consistent load of transactions is thrown at it cosmos/ethermint-archive#363

Closed
