Logical Commit(s) are not Atomic #6370
Comments
Seems related to: #5746 |
Please confirm that you're starting the node with a pruning strategy, say You must keep in mind IAVL now keeps versions in-memory only and then periodically flushes to disk. |
Note, I cannot reproduce this on Gaia. |
Have confirmed this behavior as well on kava v0.8.1. Had to shut down and restart the process ~20 times to get a failure. |
I can confirm this is not the case. This is using pruning=everything from height 0. |
Able to replicate for the scenario of Test Heights: 2500 ++ to 3300 ++ Remedy: |
@AdityaSripal @zmanian more pruning related issues. Seems like when |
I've been able to reproduce this with Gaia
This happened when restarting just as block 798 was in the process of being committed. I have some ideas about what might be causing this, will dig further. |
I've identified the problem: Cosmos SDK commits are not atomic (as @alexanderbez also suspected). The SDK uses multiple logically-independent IAVL stores with a single backing LevelDB database via The problem is further exacerbated by Tendermint not waiting for in-flight blocks to be applied before exiting. The following are LevelDB entries from the broken node, showing the IAVL root nodes of the
i, _ := binary.Uvarint([]byte{0x9d, 0x06})
fmt.Printf("%v", i) // Output: 797
Thus, on startup the root store attempts to load version 797, which has already been deleted by the
I believe this is a different problem from cosmos/iavl#261, so I will continue debugging that separately.
How to fix
I will leave it for the SDK team to fix the commit issue. Ideally, a single ACID database transaction should be used to store all data for a given version, but this probably requires a major architecture change. Barring that, I'd recommend first committing all sub-stores, then updating the root commit data, and only then deleting the old versions - and also to consider the effects of having some IAVL stores with "future" versions already committed, which must be ignored/deleted/replaced. Tendermint should also wait for commits to complete before shutting down, see tendermint/tendermint#5002. |
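The commit ordering recommended in the comment above (sub-stores first, then the root commit data, then pruning) can be illustrated with a minimal Go sketch. This is not SDK or IAVL code: the `kv` map standing in for the shared LevelDB database, the `key`, `commit` and `loadable` helpers, and the store names are all hypothetical, chosen only to show that a crash at any point in this ordering leaves a root version whose sub-stores are all still present.

```go
package main

import "fmt"

// kv is a hypothetical stand-in for the single shared LevelDB database.
type kv map[string]bool

// key names a (store, version) entry; the "root" store holds the commit info.
func key(store string, version int64) string {
	return fmt.Sprintf("%s/%d", store, version)
}

// commit writes version v in the suggested order and stops after `steps`
// writes, which simulates the process dying mid-commit.
func commit(db kv, stores []string, v int64, steps int) {
	ops := []func(){}
	// 1. Commit every sub-store first.
	for _, s := range stores {
		s := s
		ops = append(ops, func() { db[key(s, v)] = true })
	}
	// 2. Only then record the new version in the root commit data.
	ops = append(ops, func() { db[key("root", v)] = true })
	// 3. Finally prune the previous version (as with pruning=everything).
	for _, s := range append([]string{"root"}, stores...) {
		s := s
		ops = append(ops, func() { delete(db, key(s, v-1)) })
	}
	for i, op := range ops {
		if i >= steps {
			return // simulated crash
		}
		op()
	}
}

// loadable returns the latest root version at or below maxV and reports
// whether every sub-store still has that version. Sub-store versions newer
// than the root ("future" versions) are simply ignored.
func loadable(db kv, stores []string, maxV int64) (int64, bool) {
	for v := maxV; v >= 0; v-- {
		if !db[key("root", v)] {
			continue
		}
		for _, s := range stores {
			if !db[key(s, v)] {
				return v, false
			}
		}
		return v, true
	}
	return -1, false
}

func main() {
	stores := []string{"acc", "staking"} // illustrative store names
	for steps := 0; steps <= 2*len(stores)+2; steps++ {
		db := kv{}
		commit(db, stores, 797, 1<<30) // version 797 fully committed
		commit(db, stores, 798, steps) // version 798 interrupted after `steps` writes
		v, ok := loadable(db, stores, 798)
		fmt.Printf("crash after %d writes: root=%d loadable=%v\n", steps, v, ok)
	}
}
```

In this model the root either still points at 797 (with some sub-stores possibly holding a "future" 798 that must be ignored or replaced on restart) or already points at 798 with all sub-stores committed. It does not model partial writes within a single store, which a real fix would also have to consider.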
Excellent @erikgrinaker, this confirms my suspicions -- updates to the logical stores and the root logical store (along with pruning) are NOT atomic. Since they all use the same underlying physical store, we need to devise a way for them all to use the same batch object. So, as you've pointed out, this means some relatively major changes to the SDK and IAVL. This is certainly blocking 0.39. I'll take this on. |
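The "same batch object" idea from the comment above can be sketched roughly as follows. Everything here is an assumption made for illustration: the `batch` and `prefixStore` types and their methods are hypothetical and do not reflect the SDK, IAVL or tm-db APIs. The sketch only shows several logical stores buffering their writes and prunes into one shared batch, so that the database sees either all of a version's changes or none of them.

```go
package main

import "fmt"

// batch is a hypothetical stand-in for a database write batch: operations
// are buffered and only reach the database when write is called.
type batch struct {
	sets    map[string]string
	deletes []string
}

func newBatch() *batch { return &batch{sets: map[string]string{}} }

// write applies the buffered operations to db. A real database would apply
// a batch atomically; this in-memory version only models that behaviour.
// Sets are applied before deletes, so a key should not appear in both.
func (b *batch) write(db map[string]string) {
	for k, v := range b.sets {
		db[k] = v
	}
	for _, k := range b.deletes {
		delete(db, k)
	}
}

// prefixStore is a logical store that namespaces its keys inside the
// shared batch instead of writing to the database directly.
type prefixStore struct {
	prefix string
	b      *batch
}

func (s prefixStore) Set(k, v string) { s.b.sets[s.prefix+k] = v }
func (s prefixStore) Delete(k string) { s.b.deletes = append(s.b.deletes, s.prefix+k) }

func main() {
	// State after version 797 has been committed.
	db := map[string]string{"root/latest": "797", "acc/797": "a", "staking/797": "s"}

	// All logical stores, including the root store, share one batch.
	b := newBatch()
	acc := prefixStore{prefix: "acc/", b: b}
	staking := prefixStore{prefix: "staking/", b: b}
	root := prefixStore{prefix: "root/", b: b}

	acc.Set("798", "a2")
	staking.Set("798", "s2")
	root.Set("latest", "798")
	acc.Delete("797") // pruning goes through the same batch
	staking.Delete("797")

	// A crash here leaves db untouched, still consistently at 797.
	fmt.Println("before write:", db["root/latest"], len(db))

	b.write(db)
	fmt.Println("after write:", db["root/latest"], len(db))
}
```

The design point is that pruning and the root commit data travel in the same batch as the sub-store updates, so no ordering between them needs to be enforced.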
We'll be tackling this when evaluating the future of IAVL / commitment proofs, and it's currently blocked on that larger conversation. Handover of IAVL is starting from Interchain GmbH this week. |
The solution to this is being discussed as part of the larger ADR-040 work in #9331 |
More precisely:
Details: #9355 |
ACK |
Closing this, as it's being discussed in the store v2 working group |
Summary of Bug
When syncing a node using --pruning=everything, it is possible to gracefully stop a node and have it fail to restart.
Version
cosmos-sdk 0.38.4 (via both Kava v0.8.1 and Gaia)
tendermint 0.33.3 and 0.33.4
Steps to Reproduce
nothing, everything and syncable.
pruning=everything (both commandline and app.toml behave the same - as expected)
Expected behaviour: Node starts and begins syncing again
Actual behaviour: (not 100% of the time, but roughly 75% of tests)
Anecdotally, if you stop shortly after restarting, this does not happen - however, a few thousand blocks in, it is reasonably consistent behaviour that the node fails in the way described above.