-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Excessive per-block write IO activity #2131
Comments
Tagging @liamsi @jlandrews @mossid as this seems AVL related |
Iavl seems likely to me, @jlandrews iavl benchmarks show that there is a large amount of memory allocated for a single key insert. (On the order of 23kb iirc for simple configurations) We could benchmark over a simulation, time / memory performance with an iavl backend to determine if this is a prelaunch problem. |
If this is new behavior on 8001 I think it's unlikely that the cause is IAVL directly, since the dependency hasn't been changed from 700x, (v0.9.2 in both). It's entirely possible that some change in SDK code discovered a latent problem in IAVL though, or that it's related to pruning somehow. |
I imagine noone had looked at the total writes previously, not that its actually new this release. Or it could be the staking changes have more IAVL writes, so this could have become a noticeable problem without IAVL changing. |
Pruning could be something worthwhile to look at. I recall it causing majorrrr slowdowns in Ethermint profiles. |
To diagnose this further, I'd suggest adding prometheus metrics #2169. |
I just had a quick look at the iavl repo in relation to #2169. I have no intimate knowledge of the codebase. With this disclaimer, it appears to me that if SaveVersion() is invoked it will flush the entire tree to leveldb. Wouldn't it be more efficient/natural to just update the underlying database as the changes are made at the node level? So, basically, just doing incremental updates to the DB when necessary instead of flushing everything on block boundaries. This might make versioning a harder, but perhaps leveldb snapshots could be used here. Anyway, would be good to get some better Prometheus metrics first to determine what is causing it. |
I had a similar concern during Ethermint development. I think this will be addressed in #1952. Correct me if I'm wrong @jlandrews |
To @mdyring's point, the current behavior of SaveVersion is that it will recursively flush new nodes to the db, but any nodes that haven't been changed since the last version won't be saved. This does create a lot of non-uniform usage on block boundaries though. #1952 doesn't change that, it's biggest performance related change is not loading all the roots from disk when the tree is loaded, but otherwise most operations are essentially unchanged. It's definitely doable in theory, but I think it would end up being a question of batching vs streaming updates. Streaming updates could lead to less bursty writes on the block boundaries, but more usage overall. That could end up being better though, so it's really hard to say without more data. I'm currently working on some benchmarks the various operations, so hopefully that can help inform this a bit more. |
Thanks for the clarification @jlandrews |
I am not so concerned about batching/streaming, as the fact that LevelDB feels the need to compact/reorganize this often. It just occured to me that the reason that it might be doing this because there are a boatload of changes on the primary key/index. At least that would be the typically behaviour on other DBs. I haven't dived into the code, so just an idea. But primary key should be something immutable for a node (say, some monotonically incremented id) to avoid reorg/compaction. |
@mdyring Do you have plans to fix this issue in near future? Hoping so, currently it so blocking.... |
I only reported the issue. I have, as you, high hopes that this will be fixed sooner rather than later. :-) |
@mdyring In our case, compiling with cleveldb and fixing multistore usage fixed problem. |
@alessio Are we planning on doing anything here? |
Sounds like we just need to be building with clevledb, and/or replace GoLevelDB with something else. @hleb-albau can you clarify what |
Closing this issue as it seems dead -- please reopen if need be. |
Would appreciate to keep it open until there is a proposed fix that is confirmed working. |
This is blocked on tendermint/tendermint#3378 @mdyring @hleb-albau I'd very much appreciate if you could do some testing with CLevelDB and report back your findings: |
Just ran two nodes alongside each other with cleveldb and goleveldb. Sync speed was comparable and cleveldb is exhibiting the same behavior with often occuring compactions in application.db:
It would be nice with some educated guess as to why compaction is needed this often, it must be due to a lot of changes in key values? |
Maybe the next step here is to expose the compaction parameters and test with different settings. Is this possible? |
@jackzampolin Before tweaking the compaction parameters I think it would be helpful to understand why compaction is needed this often. As the application.db grows, compaction will just get more and more IO intensive as it seems everything is (re-)written. So it should ideally happen very rarely. I am not a LevelDB expert, but from my experience with relational databases, I would only expect compaction/reorganization to be needed if the index is fragmented (in this case there is only one index, the "key"). That could be due to pages being full (so tweakable perhaps via config) or because the key is not immutable per node. Does IAVL ensure that:
|
@jlandrews or @liamsi is there any insight you can share to the above? |
Relevant: syndtr/goleveldb#226 (comment) |
I believe this is the case. If I'm not mistaken, the key in the LevelDB is the hash of the node, which will change if any of its children change, so this could be the reason for the constant changes. Sounds like it could require some significant engineering work to address this, but I believe this is along the lines of the work Alexey was doing for the Ethermint store and in Turbo-Geth |
Copying slack history here to capture some details, specifically how integer ID would be used and a "root" key used to reference the root node integer ID.
|
We have done a great deal work on both an Interblock Cache and another cache in the IAVL tree to help mitigate this issue. Going to go ahead and close this. If there are additional performance concerns please open additional issues. |
Summary of Bug
Excessing write IO activity for each block.
Appears to be upwards of 40 MB written per block, making it unnecessarily hard to spin up a full node and catch up with the network.
The activity seems to be centered around data/gaia.db, excerpt from the LOG file in same directory:
The compaction above seems to coincide with a new block being created, I imagine after a lot of updates to the table.
Steps to Reproduce
Participate in gaia-8001 testnet.
For Admin Use
The text was updated successfully, but these errors were encountered: