Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EPIC: non-breaking but big UX improvements on IAVL #647

Closed
yihuang opened this issue Dec 12, 2022 · 7 comments
Closed

EPIC: non-breaking but big UX improvements on IAVL #647

yihuang opened this issue Dec 12, 2022 · 7 comments

Comments

@yihuang
Copy link
Collaborator

yihuang commented Dec 12, 2022

There are several huge UX improvements that can be done in a non-breaking way, which is also pretty easy to implement.

Faster Startup

Currently the node startup is very slow because of needing to load all roots, it takes half an hour on our production network just to start an archive node.

There is a lazy mode in current code base, it can reduce the startup time to seconds, but the comments warns about using for write operations, but we have tried it on production nodes and don't find any issues yet, reading code also don't show obvious issues with it, so we think it's worthy to give it try.
We propose to do some small adjustments to LazyLoadVersion to make it do the same thing as LoadVersion1 and add an optional flag2 in cosmos-sdk to enable the lazy mode for all operations. It really solves a big UX issue.

Faster Rollback

There are several trivial but significant performance improvement opportunities3 in method LoadVersionForOverwriting:

  • Skip unnecessary subtree traversal when deleting the latest version.
  • Restrict the iteration range when deleting orphans.
  • Remove the partial fast node patches since it'll be fully re-indexed later.

These changes are backward compatible, it just removes some wasted parts.
With the above improvements, together with LazyLoadVersion and disable the fast node index, rollback can finished in 1 second, that's a huge UX improvement for recovering from app-hash mismatch situation.
But when you enable the fast node index, it'll take the time to rebuild the index, that's unavoidable.

Delete Orphan Records To Save Space

Currently iavl tree stores the orphaned nodes for each version which is used to prune versions, it's proven by 4 and 5 that it's possible to do pruning without this information. Although the new approach can reduce db size and improvement main operations, but slows down the pruning operations, so I don't recommend to do this refactoring in v0.19.x, it'll work better with the new node key format. (TODO we can wait for some benchmark numbers to decide later).

But existing archive node operators can already delete the orphan records now to save space now, our testing on testnet archive node shows it can save 24% in application.db, here's an example to use compaction filter to do it for rocksdb backend6.

We can possibly provide the new pruning approach as a separate cli tool for users who need it though.

Footnotes

  1. https://github.com/cosmos/iavl/pull/638

  2. https://github.com/cosmos/cosmos-sdk/pull/14189

  3. https://github.com/cosmos/iavl/pull/636

  4. https://github.com/cosmos/iavl/pull/646

  5. https://github.com/cosmos/iavl/pull/641

  6. https://gist.github.com/yihuang/7031416e592f0a2f85201a775269f6a3

@yihuang
Copy link
Collaborator Author

yihuang commented Dec 14, 2022

Now the rollback speedup is merged thanks @marko, the deleting orphan records don't need code changes to current version.
There's only one thing left in this issue: using lazy mode to speedup startup, WDYT?

@tac0turtle
Copy link
Member

What is the trade off of making lazy default vs not? I saw you have a pr to make it optional

@yihuang
Copy link
Collaborator Author

yihuang commented Dec 14, 2022

What is the trade off of making lazy default vs not? I saw you have a pr to make it optional

Yeah, no need to make it default, just configurable, let user choose, a pruning node operator maybe don't bother to try it, but an archive node operator who spend half hour just to start the node would be desperately want it.
with lazy mode:

  • startup much faster
  • no effects on main operations in consensus state machine, which only need to bumping the latest version.
  • Pruning operations seems not affected, they always seek for versions with db iterators.
  • Some public apis may become slower, for example "VersionExists", but it seems not used by cosmos-sdk.

@tac0turtle
Copy link
Member

makes sense, if its optional lets support it and document it. Should also make sure we remove it in the future.

@tac0turtle
Copy link
Member

once this lands, I can test the release branch on a mainnet

@faddat
Copy link
Contributor

faddat commented Dec 28, 2022

@yihuang hey, I think that there's an underlying issue that you're hitting with the half-hour startup times -- the limit on the number of keys that can be handled efficiently in goleveldb.

@tac0turtle
Copy link
Member

all items have been merged so we can close this

@github-project-automation github-project-automation bot moved this from 📝 Todo to 👏 Done in Cosmos-SDK Jan 31, 2023
@tac0turtle tac0turtle removed this from Cosmos-SDK May 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants