
cosmos-sdk don't have to sync db for each block #14966

Closed · yihuang opened this issue Feb 9, 2023 · 16 comments
@yihuang (Collaborator) commented Feb 9, 2023

Summary

Currently we always call WriteSync in the commit event of each block, but we don't have to do that, because Tendermint will replay the blocks for us if we lose some data.

Problem Definition

We do an unnecessary sync when writing to the DB.

Proposal

Conservative option:

  • use Write instead of WriteSync in commit (see the sketch after this list)

Aggressive option:
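
For the conservative option, a minimal sketch of the change, assuming the tm-db Batch interface (github.com/tendermint/tm-db) that the store writes through; commitBlock is an illustrative name, not an actual SDK function:

```go
package store

import (
	dbm "github.com/tendermint/tm-db"
)

// commitBlock flushes one block's accumulated writes. The conservative
// option only changes which flush method is called here.
func commitBlock(batch dbm.Batch) error {
	// Before: batch.WriteSync() forces an fsync/fdatasync on every block.
	// After: a plain Write skips the sync; if the machine crashes before the
	// OS flushes, Tendermint replays the missing blocks from its own store.
	return batch.Write()
}
```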

github-actions bot added the needs-triage label Feb 9, 2023
@yihuang changed the title from "cosmos-sdk don't have to sync db" to "cosmos-sdk don't have to sync db for each block" Feb 9, 2023
@alexanderbez (Contributor) commented

Are both atomic? What if Write writes data partially? Is that still OK?

@tac0turtle added the C:Store label and removed the needs-triage label Feb 9, 2023
@yihuang (Collaborator, Author) commented Feb 9, 2023

I think atomicity is the same as before; a batch write should be represented as a single entry in the WAL.
At least for LSM-tree DBs, atomicity relies on two parts:

  • the batch is atomically written into the WAL;
  • the batch is atomically written into the memtable and eventually flushed to an SST file as a whole.

I think both still hold without sync.
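
To illustrate the point at the goleveldb level (github.com/syndtr/goleveldb; writeBlockData is a hypothetical helper): a batch is appended to the WAL as a single record either way, and Sync only decides whether the call waits for the file to reach stable storage before returning.

```go
package store

import (
	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/opt"
)

// writeBlockData commits one block's keys as a single atomic batch.
func writeBlockData(db *leveldb.DB) error {
	batch := new(leveldb.Batch)
	batch.Put([]byte("store1/root"), []byte("hash1"))
	batch.Put([]byte("store2/root"), []byte("hash2"))

	// The batch becomes one WAL record whether or not Sync is set; a crash
	// can drop the whole record at the tail, but never half of it.
	return db.Write(batch, &opt.WriteOptions{Sync: false})
}
```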

@alexanderbez (Contributor) commented

I would like to confirm prior to changing anything ;-)

What is the semantic difference between Write and WriteSync then?

@yihuang (Collaborator, Author) commented Feb 10, 2023

> I would like to confirm prior to changing anything ;-)
>
> What is the semantic difference between Write and WriteSync then?

My understanding is that WriteSync calls fdatasync on the WAL file after the write; I can dig up some references in the implementation later.
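
For reference, this is roughly how a tm-db-style goleveldb backend maps the two methods onto the underlying Sync flag (paraphrased, not the library's exact code):

```go
package store

import (
	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/opt"
)

type goLevelDBBatch struct {
	db    *leveldb.DB
	batch *leveldb.Batch
}

// Write flushes the batch without forcing a sync of the WAL.
func (b *goLevelDBBatch) Write() error {
	return b.db.Write(b.batch, &opt.WriteOptions{Sync: false})
}

// WriteSync additionally waits for the WAL to hit stable storage
// (fdatasync on Linux) before returning.
func (b *goLevelDBBatch) WriteSync() error {
	return b.db.Write(b.batch, &opt.WriteOptions{Sync: true})
}
```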

@alexanderbez self-assigned this Feb 10, 2023
github-project-automation bot moved this to 📝 Todo in Cosmos-SDK Feb 10, 2023
@yihuang (Collaborator, Author) commented Feb 10, 2023

There is a good description of the commit pipeline here: the atomic unit in RocksDB can be a batch group, which may be larger than a single batch, but a batch will definitely never be committed partially.
And here are the RocksDB code comments about the sync option.

@ValarDragon (Contributor) commented

I'm in support of this -- atomicity should not be enforced at the layer of a single write to the DB, but instead with just a small write confirming that the write was done completely. Otherwise, rewrite the last block's data. This becomes much easier in the new IAVL storage layout.

We even have an IAVL-side optimization trying to improve this: cosmos/iavl#619

@alexanderbez (Contributor) commented

@yihuang where do you see WriteSync calls apart from flushMetadata? IMO the latter is really in the hot path of performance.

@yihuang (Collaborator, Author) commented Feb 13, 2023

> @yihuang where do you see WriteSync calls apart from flushMetadata? IMO the latter is really in the hot path of performance.

Just there; the IAVL tree doesn't seem to write with sync.
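
To make the discussion concrete, a simplified sketch in the spirit of rootmulti's flushMetadata (names and fields simplified, not the SDK's actual code), showing where that single per-block WriteSync sits:

```go
package store

import (
	dbm "github.com/tendermint/tm-db"
)

// commitInfo is a hypothetical stub for the SDK's per-block commit metadata.
type commitInfo struct {
	Version int64
	// ... per-store commit IDs ...
}

// flushMetadata persists one block's commit metadata in a single batch.
func flushMetadata(db dbm.DB, version int64, cInfo *commitInfo) error {
	batch := db.NewBatch()
	defer batch.Close()
	// ... Set the commit-info and latest-version keys on the batch here ...
	// This is the WriteSync under discussion: the one sync per block.
	return batch.WriteSync()
}
```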

@ValarDragon (Contributor) commented

Oh, I misunderstood this -- isn't it important for the metadata to be written synchronously? This is the data that is read to determine the last successfully completed write to that store, AFAIU?

@yihuang (Collaborator, Author) commented Feb 14, 2023

> Oh, I misunderstood this -- isn't it important for the metadata to be written synchronously? This is the data that is read to determine the last successfully completed write to that store, AFAIU?

The idea is just to let Tendermint replay some blocks if the system crashes and loses some data.

@ValarDragon (Contributor) commented

Hrmm, are we sure there are no edge cases around this?

Agreed that in the average case this is safe. It's equivalent to failing to write midway through commit. But can't we get situations where the store for module A is committed successfully for height H+2, while the store for module B was last synced at height H? If we crash, we now have a complex situation to recover from. (We'd need to roll back to H, which makes sense to do, but we don't currently have the software to correctly handle or apply such rollbacks.)

@yihuang (Collaborator, Author) commented Feb 14, 2023

> Hrmm, are we sure there are no edge cases around this?
>
> Agreed that in the average case this is safe. It's equivalent to failing to write midway through commit. But can't we get situations where the store for module A is committed successfully for height H+2, while the store for module B was last synced at height H? If we crash, we now have a complex situation to recover from. (We'd need to roll back to H, which makes sense to do, but we don't currently have the software to correctly handle or apply such rollbacks.)

Our current flow for loading the multistore is like this:

```
infos = loadCommitInfos(ver)
for info, storeName in zip(infos, stores):
    loadStore(storeName, info.CommitId.Version)
```

I think it's correct here. The basic assumption about the DB commit pipeline is that it can only corrupt at the tail, never in the middle. For example:

1. write store1 at N
2. write store2 at N
3. write commitInfos at N
4. write store1 at N+1
5. write store2 at N+1
6. write commitInfos at N+1

If commitInfo.version is persisted successfully, it implies the IAVL trees at that version are persisted as well. So as long as we always load commitInfo.version first and load the IAVL trees at that version, it's fine.
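
A Go rendering of that loading flow under the tail-corruption assumption (loadCommitInfos and loadStore are hypothetical helpers standing in for the SDK's commit-info decoding and store mounting):

```go
package store

import (
	dbm "github.com/tendermint/tm-db"
)

// storeInfo pairs a store name with the version recorded in the commit info
// (a hypothetical stand-in for the SDK's StoreInfo/CommitID).
type storeInfo struct {
	Name    string
	Version int64
}

// Hypothetical helpers: decode persisted commit info; load one store.
func loadCommitInfos(db dbm.DB, ver int64) ([]storeInfo, error) { return nil, nil }
func loadStore(name string, version int64) error                { return nil }

// loadMultiStore loads every store at exactly the version the persisted
// commit info points to, so a partially flushed newer height is ignored.
func loadMultiStore(db dbm.DB, ver int64) error {
	infos, err := loadCommitInfos(db, ver)
	if err != nil {
		return err
	}
	for _, info := range infos {
		if err := loadStore(info.Name, info.Version); err != nil {
			return err
		}
	}
	return nil
}
```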

@ValarDragon (Contributor) commented

But why does store2's write getting synced to disk imply anything about store1's write being synced to disk?

Isn't the commit info potentially in a separate physical DB from store1 and store2? If it's in the same DB, then I agree with your reasoning! (Though it's not clear to me that we want to imagine it being in the same DB longer term.)

If they're in separate DBs, couldn't we have a time diagram that looks like:

1. Everything synced to disk
2. write store1 at N+1
3. write store2 at N+1
4. write commitInfos at N+1
5. Store2, commit-infos for N+1 synced to disk
6. write store1 at N+2
7. write store2 at N+2
8. write commitInfos at N+2
9. Store2, commit-infos for N+2 synced to disk

I'm perhaps vastly misunderstanding something about syncing here! Happy to just be told I'm not getting it and move on :)

@yihuang (Collaborator, Author) commented Feb 14, 2023

Yeah, I was assuming they are the same DB; it's trickier if they are different DBs.
If they are different DBs, I think we need to sync the writes of each store.

@alexanderbez (Contributor) commented

The same DB object is passed to each store.

@yihuang (Collaborator, Author) commented May 16, 2023

I'm closing this one, since benchmarks show that the real cost is writing the nodes, not the sync.
And I have a better idea: we can do the whole writing part asynchronously. I'll create a separate issue for this (#16173).
