
cosmos-sdk don't have to sync db for each block #14966

Closed · yihuang opened this issue Feb 9, 2023 · 16 comments
@yihuang (Collaborator) commented Feb 9, 2023

Summary

Currently we always call WriteSync in the commit event of each block, but we don't have to do that, because Tendermint will replay the blocks for us if we lose some data.

Problem Definition

We do an unnecessary sync when writing to the DB.

Proposal

Conservative option:

  • use Write instead of WriteSync in commit (see the sketch after this list)

Aggressive option:
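
For the conservative option, a minimal sketch of the change, assuming the tm-db Batch interface (github.com/tendermint/tm-db) that the store writes through; commitBlock is an illustrative name, not an actual SDK function:

```go
package store

import (
	dbm "github.com/tendermint/tm-db"
)

// commitBlock flushes one block's accumulated writes. The conservative
// option only changes which flush method is called here.
func commitBlock(batch dbm.Batch) error {
	// Before: batch.WriteSync() forces an fsync/fdatasync on every block.
	// After: a plain Write skips the sync; if the machine crashes before the
	// OS flushes, Tendermint replays the missing blocks from its own store.
	return batch.Write()
}
```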

github-actions bot added the needs-triage label Feb 9, 2023
@yihuang changed the title from "cosmos-sdk don't have to sync db" to "cosmos-sdk don't have to sync db for each block" Feb 9, 2023
@alexanderbez (Contributor) commented

Are both atomic? What if Write writes data partially? Is that still OK?

@tac0turtle added the C:Store label and removed the needs-triage label Feb 9, 2023
@yihuang (Collaborator, Author) commented Feb 9, 2023

I think atomicity is the same as before; a batch write should be represented as a single entry in the WAL.
At least for LSM-tree DBs, atomicity relies on two parts:

  • the batch is atomically written into the WAL;
  • the batch is atomically written into the memtable and eventually flushed to an SST file as a whole.

I think both still hold without sync.
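
To illustrate the point at the goleveldb level (github.com/syndtr/goleveldb; writeBlockData is a hypothetical helper): a batch is appended to the WAL as a single record either way, and Sync only decides whether the call waits for the file to reach stable storage before returning.

```go
package store

import (
	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/opt"
)

// writeBlockData commits one block's keys as a single atomic batch.
func writeBlockData(db *leveldb.DB) error {
	batch := new(leveldb.Batch)
	batch.Put([]byte("store1/root"), []byte("hash1"))
	batch.Put([]byte("store2/root"), []byte("hash2"))

	// The batch becomes one WAL record whether or not Sync is set; a crash
	// can drop the whole record at the tail, but never half of it.
	return db.Write(batch, &opt.WriteOptions{Sync: false})
}
```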

@alexanderbez (Contributor) commented

I would like to confirm prior to changing anything ;-)

What is the semantic difference between Write and WriteSync then?

@yihuang (Collaborator, Author) commented Feb 10, 2023

> I would like to confirm prior to changing anything ;-)
>
> What is the semantic difference between Write and WriteSync then?

My understanding is that WriteSync calls fdatasync on the WAL file after the write; I can dig up some references in the implementation later.
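
For reference, this is roughly how a tm-db-style goleveldb backend maps the two methods onto the underlying Sync flag (paraphrased, not the library's exact code):

```go
package store

import (
	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/opt"
)

type goLevelDBBatch struct {
	db    *leveldb.DB
	batch *leveldb.Batch
}

// Write flushes the batch without forcing a sync of the WAL.
func (b *goLevelDBBatch) Write() error {
	return b.db.Write(b.batch, &opt.WriteOptions{Sync: false})
}

// WriteSync additionally waits for the WAL to hit stable storage
// (fdatasync on Linux) before returning.
func (b *goLevelDBBatch) WriteSync() error {
	return b.db.Write(b.batch, &opt.WriteOptions{Sync: true})
}
```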

@alexanderbez self-assigned this Feb 10, 2023
github-project-automation bot moved this to 📝 Todo in Cosmos-SDK Feb 10, 2023
@yihuang (Collaborator, Author) commented Feb 10, 2023

There is a good description of the commit pipeline here: the atomic unit in RocksDB can be a batch group, which may be larger than a single batch, but a batch will definitely never be committed partially.
And here are the RocksDB code comments about the sync option.

@ValarDragon (Contributor) commented

I'm in support of this -- atomicity should not be enforced at the layer of a single write to the DB, but instead with just a small write confirming that the write was done completely. Otherwise, rewrite the last block's data. This becomes much easier in the new IAVL storage layout.

We even have an IAVL-side optimization trying to improve this: cosmos/iavl#619

@alexanderbez (Contributor) commented

@yihuang where do you see WriteSync calls apart from flushMetadata? IMO the latter is really in the hot path of performance.

@yihuang (Collaborator, Author) commented Feb 13, 2023

> @yihuang where do you see WriteSync calls apart from flushMetadata? IMO the latter is really in the hot path of performance.

Just there; the IAVL tree doesn't seem to write with sync.
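
To make the discussion concrete, a simplified sketch in the spirit of rootmulti's flushMetadata (names and fields simplified, not the SDK's actual code), showing where that single per-block WriteSync sits:

```go
package store

import (
	dbm "github.com/tendermint/tm-db"
)

// commitInfo is a hypothetical stub for the SDK's per-block commit metadata.
type commitInfo struct {
	Version int64
	// ... per-store commit IDs ...
}

// flushMetadata persists one block's commit metadata in a single batch.
func flushMetadata(db dbm.DB, version int64, cInfo *commitInfo) error {
	batch := db.NewBatch()
	defer batch.Close()
	// ... Set the commit-info and latest-version keys on the batch here ...
	// This is the WriteSync under discussion: the one sync per block.
	return batch.WriteSync()
}
```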

@ValarDragon (Contributor) commented

Oh, I misunderstood this -- isn't it important for the metadata to be written synchronously? This is the data that is read to determine the last successfully completed write to that store, AFAIU?

@yihuang (Collaborator, Author) commented Feb 14, 2023

> Oh, I misunderstood this -- isn't it important for the metadata to be written synchronously? This is the data that is read to determine the last successfully completed write to that store, AFAIU?

The idea is just to let Tendermint replay some blocks if the system crashes and loses some data.

@ValarDragon (Contributor) commented

Hrmm, are we sure there are no edge cases around this?

Agreed that in the average case this is safe. It's equivalent to failing to write midway through commit. But can't we get situations where the store for module A is committed successfully for height H+2, while the store for module B was last synced at height H? If we crash, we now have a complex situation to recover from. (We'd need to roll back to H, which makes sense to do, but we don't currently have the software to correctly handle or apply such rollbacks.)

@yihuang (Collaborator, Author) commented Feb 14, 2023

> Hrmm, are we sure there are no edge cases around this?
>
> Agreed that in the average case this is safe. It's equivalent to failing to write midway through commit. But can't we get situations where the store for module A is committed successfully for height H+2, while the store for module B was last synced at height H? If we crash, we now have a complex situation to recover from. (We'd need to roll back to H, which makes sense to do, but we don't currently have the software to correctly handle or apply such rollbacks.)

Our current flow for loading the multistore is like this:

```
infos = loadCommitInfos(ver)
for info, storeName in zip(infos, stores):
    loadStore(storeName, info.CommitId.Version)
```

I think it's correct here. The basic assumption about the DB commit pipeline is that it can only corrupt at the tail, never in the middle. For example:

1. write store1 at N
2. write store2 at N
3. write commitInfos at N
4. write store1 at N+1
5. write store2 at N+1
6. write commitInfos at N+1

If commitInfo.version is persisted successfully, it implies the IAVL trees at that version are persisted as well. So as long as we always load commitInfo.version first and load the IAVL trees at that version, it's fine.
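
A Go rendering of that loading flow under the tail-corruption assumption (loadCommitInfos and loadStore are hypothetical helpers standing in for the SDK's commit-info decoding and store mounting):

```go
package store

import (
	dbm "github.com/tendermint/tm-db"
)

// storeInfo pairs a store name with the version recorded in the commit info
// (a hypothetical stand-in for the SDK's StoreInfo/CommitID).
type storeInfo struct {
	Name    string
	Version int64
}

// Hypothetical helpers: decode persisted commit info; load one store.
func loadCommitInfos(db dbm.DB, ver int64) ([]storeInfo, error) { return nil, nil }
func loadStore(name string, version int64) error                { return nil }

// loadMultiStore loads every store at exactly the version the persisted
// commit info points to, so a partially flushed newer height is ignored.
func loadMultiStore(db dbm.DB, ver int64) error {
	infos, err := loadCommitInfos(db, ver)
	if err != nil {
		return err
	}
	for _, info := range infos {
		if err := loadStore(info.Name, info.Version); err != nil {
			return err
		}
	}
	return nil
}
```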

@ValarDragon (Contributor) commented

But why does store2's write getting synced to disk imply anything about store1's write being synced to disk?

Isn't the commit info potentially in a separate physical DB from store1 and store2? If it's in the same DB, then I agree with your reasoning! (Though it's not clear to me that we want to imagine it being in the same DB longer term.)

If they're in separate DBs, couldn't we have a time diagram that looks like:

1. Everything synced to disk
2. write store1 at N+1
3. write store2 at N+1
4. write commitInfos at N+1
5. Store2, commit-infos for N+1 synced to disk
6. write store1 at N+2
7. write store2 at N+2
8. write commitInfos at N+2
9. Store2, commit-infos for N+2 synced to disk

I'm perhaps vastly misunderstanding something about syncing here! Happy to just be told I'm not getting it and move on :)

@yihuang (Collaborator, Author) commented Feb 14, 2023

Yeah, I was assuming they are the same DB; it's trickier if they are different DBs.
If they are different DBs, I think we need to sync the writes of each store.

@alexanderbez (Contributor) commented

The same DB object is passed to each store.

@yihuang (Collaborator, Author) commented May 16, 2023

I'm closing this one, since benchmarks show that the real cost is writing the nodes, not the sync.
And I have a better idea: we can do the whole writing part asynchronously. I'll create a separate issue for this (#16173).
