
Refactor One Large Commit Into Smaller Batches #619

Closed
p0mvn opened this issue Nov 10, 2022 · 16 comments

@p0mvn
Member

p0mvn commented Nov 10, 2022

Proposed in collaboration with: @ValarDragon @catShaark

Introduction

Our commit logic does not take into account the constraints of the underlying LSM key-value store backends. As a result, a lot of performance is wasted.

This issue is a proposal to refactor the commit logic to be more aligned with the underlying LSM backends.

Separate One Large Commit into Smaller Batches

We commit once per SDK store, at the end of every block, in ABCI's Commit(). Currently, this logical commit translates directly to a single atomic database commit per SDK store.

Every write incurs some "merkleization overhead" (all of the IAVL inner nodes which must be written). As a result, it is typical to have store writes of 1-10MB, reaching gigabytes for certain operations, for example importing blockchain genesis or processing the Osmosis epoch.

We claim here that separating one atomic database commit per store into several smaller atomic commits yields a significant performance benefit. We sketch the reasons below:

  • Committing 1GB atomically is much slower than doing 10,000 commits of size 100KB, potentially concurrently, with a 'wait group' telling us when all commits have finished (see the sketch after this list).
    • Ethereum has found that 100KB is optimal, and in general most databases are optimized for 100s of KBs to a few MBs [1].
  • Most relevant DB implementations use contiguous segments of memory for handling the changes related to a batch, so even describing these batches requires a large amount of RAM at once (which in Go translates into eventual stop-the-world garbage collection) [7].
  • A large batch composed of many small records can require twice as much memory when inserted into an LSM memtable as is required by the batch itself. Note that this causes a temporary increase in memory requirements, because the batch memory is not freed until the batch is completely committed [2].
  • There is compaction overhead. Compactions happen periodically and merge the LSM SSTables, keeping the number of SSTable files, and therefore read performance, within acceptable bounds. When we commit a large batch at once, we have to pay this overhead immediately, causing large spikes in RAM and CPU usage [3]. Some LSM designs may even cause OOM due to space amplification [4].
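To make the first point concrete, here is a minimal sketch of chunked, concurrent commits; kvPair, the local batch interface, and newBatch are hypothetical stand-ins for illustration, not part of the cosmos-db API:

import "sync"

// batch is a minimal stand-in for db.Batch (illustration only).
type batch interface {
    Set(key, value []byte) error
    Write() error
}

type kvPair struct{ key, value []byte }

// commitInChunks splits pairs into chunks of at most chunkBytes and commits
// each chunk as its own atomic batch, concurrently. The wait group tells us
// when every chunk has finished; the first error (if any) is reported.
func commitInChunks(pairs []kvPair, chunkBytes int, newBatch func() batch) error {
    var wg sync.WaitGroup
    errCh := make(chan error, 1)
    flush := func(b batch) {
        defer wg.Done()
        if err := b.Write(); err != nil {
            select {
            case errCh <- err:
            default:
            }
        }
    }
    b, size := newBatch(), 0
    for _, p := range pairs {
        if err := b.Set(p.key, p.value); err != nil {
            return err
        }
        size += len(p.key) + len(p.value)
        if size >= chunkBytes {
            wg.Add(1)
            go flush(b)
            b, size = newBatch(), 0
        }
    }
    wg.Add(1)
    go flush(b) // the final partial chunk
    wg.Wait()   // all commits have finished
    select {
    case err := <-errCh:
        return err
    default:
        return nil
    }
}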

The large commit problem is especially notable when importing genesis. Since it persists all genesis data in memory, Osmosis nodes require at least 64 GB of RAM to survive the import [5].

To mitigate these problems, we suggest saving to disk in smaller atomic segments instead of one large commit.

Batch Pre-Allocation

Additionally, LevelDB and, potentially, other backends benefit from batch pre-allocation [6]. By limiting the maximum batch size to a configurable number of bytes, we can pre-allocate the batch according to that limit.
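As a concrete illustration, goleveldb (one of the backends available through cosmos-db) exposes a constructor that pre-allocates the batch's internal buffer; a minimal sketch, assuming the threshold is expressed in bytes:

import "github.com/syndtr/goleveldb/leveldb"

// Pre-allocate the batch's internal buffer up front so that repeated
// Set calls do not repeatedly grow and copy the underlying slice.
batch := leveldb.MakeBatch(flushThreshold) // flushThreshold: configured limit in bytes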

Suggested Design

Step 1: Define GetSizeBytes on the Batch interface

Refactor the cosmos-db Batch interface to add a method GetSizeBytes that returns the current size of the batch in bytes.
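A minimal sketch of the extended interface; the method set is abbreviated (the real cosmos-db Batch also has Delete, WriteSync, and Close), and the exact signature of the new method is open to discussion:

// Batch represents a group of writes applied atomically.
type Batch interface {
    Set(key, value []byte) error
    Write() error
    // ... other existing methods ...

    // GetSizeBytes returns the current size of the batch in bytes.
    GetSizeBytes() uint64
}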

Step 2: Implement FlushThresholdBatchDecorator

Introduce a new component called FlushThresholdBatchDecorator:

// FlushThresholdBatchDecorator is a
// decorator around a batch that flushes
// to disk as soon as the configurable limit
// is reached.
type FlushThresholdBatchDecorator struct {
    batch                   db.Batch // Batched writing buffer.
    batchSizeFlushThreshold uint64   // The maximum size of the batch in bytes before it gets flushed to disk.
}

var _ db.Batch = (*FlushThresholdBatchDecorator)(nil)

// WithFlushThreshold returns a new batch decorated with a flush threshold.
func WithFlushThreshold(batch db.Batch, flushThreshold uint64) db.Batch {
    // stub

    // Pre-allocate the underlying batch buffer to flushThreshold,
    // if applicable to the db backend.
    return &FlushThresholdBatchDecorator{
        batch:                   batch,
        batchSizeFlushThreshold: flushThreshold,
    }
}

// Set sets the value at the given key in the db.
// If the set grows the underlying batch to batchSizeFlushThreshold or beyond,
// the batch is flushed to disk, cleared, and a new one is created with its
// buffer pre-allocated to the threshold.
func (b *FlushThresholdBatchDecorator) Set(key, value []byte) error {
    // stub

    // N.B.: use `GetSizeBytes` to
    // keep track of the current batch size.
    return nil
}

// More db.Batch methods
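Hypothetical usage of the decorator, wrapping a raw backend batch before a large burst of writes; nodesToPersist is a stand-in for whatever write set the caller is saving:

rawBatch := database.NewBatch()                 // any db.Batch backend
batch := WithFlushThreshold(rawBatch, 100*1024) // flush roughly every 100KB

for _, n := range nodesToPersist { // hypothetical write set
    if err := batch.Set(n.key, n.value); err != nil {
        return err
    }
}
// Write marks the logical commit as complete (see the atomicity
// discussion further down in this thread).
if err := batch.Write(); err != nil {
    return err
}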

Step 3: Integrate FlushThresholdBatchDecorator into nodedb

  • Introduce an option for "maximum batch size" here (see the sketch after this list).

  • Propagate the option all the way to nodeDb.

  • Replace the batch here with the new decorated FlushThreshold batch.
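A sketch of how the option might be threaded through; the field and method names below are hypothetical, and the actual iavl plumbing may differ:

// Hypothetical addition to the iavl Options struct:
type Options struct {
    // ... existing options ...

    // FlushThreshold is the maximum batch size in bytes before an early
    // flush to disk. Zero preserves the current single-commit behavior.
    FlushThreshold uint64
}

// Inside nodeDB, wrap the raw batch whenever the option is set:
func (ndb *nodeDB) newBatch() db.Batch {
    b := ndb.db.NewBatch()
    if ndb.opts.FlushThreshold > 0 {
        return WithFlushThreshold(b, ndb.opts.FlushThreshold)
    }
    return b
}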

Step 4: Benchmark and Profile

Compare the performance difference from the old results.

We will need to implement a new benchmark that isolates the SaveVersion() method and run it against trees of different sizes, before and after the suggested design is implemented.
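A sketch of such a benchmark; makeTreeWithSize and mutateTree are hypothetical helpers that build a tree of the given size and dirty some nodes so that SaveVersion() has work to do:

import (
    "fmt"
    "testing"
)

func BenchmarkSaveVersion(b *testing.B) {
    for _, size := range []int{10000, 100000, 1000000} {
        b.Run(fmt.Sprintf("tree-size-%d", size), func(b *testing.B) {
            tree := makeTreeWithSize(b, size) // hypothetical setup helper
            b.ResetTimer()
            for i := 0; i < b.N; i++ {
                b.StopTimer()
                mutateTree(tree) // hypothetical: dirty some nodes
                b.StartTimer()
                if _, _, err := tree.SaveVersion(); err != nil {
                    b.Fatal(err)
                }
            }
        })
    }
}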

Step 5: Investigate running commits in-parallel

There is no reason to block execution on the overhead stemming from the initialization of multiple batches by the underlying LSM kv-store.

As a result, we can explore running the commits concurrently, or even in parallel.
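A sketch of what concurrent batch commits could look like, using golang.org/x/sync/errgroup; whether a given backend tolerates concurrent batch writes must be verified per backend:

import "golang.org/x/sync/errgroup"

// commitAll writes the prepared batches concurrently and returns the
// first error, if any. The logical commit is considered complete only
// once every batch (and, last of all, the root marker) is durable.
func commitAll(batches []db.Batch) error {
    var g errgroup.Group
    for _, b := range batches {
        b := b // capture the loop variable (needed before Go 1.22)
        g.Go(func() error { return b.Write() })
    }
    return g.Wait()
}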

@alexanderbez
Contributor

First of all, excellent write up @p0mvn and team. Very well written. Concise and easy to follow. Bravo 👏

  • I noticed in the description the notion of importing genesis state multiple times. While this is definitely a concern due to the large memory requirements, how does committing state in many batches, i.e. writes, affect reads here? I'm not making the connection. In other words, how do our write patterns affect our read patterns? Also note, genesis importing can now be streamed.
  • As for the proposal itself, it makes sense and I definitely approve! Curious whether individual batch commits affect the overall atomicity of the entire commit. I.e., what are the failure conditions if 1/N of the batches fails? How do we capture and represent that?
  • Finally, who's intending to work on this? I would love to help out.

@catShaark
Contributor

catShaark commented Nov 15, 2022

I noticed in the description the notion of importing genesis state multiple times. While this is definitely a concern due to the large memory requirements, how does committing state in many batches, i.e. writes, affect reads here? I'm not making the connection. In other words, how do our write patterns affect our read patterns? Also note, genesis importing can now be streamed.

From what I've read about LSM trees, I also think this concerns write performance rather than read performance.

As for the proposal itself, it makes sense and I definitely approve! Curious whether individual batch commits affect the overall atomicity of the entire commit. I.e., what are the failure conditions if 1/N of the batches fails? How do we capture and represent that?

I think we should treat each of the batches the same way we treat each module's commit.

Finally, who's intending to work on this? I would love to help out.

@p0mvn and I are working on the implementation of the proposal.

@p0mvn
Member Author

p0mvn commented Nov 15, 2022

Thank you, @alexanderbez. Great questions!

I noticed in the description the notion of importing genesis state multiple times. While this is definitely a concern due to the large memory requirements, how does committing state in many batches, i.e. writes, affect reads here? I'm not making the connection. In other words, how do our write patterns affect our read patterns? Also note, genesis importing can now be streamed.

In general, write patterns affect read patterns in LSM stores. The reason is the compactions happening behind the scenes. These compactions remove duplicate and deleted values (each delete is an insert with a tombstone value). If compactions happen less frequently, there are more SSTable files with redundant data. The more SSTable files we have, the more files we must search for old keys in the worst case. If we commit more frequently, we also compact and remove redundant data more frequently, leading to better reads.

That being said, the point above is orthogonal to the genesis import problem. Although we perform database commits more often, logically these commits still happen at the end of the block. As a result, the compactions are done around the same point in the execution flow.

The claim we are making about genesis import is that currently, when we commit a lot of data at once, all of that data is stored in memory under the hood. It then needs to be garbage collected, blocking execution completely. Moreover, with one large commit, we must handle a heavy compaction load all at once. Merging SSTables increases RAM usage even more. These factors are what make genesis import and Osmosis epoch processing halt for some time. I would say that this "stop the world" blocking is what affects reads in our case.

As for the proposal itself, it makes sense and I definitely approve! Curious whether individual batch commits affect the overall atomicity of the entire commit. I.e., what are the failure conditions if 1/N of the batches fails? How do we capture and represent that?

The logical atomicity should be unchanged. We treat the commit as complete only if all smaller batches are committed. If the node fails and exits mid-way, it should overwrite the partially committed data at restart. We can use the root hash as the marker that the commit is complete. As a future optimization, we could consider restarting the commit from the latest committed batch instead of overwriting, though that might be more challenging to implement.

Finally, who's intending to work on this? I would love to help out.

As @catShaark said above, he is leading the implementation, and I'm going to be helping out where needed. We are hoping for the change to be small and self-contained, so we are likely to drive this to completion in just a few PRs. We would appreciate PR reviews and design discussions.

@chillyvee
Contributor

It seems that even a partial 1/N of the batches would just be "unreferenced data" in the underlying DB until the root is written to disk.

By ensuring the root is written last, any early failure leaves only unreferenced data.

A re-import of genesis would at worst overwrite leaf/inner nodes with the exact same data and conclude by committing the root.

Is that understanding correct?

@alexanderbez
Contributor

I have strong concerns about partially written batches leaving data on disk that is unreferenced or not needed. The node replaying the block should, in theory, overwrite that data effectively, but it just doesn't seem like an effective way to deal with atomicity. Is there a way we can have a batch of batches? I.e., only commit the batches if all of them succeed?

@p0mvn
Member Author

p0mvn commented Nov 28, 2022

@chillyvee yes, that is correct. It doesn't have to be the root. It can be any value indicating that the commit has finished. The root is a good candidate since it is written last in IAVL.

@p0mvn
Member Author

p0mvn commented Nov 28, 2022

@alexanderbez In my opinion, what matters is logical atomicity. If we assume that a commit is complete only when a root (or another value indicating commit completion) is written, logical atomicity is preserved.

If we treat any partially-committed data as invalid data on replay, logical atomicity is also preserved.

It seems that your concern might be related to building a good abstraction around logical atomicity. If so, we can implement the following method on FlushThresholdBatchDecorator:

// Write writes the batch, marking the logical commit as complete. Only Close() may be called afterward.
// Before Write() is called, the batch is not considered committed. If the node fails
// before the batch's Write() is called, all data previously set on that batch must be recommitted when the node restarts.
func (b *FlushThresholdBatchDecorator) Write() error {
    // stub: write a value indicating the end of the logical commit
    return nil
}
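On the recovery side, a hypothetical startup check that treats a missing root as an incomplete, and therefore invalid, commit; needsRecommit and hasRoot are stand-ins for looking up the persisted root node of a version:

// needsRecommit reports whether the given version must be re-committed
// because its root marker was never written (i.e. the node died mid-commit).
func needsRecommit(ndb *nodeDB, version int64) (bool, error) {
    ok, err := ndb.hasRoot(version) // hypothetical lookup
    if err != nil {
        return false, err
    }
    // No root => the commit never finished; any partially flushed
    // batches for this version are ignored and overwritten on replay.
    return !ok, nil
}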

Please let me know what you think.

@tac0turtle
Member

this was merged

@ValarDragon
Contributor

My PR actually only did it in the genesis logic AFAIU

@ValarDragon ValarDragon reopened this May 17, 2023
@ValarDragon
Contributor

Oh my bad was merged here: #653

@catShaark
Contributor

@ValarDragon, @tac0turtle can you reopen this? It is not done yet.

@tac0turtle
Member

what is left? @catShaark

@catShaark
Contributor

We just created the logic; we still haven't integrated it into the IAVL tree.

@tac0turtle
Member

What is blocking? We will assign someone internally to do this, as it is taking too long at this point.

@catShaark
Contributor

Ohhh, it's not blocked now; it's just that my last PR took so long to be merged that I couldn't proceed.

@catShaark
Contributor

I should have made the next PR for integration when the last PR was merged last month, sorry.
