
Refactor One Large Commit Into Smaller Batches #619

Closed
p0mvn opened this issue Nov 10, 2022 · 16 comments

@p0mvn
Member

p0mvn commented Nov 10, 2022

Proposed in collaboration with: @ValarDragon @catShaark

Introduction

Our commit logic does not take into account the constraints of the underlying LSM key-value store backends. As a result, a lot of performance is wasted.

This issue is a proposal to refactor the commit logic to be more aligned with the underlying LSM backends.

Separate One Large Commit into Smaller Batches

We commit once per SDK store, at the end of every block, in ABCI's Commit(). Currently, this logical commit translates directly to a single atomic database commit per SDK store.

Every write incurs some "merkleization overhead" (all of the IAVL inner nodes which must be written). As a result, it is typical to have store writes of 1-10MB, reaching gigabytes for certain operations, for example importing blockchain genesis or processing the Osmosis epoch.

We claim here that separating one atomic database commit per store into several smaller atomic commits yields a significant performance benefit. We sketch the reasons below:

  • Committing 1GB atomically is much slower than doing 10,000 commits of size 100KB, potentially concurrently, with a 'wait group' telling us when all commits have finished (see the sketch after this list).
    • Ethereum has found that 100KB is optimal, and in general most databases are optimized for 100s of KBs to a few MBs [1].
  • Most relevant DB implementations use contiguous segments of memory for handling the changes related to a batch, so even describing these batches requires a large amount of RAM at once (which in Go translates into eventual stop-the-world garbage collection) [7].
  • A large batch composed of many small records can require twice as much memory when inserted into an LSM memtable as is required by the batch itself. Note that this causes a temporary increase in memory requirements, because the batch memory is not freed until the batch is completely committed [2].
  • There is compaction overhead. Compactions happen periodically and merge the LSM SSTables, keeping the number of SSTable files, and therefore read performance, within acceptable bounds. When we commit a large batch at once, we have to pay this overhead immediately, causing large spikes in RAM and CPU usage [3]. Some LSM designs may even cause OOM due to space amplification [4].
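To make the first point concrete, here is a minimal sketch of chunked, concurrent commits; kvPair, the local batch interface, and newBatch are hypothetical stand-ins for illustration, not part of the cosmos-db API:

import "sync"

// batch is a minimal stand-in for db.Batch (illustration only).
type batch interface {
    Set(key, value []byte) error
    Write() error
}

type kvPair struct{ key, value []byte }

// commitInChunks splits pairs into chunks of at most chunkBytes and commits
// each chunk as its own atomic batch, concurrently. The wait group tells us
// when every chunk has finished; the first error (if any) is reported.
func commitInChunks(pairs []kvPair, chunkBytes int, newBatch func() batch) error {
    var wg sync.WaitGroup
    errCh := make(chan error, 1)
    flush := func(b batch) {
        defer wg.Done()
        if err := b.Write(); err != nil {
            select {
            case errCh <- err:
            default:
            }
        }
    }
    b, size := newBatch(), 0
    for _, p := range pairs {
        if err := b.Set(p.key, p.value); err != nil {
            return err
        }
        size += len(p.key) + len(p.value)
        if size >= chunkBytes {
            wg.Add(1)
            go flush(b)
            b, size = newBatch(), 0
        }
    }
    wg.Add(1)
    go flush(b) // the final partial chunk
    wg.Wait()   // all commits have finished
    select {
    case err := <-errCh:
        return err
    default:
        return nil
    }
}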

The large commit problem is especially notable when importing genesis. Since it persists all genesis data in memory, Osmosis nodes require at least 64 GB of RAM to survive the import [5].

To mitigate these problems, we suggest saving to disk in smaller atomic segments instead of one large commit.

Batch Pre-Allocation

Additionally, LevelDB and, potentially, other backends benefit from batch pre-allocation [6]. By limiting the maximum batch size to a configurable number of bytes, we can pre-allocate the batch according to that limit.
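As a concrete illustration, goleveldb (one of the backends available through cosmos-db) exposes a constructor that pre-allocates the batch's internal buffer; a minimal sketch, assuming the threshold is expressed in bytes:

import "github.com/syndtr/goleveldb/leveldb"

// Pre-allocate the batch's internal buffer up front so that repeated
// Set calls do not repeatedly grow and copy the underlying slice.
batch := leveldb.MakeBatch(flushThreshold) // flushThreshold: configured limit in bytes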

Suggested Design

Step 1: Define GetSizeBytes on the Batch interface

Refactor the cosmos-db Batch interface to add a method GetSizeBytes that returns the current size of the batch in bytes.
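A minimal sketch of the extended interface; the method set is abbreviated (the real cosmos-db Batch also has Delete, WriteSync, and Close), and the exact signature of the new method is open to discussion:

// Batch represents a group of writes applied atomically.
type Batch interface {
    Set(key, value []byte) error
    Write() error
    // ... other existing methods ...

    // GetSizeBytes returns the current size of the batch in bytes.
    GetSizeBytes() uint64
}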

Step 2: Implement FlushThresholdBatchDecorator

Introduce a new component called FlushThresholdBatchDecorator:

// FlushThresholdBatchDecorator is a
// decorator around a batch that flushes
// to disk as soon as the configurable limit
// is reached.
type FlushThresholdBatchDecorator struct {
    batch                   db.Batch // Batched writing buffer.
    batchSizeFlushThreshold uint64   // The maximum size of the batch in bytes before it gets flushed to disk.
}

var _ db.Batch = (*FlushThresholdBatchDecorator)(nil)

// WithFlushThreshold returns a new batch decorated with a flush threshold.
func WithFlushThreshold(batch db.Batch, flushThreshold uint64) db.Batch {
    // stub

    // Pre-allocate the underlying batch buffer to flushThreshold,
    // if applicable to the db backend.
    return &FlushThresholdBatchDecorator{
        batch:                   batch,
        batchSizeFlushThreshold: flushThreshold,
    }
}

// Set sets the value at the given key in the db.
// If the set grows the underlying batch to batchSizeFlushThreshold or beyond,
// the batch is flushed to disk, cleared, and a new one is created with its
// buffer pre-allocated to the threshold.
func (b *FlushThresholdBatchDecorator) Set(key, value []byte) error {
    // stub

    // N.B.: use `GetSizeBytes` to
    // keep track of the current batch size.
    return nil
}

// More db.Batch methods
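Hypothetical usage of the decorator, wrapping a raw backend batch before a large burst of writes; nodesToPersist is a stand-in for whatever write set the caller is saving:

rawBatch := database.NewBatch()                 // any db.Batch backend
batch := WithFlushThreshold(rawBatch, 100*1024) // flush roughly every 100KB

for _, n := range nodesToPersist { // hypothetical write set
    if err := batch.Set(n.key, n.value); err != nil {
        return err
    }
}
// Write marks the logical commit as complete (see the atomicity
// discussion further down in this thread).
if err := batch.Write(); err != nil {
    return err
}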

Step 3: Integrate FlushThresholdBatchDecorator into nodedb

  • Introduce an option for "maximum batch size" here (see the sketch after this list).

  • Propagate the option all the way to nodeDb.

  • Replace the batch here with the new decorated FlushThreshold batch.
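A sketch of how the option might be threaded through; the field and method names below are hypothetical, and the actual iavl plumbing may differ:

// Hypothetical addition to the iavl Options struct:
type Options struct {
    // ... existing options ...

    // FlushThreshold is the maximum batch size in bytes before an early
    // flush to disk. Zero preserves the current single-commit behavior.
    FlushThreshold uint64
}

// Inside nodeDB, wrap the raw batch whenever the option is set:
func (ndb *nodeDB) newBatch() db.Batch {
    b := ndb.db.NewBatch()
    if ndb.opts.FlushThreshold > 0 {
        return WithFlushThreshold(b, ndb.opts.FlushThreshold)
    }
    return b
}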

Step 4: Benchmark and Profile

Compare the performance difference from the old results.

We will need to implement a new benchmark that isolates the SaveVersion() method and run it against trees of different sizes, before and after the suggested design is implemented.
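A sketch of such a benchmark; makeTreeWithSize and mutateTree are hypothetical helpers that build a tree of the given size and dirty some nodes so that SaveVersion() has work to do:

import (
    "fmt"
    "testing"
)

func BenchmarkSaveVersion(b *testing.B) {
    for _, size := range []int{10000, 100000, 1000000} {
        b.Run(fmt.Sprintf("tree-size-%d", size), func(b *testing.B) {
            tree := makeTreeWithSize(b, size) // hypothetical setup helper
            b.ResetTimer()
            for i := 0; i < b.N; i++ {
                b.StopTimer()
                mutateTree(tree) // hypothetical: dirty some nodes
                b.StartTimer()
                if _, _, err := tree.SaveVersion(); err != nil {
                    b.Fatal(err)
                }
            }
        })
    }
}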

Step 5: Investigate running commits in-parallel

There is no reason to block execution on the overhead stemming from the initialization of multiple batches by the underlying LSM kv-store.

As a result, we can explore running the commits concurrently, or even in parallel.
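A sketch of what concurrent batch commits could look like, using golang.org/x/sync/errgroup; whether a given backend tolerates concurrent batch writes must be verified per backend:

import "golang.org/x/sync/errgroup"

// commitAll writes the prepared batches concurrently and returns the
// first error, if any. The logical commit is considered complete only
// once every batch (and, last of all, the root marker) is durable.
func commitAll(batches []db.Batch) error {
    var g errgroup.Group
    for _, b := range batches {
        b := b // capture the loop variable (needed before Go 1.22)
        g.Go(func() error { return b.Write() })
    }
    return g.Wait()
}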

@alexanderbez
Contributor

First of all, excellent write up @p0mvn and team. Very well written. Concise and easy to follow. Bravo 👏

  • I noticed in the description the notion of importing genesis state multiple times. While this is definitely a concern due to the large memory requirements, how does committing state in many batches, i.e. writes, affect reads here? I'm not making the connection. In other words, how do our write patterns affect our read patterns? Also note, genesis importing can now be streamed.
  • As for the proposal itself, it makes sense and I definitely approve! Curious whether individual batch commits affect the overall atomicity of the entire commit. I.e., what are the failure conditions if 1/N of the batches fails? How do we capture and represent that?
  • Finally, who's intending to work on this? I would love to help out.

@catShaark
Contributor

catShaark commented Nov 15, 2022

I noticed in the description the notion of importing genesis state multiple times. While this is definitely a concern due to the large memory requirements, how does committing state in many batches, i.e. writes, affect reads here? I'm not making the connection. In other words, how do our write patterns affect our read patterns? Also note, genesis importing can now be streamed.

From what I've read about LSM trees, I also think this concerns write performance rather than read performance.

As for the proposal itself, it makes sense and I definitely approve! Curious whether individual batch commits affect the overall atomicity of the entire commit. I.e., what are the failure conditions if 1/N of the batches fails? How do we capture and represent that?

I think we should treat each of the batches the same way we treat each module's commit.

Finally, who's intending to work on this? I would love to help out.

@p0mvn and I are working on the implementation of the proposal.

@p0mvn
Member Author

p0mvn commented Nov 15, 2022

Thank you, @alexanderbez. Great questions!

I noticed in the description the notion of importing genesis state multiple times. While this is definitely a concern due to the large memory requirements, how does committing state in many batches, i.e. writes, affect reads here? I'm not making the connection. In other words, how do our write patterns affect our read patterns? Also note, genesis importing can now be streamed.

In general, write patterns affect read patterns in LSM stores. The reason is the compactions happening behind the scenes. These compactions remove duplicate and deleted values (each delete is an insert with a tombstone value). If compactions happen less frequently, there are more SSTable files with redundant data. The more SSTable files we have, the more files we must search for old keys in the worst case. If we commit more frequently, we also compact and remove redundant data more frequently, leading to better reads.

That being said, the point above is orthogonal to the genesis import problem. Although we perform database commits more often, logically these commits still happen at the end of the block. As a result, the compactions are done around the same point in the execution flow.

The claim we are making about genesis import is that currently, when we commit a lot of data at once, all of that data is stored in memory under the hood. It then needs to be garbage collected, blocking execution completely. Moreover, with one large commit, we must handle a heavy compaction load all at once. Merging SSTables increases RAM usage even more. These factors are what make genesis import and Osmosis epoch processing halt for some time. I would say that this "stop the world" blocking is what affects reads in our case.

As for the proposal itself, it makes sense and I definitely approve! Curious whether individual batch commits affect the overall atomicity of the entire commit. I.e., what are the failure conditions if 1/N of the batches fails? How do we capture and represent that?

The logical atomicity should be unchanged. We treat the commit as complete only if all smaller batches are committed. If the node fails and exits mid-way, it should overwrite the partially committed data at restart. We can use the root hash as the marker that the commit is complete. As a future optimization, we could consider restarting the commit from the latest committed batch instead of overwriting, though that might be more challenging to implement.

Finally, who's intending to work on this? I would love to help out.

As @catShaark said above, he is leading the implementation, and I'm going to be helping out where needed. We are hoping for the change to be small and self-contained, so we are likely to drive this to completion in just a few PRs. We would appreciate PR reviews and design discussions.

@chillyvee
Contributor

It seems that even a partial 1/N of the batches would just be "unreferenced data" in the underlying DB until the root is written to disk.

By ensuring the root is written last, any early failure leaves only unreferenced data.

A re-import of genesis would at worst overwrite leaf/inner nodes with the exact same data and conclude by committing the root.

Is that understanding correct?

@alexanderbez
Contributor

I have strong concerns about partially written batches leaving data on disk that is unreferenced or not needed. The node replaying the block should, in theory, overwrite that data effectively, but it just doesn't seem like an effective way to deal with atomicity. Is there a way we can have a batch of batches? I.e., only commit the batches if all of them succeed?

@p0mvn
Member Author

p0mvn commented Nov 28, 2022

@chillyvee yes, that is correct. It doesn't have to be the root. It can be any value indicating that the commit has finished. The root is a good candidate since it is written last in IAVL.

@p0mvn
Member Author

p0mvn commented Nov 28, 2022

@alexanderbez In my opinion, what matters is logical atomicity. If we assume that a commit is complete only when a root (or another value indicating commit completion) is written, logical atomicity is preserved.

If we treat any partially-committed data as invalid data on replay, logical atomicity is also preserved.

It seems that your concern might be related to building a good abstraction around logical atomicity. If so, we can implement the following method on FlushThresholdBatchDecorator:

// Write writes the batch, marking the logical commit as complete. Only Close() may be called afterward.
// Before Write() is called, the batch is not considered committed. If the node fails
// before the batch's Write() is called, all data previously set on that batch must be recommitted when the node restarts.
func (b *FlushThresholdBatchDecorator) Write() error {
    // stub: write a value indicating the end of the logical commit
    return nil
}
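On the recovery side, a hypothetical startup check that treats a missing root as an incomplete, and therefore invalid, commit; needsRecommit and hasRoot are stand-ins for looking up the persisted root node of a version:

// needsRecommit reports whether the given version must be re-committed
// because its root marker was never written (i.e. the node died mid-commit).
func needsRecommit(ndb *nodeDB, version int64) (bool, error) {
    ok, err := ndb.hasRoot(version) // hypothetical lookup
    if err != nil {
        return false, err
    }
    // No root => the commit never finished; any partially flushed
    // batches for this version are ignored and overwritten on replay.
    return !ok, nil
}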

Please let me know what you think.

@tac0turtle
Member

this was merged

@ValarDragon
Contributor

My PR actually only did it in the genesis logic AFAIU

@ValarDragon ValarDragon reopened this May 17, 2023
@ValarDragon
Contributor

Oh my bad was merged here: #653

@catShaark
Contributor

@ValarDragon, @tac0turtle can you reopen this? It is not done yet.

@tac0turtle
Member

what is left? @catShaark

@catShaark
Contributor

We just created the logic; we still haven't integrated it into the IAVL tree.

@tac0turtle
Member

What is blocking? We will assign someone internally to do this, as it is taking too long at this point.

@catShaark
Contributor

Ohhh, it's not blocked now; it's just that my last PR took so long to be merged that I couldn't proceed.

@catShaark
Contributor

I should have made the next PR for integration when the last PR was merged last month, sorry.
