Refactor One Large Commit Into Smaller Batches #619
Comments
First of all, excellent write-up @p0mvn and team. Very well written. Concise and easy to follow. Bravo 👏
From what I read about LSM, I also think this should be write performance instead of read performance.
I think we should treat each of the batches the same way we treat each module's commit.
@p0mvn and I are working on the implementation of the proposals.
Thank you @alexanderbez. Great questions!
In general, write patterns affect read patterns in LSM stores. The reason is the compactions happening behind the scenes. Compactions remove duplicate and deleted values (each delete is an insert with a tombstone value). If compactions happen less frequently, there are more SSTable files with redundant data, and the more SSTable files we have, the more we have to search for old keys in the worst case. If we commit more frequently, we also compact and remove redundant data more frequently, leading to better reads.
That being said, the point above is orthogonal to the import genesis problem. Although we perform database commits more often, logically these commits still happen at the end of the block. As a result, the compactions are done around the same point in the execution flow.
The claim we are making with import genesis is that, currently, when we commit a lot of data at once, all that data is stored in memory under the hood. It then needs to be garbage collected, blocking the execution completely. Moreover, with one large commit, we must handle a heavy compaction load all at once, and merging SSTables increases RAM usage even more. These are the reasons genesis import and Osmosis epoch processing halt for some time. I would say that this "stop the world" blocking is what is affecting reads in our case.
The logical atomicity should be unchanged. We treat the commit as complete only if all smaller batches are committed. If the node fails and exits mid-way, it should overwrite the partially committed data on restart. We can use the root hash as the marker that "the commit is complete". As a future optimization, we can consider restarting the commit from the latest committed batch instead of overwriting, though that might be more challenging to implement.
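To make this concrete, below is a minimal Go sketch of the "root as completion marker" check a node could run on restart. The kvStore interface, the rootKey layout, and the hasCompleteCommit helper are hypothetical illustrations, not the actual IAVL/nodeDB API.

```go
package commit

import "fmt"

// kvStore is a stand-in for the node database; Get returns nil when the key is absent.
type kvStore interface {
	Get(key []byte) ([]byte, error)
}

// rootKey is a hypothetical key layout for the root hash of a given version.
func rootKey(version int64) []byte {
	return []byte(fmt.Sprintf("r/%d", version))
}

// hasCompleteCommit reports whether `version` was fully committed. Because the
// root is written last, its presence means every smaller batch before it landed;
// if it is absent, any partially written nodes from that version are treated as
// invalid and simply overwritten on replay.
func hasCompleteCommit(db kvStore, version int64) (bool, error) {
	root, err := db.Get(rootKey(version))
	if err != nil {
		return false, err
	}
	return root != nil, nil
}
```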
As @catShaark said above, he is leading the implementation, and I'm going to be helping out where needed. We are hoping for the change to be small and self-contained, so we are likely to drive this to completion in just a few PRs. We would appreciate PR reviews and design discussions.
It seems even a partial 1/N batches would be "unreferenced data" in the underlying DB until the root is written to disk. By ensuring the root is written last, any early failure would leave only unreferenced data. A re-import of genesis would at worst overwrite leaf/inner nodes with the exact same data and conclude by committing the root. Is that understanding correct?
I have strong concerns with partially written batches leaving data on disk that is unreferenced or not needed. The node replaying the block in theory should overwrite that data effectively, but it just doesn't seem like an effective way to deal with atomicity. Is there a way we can have a batch of batches? I.e. only commit the batches if all of them succeed?
@chillyvee yes, that is correct. It doesn't have to be the root. It can be any value indicating that "the commit has finished". The root is a good candidate for that since it is written last in IAVL.
@alexanderbez In my opinion, what matters is logical atomicity. If we assume that a commit is complete only when a root (or another value indicating commit completion) is written, logical atomicity is preserved. If we treat any partially-committed data as invalid data on replay, logical atomicity is also preserved. It seems that your concern might be related to building a good abstraction around logical atomicity. If so, we can implement the following method on FlushThresholdBatchDecorator:

```go
// Write writes the batch, marking it as complete. Only Close() can be called afterward.
// Prior to calling Write(), the batch is not considered to be committed. If the node fails
// before the batch calls Write(), all data previously set on that batch must be recommitted when the node restarts.
func (b *FlushThresholdBatchDecorator) Write() {
	// write a value indicating the end of the logical commit
}
```

Please let me know what you think.
This was merged.
My PR actually only did it in the genesis logic, AFAIU.
Oh, my bad, it was merged here: #653
@ValarDragon, @tac0turtle can you guys reopen this? It is not done yet.
What is left? @catShaark
We just created the logic; we still haven't integrated it into the IAVL tree.
What is blocking? We will assign someone internally to do this, as it is taking too long at this point.
Ohhh, it's not blocked now; it's just that my last PR took so long to be merged that I couldn't proceed.
I should have made the next PR for integration when the last PR was merged last month, sorry.
Proposed in collaboration with: @ValarDragon @catShaark
Introduction
Our commit logic does not take into account the constraints of the underlying LSM key-value store backends. As a result, a lot of performance is wasted.
This issue is a proposal to refactor the commit logic to be more aligned with the underlying LSM backends.
Separate One Large Commit into Smaller Batches
We commit once per SDK store, at the end of every block, in ABCI's Commit(). Currently, this logical commit translates directly to a single atomic database commit per SDK store.
Every write incurs some "merkleization overhead" (all of the IAVL inner nodes that must be written). As a result, it is typical to have store writes of 1-10 MB, reaching gigabytes in certain operations, for example importing blockchain genesis or processing the Osmosis Epoch.
We claim here that separating one atomic database commit per store into several smaller atomic commits is of significant performance benefit. We sketch the reasons below:
The large commit problem is especially notable when importing genesis. Since the import persists all genesis data in memory, Osmosis nodes require at least 64 GB of RAM to survive it [5].
To mitigate these problems, we suggest saving to disk in smaller atomic segments instead of one large commit.
Batch Pre-Allocation
Additionally, LevelDB and, potentially, other backends benefit from batch pre-allocation [6]. By limiting the maximum batch size to a configurable number of bytes, we can pre-allocate each batch according to that limit.
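As a rough illustration of what pre-allocation buys us with the LevelDB backend, the sketch below uses goleveldb directly. The 100 KiB threshold is an arbitrary example; the real value would come from configuration.

```go
package main

import (
	"log"

	"github.com/syndtr/goleveldb/leveldb"
)

func main() {
	db, err := leveldb.OpenFile("demo.db", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Assumed flush threshold of 100 KiB; in practice this would be configurable.
	const maxBatchSizeBytes = 100 * 1024

	// MakeBatch pre-sizes the batch's internal buffer, so appending nodes does
	// not trigger repeated re-allocations while the batch grows toward the limit.
	batch := leveldb.MakeBatch(maxBatchSizeBytes)
	batch.Put([]byte("node/abc"), []byte("serialized inner node"))
	batch.Put([]byte("node/def"), []byte("serialized leaf node"))

	if err := db.Write(batch, nil); err != nil {
		log.Fatal(err)
	}
}
```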
Suggested Design
Step 1: Define GetSizeBytes on the Batch interface
Refactor the cosmos-db Batch interface to have a method GetSizeBytes that returns the current size of the batch in bytes.
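A sketch of how the extended interface could look. The existing method set is paraphrased from cosmos-db and may differ slightly upstream, and the (int, error) return signature of GetSizeBytes is an assumption rather than settled API.

```go
package db

// Batch is a write-only batch of operations, roughly following the cosmos-db
// interface (existing methods paraphrased; they may differ slightly upstream).
type Batch interface {
	Set(key, value []byte) error
	Delete(key []byte) error
	Write() error
	WriteSync() error
	Close() error

	// GetSizeBytes is the proposed addition: the current encoded size of the
	// batch in bytes, so callers can decide when to flush. The (int, error)
	// signature is an assumption, not settled API.
	GetSizeBytes() (int, error)
}
```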
Step 2: Implement FlushThresholdBatchDecorator
Introduce a new component called FlushThresholdBatchDecorator:
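A rough sketch of the decorator's core idea, building on the Batch interface from the Step 1 sketch. Only the interesting methods are shown; field names and error handling are illustrative, not a final design.

```go
package db

// FlushThresholdBatchDecorator wraps an underlying Batch and flushes it to disk
// whenever the accumulated size crosses flushThresholdBytes, so one logical
// commit becomes several smaller atomic database writes. A full decorator would
// forward the rest of the Batch interface as well.
type FlushThresholdBatchDecorator struct {
	newBatch            func() Batch // creates a fresh underlying batch after a flush
	batch               Batch        // current underlying batch
	flushThresholdBytes int
}

// Set forwards the write to the current batch and flushes early if the threshold is reached.
func (b *FlushThresholdBatchDecorator) Set(key, value []byte) error {
	if err := b.batch.Set(key, value); err != nil {
		return err
	}
	return b.maybeFlush()
}

// maybeFlush performs an intermediate flush: the data reaches disk, but the
// commit is not logically complete until the final marker (e.g. the root) is written.
func (b *FlushThresholdBatchDecorator) maybeFlush() error {
	size, err := b.batch.GetSizeBytes()
	if err != nil {
		return err
	}
	if size < b.flushThresholdBytes {
		return nil
	}
	if err := b.batch.Write(); err != nil {
		return err
	}
	if err := b.batch.Close(); err != nil {
		return err
	}
	b.batch = b.newBatch()
	return nil
}

// Write flushes whatever is left; the caller then records the root (or another
// completion marker) so a restart can tell finished commits from partial ones.
func (b *FlushThresholdBatchDecorator) Write() error {
	return b.batch.Write()
}
```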
Step 3: Integrate FlushThresholdBatchDecorator into nodedb
Introduce an option for "maximum batch size" here.
Propagate the option all the way to nodeDb.
Replace the batch here with the new decorated FlushThreshold batch.
Step 4: Benchmark and Profile
Compare the performance difference from the old results.
The profiles taken on SDK with IAVL v0.19.4: https://drive.google.com/drive/u/0/folders/1eMenapAeEjIJmiSEHrqQtoQ_g1KCN0YJ
SDK branch from where profiles were taken: https://github.com/osmosis-labs/cosmos-sdk/tree/roman/profiles
We will need to implement a new benchmark that isolates the SaveVersion() method and runs it against trees of different sizes, before and after the suggested design is implemented.
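A possible shape for such a benchmark is sketched below. The iavl and tm-db calls are written from memory for the v0.19.x line, so the constructor and Set signatures may need adjusting; the point is the structure: build the tree and stage writes outside the timed section, then time only SaveVersion().

```go
package iavl_test

import (
	"fmt"
	"testing"

	"github.com/cosmos/iavl"
	dbm "github.com/tendermint/tm-db"
)

// BenchmarkSaveVersion times only SaveVersion() for trees of several sizes.
// Constructor/Set signatures are assumptions based on the v0.19.x line.
func BenchmarkSaveVersion(b *testing.B) {
	for _, numKeys := range []int{10_000, 100_000, 1_000_000} {
		b.Run(fmt.Sprintf("keys=%d", numKeys), func(b *testing.B) {
			tree, err := iavl.NewMutableTree(dbm.NewMemDB(), 10_000)
			if err != nil {
				b.Fatal(err)
			}
			// Build the initial tree outside the timed section.
			for i := 0; i < numKeys; i++ {
				if _, err := tree.Set([]byte(fmt.Sprintf("key-%d", i)), []byte("value")); err != nil {
					b.Fatal(err)
				}
			}
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				b.StopTimer()
				// Touch a tenth of the keys so each iteration commits real work.
				for j := 0; j < numKeys/10; j++ {
					if _, err := tree.Set([]byte(fmt.Sprintf("key-%d", j)), []byte(fmt.Sprintf("v-%d", i))); err != nil {
						b.Fatal(err)
					}
				}
				b.StartTimer()
				if _, _, err := tree.SaveVersion(); err != nil {
					b.Fatal(err)
				}
			}
		})
	}
}
```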
Step 5: Investigate running commits in parallel
There is no reason to block execution on the overhead stemming from the initialization of multiple batches by the underlying LSM kv-store.
As a result, we can explore running the commits concurrently, or even in parallel.
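As a starting point for that investigation, here is a minimal sketch of writing per-store (or per-segment) batches concurrently using errgroup. The storeBatch stand-in and the writeBatchesConcurrently helper are hypothetical, and whether concurrent writes are actually safe depends on the backend's concurrency guarantees, which is exactly what this step proposes to investigate.

```go
package commit

import "golang.org/x/sync/errgroup"

// storeBatch is a stand-in for a per-store (or per-segment) write batch.
type storeBatch interface {
	Write() error
}

// writeBatchesConcurrently writes each batch on its own goroutine and reports
// success only if every write finished without error.
func writeBatchesConcurrently(batches []storeBatch) error {
	var g errgroup.Group
	for _, b := range batches {
		b := b // capture the loop variable (needed before Go 1.22)
		g.Go(func() error {
			return b.Write()
		})
	}
	// The logical commit should be treated as incomplete unless Wait returns nil.
	return g.Wait()
}
```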