Move account data to persistent storage #2279
I think the thread count in the replay stage or the process transactions stage would need to be equal to or higher than the ‘q’ setting on the NVMe drives. |
a38dca6 to a37ab7d
@sambley did you get a chance to test it on our 2-NVMe machine? |
@sambley, the text you have under "Problem" in the PR description doesn't describe a problem. It's a summary of a solution. What problem are you solving precisely? At what point does the RAM usage of accounts affect a metric? Or what metric will improve if we merge this PR? |
@garious the problem is that RAM costs more than SSDs. 16GB of RAM costs the same as 500GB of high-speed NVMe, so that is about a 30x improvement in cost per full node. Multiply that by 16k fullnodes for an Ethereum-sized network. |
@aeyakovenko, our cost per fullnode goal is 5k USD. What's the current cost? With a 30x improvement for this particular component, what does that drop the cost to? |
@garious cost per allocated byte would be roughly 0.000121875 U.S. dollars / byte for 15,000 nodes at $130 per 16GB. But you can't really build systems with more than 128GB per system cheaply. Motherboards that support more RAM are either more expensive or not as flexible for GPUs or other components. It is about $15k for 1TB of DDR per system, and that board doesn't support GPUs. So the price per byte is likely to be higher, since the maximum size of the entire account space is set by the smallest node in the supermajority (desired finality size). |
@garious at 128GB per system, we end up with a maximum account count of about 1 billion, and only if we can optimize the Account instance allocation to fit entirely into 128 bytes. It might be doable, but it is going to be hard. |
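The arithmetic in the two comments above can be checked with a short sketch. It assumes 1 GB = 10^9 bytes and reads the per-byte figure as the cost of holding one account byte in RAM on every one of the 15,000 nodes. The function names are hypothetical and not part of this PR.

```rust
/// Dollars per byte of RAM for a module of `gb` gigabytes costing `price_usd`.
/// Assumption: 1 GB = 10^9 bytes.
fn ram_cost_per_byte(price_usd: f64, gb: f64) -> f64 {
    price_usd / (gb * 1e9)
}

/// Cost of keeping one account byte in RAM on every node of the network.
fn network_cost_per_byte(price_usd: f64, gb: f64, nodes: f64) -> f64 {
    ram_cost_per_byte(price_usd, gb) * nodes
}

/// Upper bound on the number of accounts that fit in `ram_gb` of RAM
/// if each account occupies exactly `bytes_per_account` bytes.
/// `bytes_per_account` must be greater than zero.
fn max_accounts(ram_gb: u64, bytes_per_account: u64) -> u64 {
    ram_gb * 1_000_000_000 / bytes_per_account
}
```

With $130 per 16GB and 15,000 nodes, `network_cost_per_byte` gives about 0.000121875 dollars per byte, and `max_accounts(128, 128)` gives 1,000,000,000, matching the "about 1 billion" estimate.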
@anatoly, I tried it out on the 2-NVMe machine today. Average TPS is roughly halved when the number of accounts is close to 100,000, and it seems to degrade quite drastically for larger numbers of accounts. Still looking into what is causing the degradation. |
@anatoly, yes I will play around with the settings |
@sambley, my GitHub handle is @aeyakovenko. I wish I was @anatoly :) |
cc #1884 |
@sambley, my guess is performance is better while the file system is in the Linux ram cache. |
@sambley, does this PR assume SSDs are available? Or is there some way to get the original behavior when there are no SSDs available (like on a developer machine)? |
@garious we can figure out how to factor out the |
@aeyakovenko I haven't formatted the drives, @sambley should have access to the machine now though. *edit saw that he already ran the experiment so it's a question for @sambley |
@aeyakovenko, I formatted the drives as ext4, so journaling should be enabled. I have rewritten the implementation to use memory-mapped I/O, and that seems to perform on par for at least 100,000 accounts, as expected. I will try it with larger numbers of accounts and see how it behaves. |
@sambley Awesome! What TPS are you seeing? Journaling might be significantly worse for writes in some cases. We would need to profile with both. I think there might be a bunch of parameters to tune there too. |
@aeyakovenko, it's hitting a mean TPS of only 30K. I will experiment with tuning the different parameters to see which one gives better results. |
@sambley the spec for those lists 500,000 random writes per second at QD32. How many reads and writes are we doing per tx? Can you profile an mmap file on those devices as well? |
@sambley, another thing to try would be to append any new accounts, like one file per bank thread, and store the file+index offset in the tree. Appending is well optimized by all the drivers and the hardware. |
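A minimal sketch of the append idea suggested above, assuming one append-only file per writer thread and a `BTreeMap` standing in for "the tree". `AppendFile`, `Location`, `Index`, and the `accounts.N` file naming are hypothetical and not taken from this PR.

```rust
use std::collections::BTreeMap;
use std::fs::{File, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::Path;

/// Where an account's bytes live: which append file, and at what byte offset.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Location {
    file_id: usize,
    offset: u64,
    len: u32,
}

/// Hypothetical index: 32-byte pubkey -> latest location of that account.
type Index = BTreeMap<[u8; 32], Location>;

/// One append-only file, intended to be owned by a single writer thread.
struct AppendFile {
    file_id: usize,
    file: File,
    end: u64, // current end of file, i.e. where the next record goes
}

impl AppendFile {
    /// Create (or truncate) the file `accounts.<file_id>` inside `dir`.
    fn create(dir: &Path, file_id: usize) -> io::Result<Self> {
        let path = dir.join(format!("accounts.{}", file_id));
        let file = OpenOptions::new()
            .create(true)
            .read(true)
            .write(true)
            .truncate(true)
            .open(path)?;
        Ok(Self { file_id, file, end: 0 })
    }

    /// Write `data` at the end of the file and return where it landed.
    /// Sketch only: records larger than u32::MAX bytes are not handled.
    fn append(&mut self, data: &[u8]) -> io::Result<Location> {
        let loc = Location { file_id: self.file_id, offset: self.end, len: data.len() as u32 };
        self.file.seek(SeekFrom::Start(self.end))?;
        self.file.write_all(data)?;
        self.end += data.len() as u64;
        Ok(loc)
    }

    /// Read back the bytes stored at `loc`.
    fn read(&mut self, loc: Location) -> io::Result<Vec<u8>> {
        let mut buf = vec![0u8; loc.len as usize];
        self.file.seek(SeekFrom::Start(loc.offset))?;
        self.file.read_exact(&mut buf)?;
        Ok(buf)
    }
}
```

Under this scheme an update never overwrites in place: the new version is appended and the index entry for the pubkey is pointed at the new `Location`. Reclaiming the space held by stale versions is left out of the sketch.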
@garious, updated patch with your other review comments. |
This is an amazing contribution. Thanks so much!
@sakridge, still waiting for changes? |
beb73fd to 560d8a0
@garious it doesn't do the fallback to the hashmap-only implementation. It also takes over the Bank::id with another value. I'm not sure if that's safe, because I don't know what exactly we were using it for before. |
@sambley sweet, thanks! |
@sambley we also decided the fallback to hashmap-only implementation is not necessary today, so prioritize that last. |
218a25b to 5ef7c3a
Add Accounts::has_accounts function for hash_internal_state calculation.
- Fix format check warnings
Also reduce some code duplication with cleanup_dirs fn.
This looks good to me. I'm okay with this being merged if @sakridge approves. The quantity of |
Seems good to me.
Problem
RAM is much more expensive than SSD storage, which adds to the cost of operating a full node (16GB of RAM costs the same as 500GB of high-speed NVMe SSDs). Look into ways to reduce RAM usage by moving some of the data onto SSDs and loading / storing it on demand.
Summary of Changes
Implements #2769
To help reduce the RAM usage of the nodes, persist accounts across NVMe SSDs and load / store them from the SSDs on demand.
Store account information across two files: Index and Data
Index: Contains offset into data
Data: Contains the length followed by the account data
The accounts are split across NVMe SSDs using the pubkey as the key.
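A rough sketch of the Index / Data layout described above. It assumes an 8-byte little-endian length prefix, uses an in-memory `Vec<u8>` in place of the data file, and picks a drive from the first pubkey byte modulo the drive count. All three are illustrative assumptions, and the PR's actual encoding and key-to-drive mapping may differ.

```rust
/// Append one record to the "data file": 8-byte little-endian length,
/// followed by the account bytes. Returns the offset of the record,
/// which is the value the index would store for this account.
fn append_record(data_file: &mut Vec<u8>, account: &[u8]) -> u64 {
    let offset = data_file.len() as u64;
    data_file.extend_from_slice(&(account.len() as u64).to_le_bytes());
    data_file.extend_from_slice(account);
    offset
}

/// Read the record that starts at `offset`, or None if the offset or the
/// stored length points outside the data file.
fn read_record(data_file: &[u8], offset: u64) -> Option<&[u8]> {
    let start = offset as usize;
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(data_file.get(start..start.checked_add(8)?)?);
    let len = u64::from_le_bytes(len_bytes) as usize;
    let body = start + 8;
    data_file.get(body..body.checked_add(len)?)
}

/// Choose which drive holds a pubkey (assumed mapping: first byte modulo
/// the number of drives). `num_drives` must be greater than zero.
fn shard_for(pubkey: &[u8; 32], num_drives: usize) -> usize {
    pubkey[0] as usize % num_drives
}
```

A lookup would first call `shard_for` to find the drive, then read the offset from that drive's index, then call `read_record` on the matching data file.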
TODOs:
Snapshots and version numbering are not planned for this release.
Fixes #2499