Analysis of status quo
Right now, chain data and state data live in a ginormous monolithic store.
This store only grows; it never shrinks. Lotus performs no active management of this store, and no GC.
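For reference, the underlying Badger store only reclaims space when value-log GC is invoked explicitly, which Lotus does not do today. Below is a minimal sketch of what a periodic GC loop against a Badger-backed store could look like; the path, interval, and discard ratio are arbitrary assumptions.

```go
package main

import (
	"log"
	"time"

	badger "github.com/dgraph-io/badger/v2"
)

// runPeriodicGC repeatedly asks Badger to rewrite value-log files whose
// discardable fraction exceeds discardRatio. Badger only reclaims space when
// this is invoked explicitly.
func runPeriodicGC(db *badger.DB, interval time.Duration, discardRatio float64) {
	for range time.Tick(interval) {
		// Keep collecting until Badger reports there is nothing left to rewrite.
		for {
			err := db.RunValueLogGC(discardRatio)
			if err == badger.ErrNoRewrite {
				break
			}
			if err != nil {
				log.Printf("value log GC failed: %s", err)
				break
			}
		}
	}
}

func main() {
	// The path is illustrative, not the actual lotus datastore location.
	db, err := badger.Open(badger.DefaultOptions("/var/lib/lotus/chain.db"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	go runPeriodicGC(db, 30*time.Minute, 0.5)
	select {} // stand-in for the node's main loop
}
```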
Lotus doesn't offer configurable retention policies. The user can't specify, for example (a hypothetical configuration sketch follows this list):
please only retain the full chain, but only state objects from the last 50000 epochs
please only retain chain and state data from the current and the previous finality ranges
please retain everything
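To make this concrete, here is the hypothetical configuration sketch referred to above. None of these types or fields exist in Lotus today; the names are invented purely for illustration.

```go
package config

// RetentionPolicy is a hypothetical per-store retention configuration. It is
// not part of the current Lotus config; it only illustrates the kind of knobs
// discussed above.
type RetentionPolicy struct {
	// RetainChain keeps the full chain (headers, messages, receipts) when true.
	RetainChain bool
	// StateEpochs keeps state objects for the most recent N epochs.
	// 0 means "retain everything".
	StateEpochs uint64
	// ChainEpochs optionally bounds chain data as well; 0 means unbounded.
	ChainEpochs uint64
}

// Example policies matching the wishes listed above.
var (
	// "retain the full chain, but only state objects from the last 50000 epochs"
	HotNode = RetentionPolicy{RetainChain: true, StateEpochs: 50000}
	// "retain chain and state data only from the current and previous finality ranges"
	MinimalNode = RetentionPolicy{ChainEpochs: 2 * 900, StateEpochs: 2 * 900}
	// "retain everything"
	ArchiveNode = RetentionPolicy{RetainChain: true}
)
```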
We do not offer integration with snapshot services. The current snapshot facility is rather makeshift and requires very active user intervention (manually download a snapshot, stop the node, back up the store, replace it, start the node, etc.). The user experience is... improvable.
Snapshots are becoming a necessary element in Filecoin. The experience needs to be integrated end-to-end from a product perspective.
That said, snapshots are NOT an object of this issue, but the improvements introduced herein should serve as stepping stones towards a more cohesive experience.
Lotus doesn't natively offer different node profiles: "archive", "full node", etc. These are not formally specified, and spontaneously emerge based on how the user operates their node.
However, the reality is that each of these profiles requires different policies for store maintenance, which we aren't able to offer because we don't model the profiles to begin with.
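The profiles could likewise be modelled explicitly and mapped onto retention policies. The sketch below reuses the hypothetical RetentionPolicy values from the previous snippet; every name in it is invented, including the third profile.

```go
package config

// NodeProfile is a hypothetical, explicitly modelled node profile. Today
// these profiles only exist informally, in how operators happen to run lotus.
type NodeProfile string

const (
	ProfileArchive  NodeProfile = "archive"   // keep the full chain and all state, forever
	ProfileFullNode NodeProfile = "full-node" // keep the full chain, prune old state
	ProfileLite     NodeProfile = "lite"      // keep only recent chain and state (illustrative)
)

// DefaultRetention maps each profile to the retention policy it implies,
// reusing the hypothetical RetentionPolicy values sketched earlier.
func DefaultRetention(p NodeProfile) RetentionPolicy {
	switch p {
	case ProfileArchive:
		return ArchiveNode
	case ProfileFullNode:
		return HotNode
	default:
		return MinimalNode
	}
}
```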
specs-actors and ADTs do not collaborate with Lotus to advise/hint which objects (e.g. HAMT nodes) have gone out of scope, or been delinked, as a result of state transition.
We might need to propagate this information out to higher layers to enable refcounting, or other forms of tracking to feed into GC.
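As a strawman, the hint could be a small callback interface that the ADT/actor layer invokes whenever it replaces or unlinks a node. Nothing below exists in specs-actors or Lotus; it only illustrates the kind of plumbing being discussed.

```go
package store

import "github.com/ipfs/go-cid"

// DelinkHinter is a hypothetical interface that a blockstore (or a GC layer
// on top of it) could implement so that higher layers (e.g. the HAMT/AMT code
// used by specs-actors) can report objects that were replaced or unlinked
// during a state transition.
type DelinkHinter interface {
	// Delinked reports that the object identified by c is no longer
	// referenced by the state tree produced at the given epoch. The store may
	// use this to decrement a refcount or mark the block as a GC candidate.
	Delinked(epoch int64, c cid.Cid)
}

// refcountingHinter is a toy implementation that keeps an in-memory refcount
// and exposes the set of objects whose count has dropped to zero.
type refcountingHinter struct {
	refs map[cid.Cid]int
}

func newRefcountingHinter() *refcountingHinter {
	return &refcountingHinter{refs: make(map[cid.Cid]int)}
}

// Linked records a new reference to c (e.g. when a node is written).
func (h *refcountingHinter) Linked(c cid.Cid) { h.refs[c]++ }

// Delinked implements DelinkHinter.
func (h *refcountingHinter) Delinked(epoch int64, c cid.Cid) {
	if h.refs[c] > 0 {
		h.refs[c]--
	}
}

// Garbage returns the CIDs that are no longer referenced and can be fed to GC.
func (h *refcountingHinter) Garbage() []cid.Cid {
	var out []cid.Cid
	for c, n := range h.refs {
		if n == 0 {
			out = append(out, c)
		}
	}
	return out
}
```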
Lotus does not prune chain nor state objects from abandoned chain branches.
Consequences of status quo
The current state store keeps growing at a relatively unpredictable rate, because growth is influenced by dynamic factors like the number of messages, the kinds of messages, state transitions, chain branching, etc.
Users can run out of disk space unexpectedly, because badger allocates files like SST tables and vlogs in blocks; these allocations are triggered by write accumulation (which leads to flushing memtables onto L0) and by LSM level compactions.
When disk space runs out, badger corrupts the store, and terrible things happen, including panics: https://discuss.dgraph.io/t/badger-panics-with-index-out-of-range/11303
This situation is unsustainable.
Proposed solutions
✅ Segregate the chain and state stores into two entirely different blockstore domains, each of which can operate independently.
✅ Further divide each blockstore domain into two tiers: an active tier and an inactive tier.
✅ Implement an archival process; for the state store, every Finality (900 tipsets), do the following (a sketch of this marking/compaction loop appears after the sub-list):
asynchronously (in the background) walk the state tree, tracking all live CIDs in a bloom filter (if we keep a key count, we could size the bloom filter for a desired false-positive probability). Let's call this BF1.
while that happens, track all new CIDs that are being written (the "delta set") during that finality range. Let's call the delta set Δ1.
when the finality range elapses, start the process for the next finality range; let's call the resulting bloom filter and delta set BF2 and Δ2.
once 2×Finality has passed, iterate through the active tier store and copy all CIDs that do not match the bloom filter to the inactive tier. Tombstone those entries in the active tier.
the first time we do this it'll be quite expensive; subsequent runs will be much lighter.
mmapped B+ trees will take this workload much better in the active set. For badger and/or LSM trees, we'll need to Flatten/compact frequently to actually remove deleted entries.
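Here is that sketch of the marking/compaction loop, assuming a bloom filter library along the lines of bits-and-blooms/bloom and a simplified blockstore interface defined inline. The state-tree walk and the write-path hook are reduced to call sites, and none of these names reflect the eventual splitstore implementation.

```go
package splitstore

import (
	"context"

	"github.com/bits-and-blooms/bloom/v3"
	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
)

// Blockstore is a deliberately simplified view of the blockstore interface
// used by Lotus; it exists only to keep this sketch self-contained.
type Blockstore interface {
	Get(ctx context.Context, c cid.Cid) (blocks.Block, error)
	Put(ctx context.Context, b blocks.Block) error
	DeleteBlock(ctx context.Context, c cid.Cid) error // "tombstone" in the active tier
	AllKeysChan(ctx context.Context) (<-chan cid.Cid, error)
}

// markSet pairs the bloom filter built by the background state walk (BF1/BF2)
// with the delta set of CIDs written during the same finality range (Δ1/Δ2).
type markSet struct {
	bf    *bloom.BloomFilter
	delta map[cid.Cid]struct{}
}

// newMarkSet sizes the bloom filter for an estimated key count and a target
// false-positive probability, as suggested above.
func newMarkSet(estimatedKeys uint, fpp float64) *markSet {
	return &markSet{
		bf:    bloom.NewWithEstimates(estimatedKeys, fpp),
		delta: make(map[cid.Cid]struct{}),
	}
}

// markLive is called by the asynchronous state-tree walk for every live CID.
func (m *markSet) markLive(c cid.Cid) { m.bf.Add(c.Bytes()) }

// trackWrite is called from the write path for every new CID during the range.
func (m *markSet) trackWrite(c cid.Cid) { m.delta[c] = struct{}{} }

// live reports whether a CID was marked live or freshly written.
func (m *markSet) live(c cid.Cid) bool {
	if _, fresh := m.delta[c]; fresh {
		return true
	}
	return m.bf.Test(c.Bytes())
}

// compact runs once 2×Finality has elapsed: every block in the active tier
// that is neither marked live nor freshly written is copied to the inactive
// tier and then tombstoned in the active tier. Bloom false positives only
// cause us to retain a few extra blocks in the active tier; they never cause
// data loss, because cold blocks are copied before being deleted.
func compact(ctx context.Context, active, inactive Blockstore, marks *markSet) error {
	keys, err := active.AllKeysChan(ctx)
	if err != nil {
		return err
	}
	for c := range keys {
		if marks.live(c) {
			continue
		}
		blk, err := active.Get(ctx, c)
		if err != nil {
			return err
		}
		if err := inactive.Put(ctx, blk); err != nil {
			return err
		}
		if err := active.DeleteBlock(ctx, c); err != nil {
			return err
		}
	}
	return nil
}
```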
✅ Implement a tiered blockstore abstraction, such that we query the active tier and then the inactive tier, serially.
When archiving into the inactive store, tag each block with the epoch at which it was last active, or use some form of record striping. This enables us to create and configure retention policies as outlined in the Analysis section, e.g. "store up to 50000 epochs in the past". We can run periodic GC by iterating over the inactive store and discarding entries/stripes beyond the window. (A combined sketch of these two items follows below.)
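That combined sketch might look roughly like this, reusing the simplified Blockstore interface from the previous snippet. The sidecar CID→epoch index is just one possible encoding of the "last active at" tag, not a settled design.

```go
package splitstore

import (
	"context"

	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
)

// TieredBlockstore queries the active tier first and falls back to the
// inactive tier, serially.
type TieredBlockstore struct {
	active, inactive Blockstore
	// lastActive records the epoch at which each archived block was last
	// live. A real implementation would persist this index (or stripe the
	// inactive store by epoch) rather than hold it in memory.
	lastActive map[cid.Cid]int64
}

func NewTieredBlockstore(active, inactive Blockstore) *TieredBlockstore {
	return &TieredBlockstore{
		active:     active,
		inactive:   inactive,
		lastActive: make(map[cid.Cid]int64),
	}
}

// Get serves from the active tier and only consults the inactive tier on a miss.
func (t *TieredBlockstore) Get(ctx context.Context, c cid.Cid) (blocks.Block, error) {
	if blk, err := t.active.Get(ctx, c); err == nil {
		return blk, nil
	}
	return t.inactive.Get(ctx, c)
}

// Archive moves a block into the inactive tier, tagging it with the epoch at
// which it was last active so retention policies can reason about it later.
func (t *TieredBlockstore) Archive(ctx context.Context, blk blocks.Block, epoch int64) error {
	if err := t.inactive.Put(ctx, blk); err != nil {
		return err
	}
	t.lastActive[blk.Cid()] = epoch
	return t.active.DeleteBlock(ctx, blk.Cid())
}

// GC enforces a retention window such as "store up to 50000 epochs in the
// past": anything archived before (head - window) is discarded from the
// inactive tier.
func (t *TieredBlockstore) GC(ctx context.Context, head, window int64) error {
	cutoff := head - window
	for c, epoch := range t.lastActive {
		if epoch >= cutoff {
			continue
		}
		if err := t.inactive.DeleteBlock(ctx, c); err != nil {
			return err
		}
		delete(t.lastActive, c)
	}
	return nil
}
```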
Implement a fallback Bitswap trapdoor to fetch objects from the network in case something goes wrong, or the user requests an operation that requires access to chain/state beyond the retention window (Optional chain Bitswap #4717 might be a start).
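One way the trapdoor could be wired is as a blockstore wrapper that serves reads locally and only reaches out to the network on a miss. The BlockFetcher interface below is defined for the sketch and merely mirrors a Bitswap-style exchange in spirit; it is not the design proposed in #4717.

```go
package splitstore

import (
	"context"

	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
)

// BlockFetcher is the minimal piece of a Bitswap-style exchange this sketch
// needs: fetch a single block from the network by CID.
type BlockFetcher interface {
	GetBlock(ctx context.Context, c cid.Cid) (blocks.Block, error)
}

// FallbackBlockstore serves reads locally and, only when the block is missing
// locally (e.g. beyond the retention window), reaches out to the network.
// Fetched blocks are written back so the next access is local.
type FallbackBlockstore struct {
	local   Blockstore   // e.g. the TieredBlockstore from the previous sketch
	network BlockFetcher // e.g. a Bitswap session scoped to chain/state data
}

func (f *FallbackBlockstore) Get(ctx context.Context, c cid.Cid) (blocks.Block, error) {
	blk, err := f.local.Get(ctx, c)
	if err == nil {
		return blk, nil
	}
	// Local miss: try the network as a last resort.
	blk, nerr := f.network.GetBlock(ctx, c)
	if nerr != nil {
		return nil, err // surface the original local error
	}
	// Best-effort write-back; a failure here should not fail the read.
	_ = f.local.Put(ctx, blk)
	return blk, nil
}
```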
✅ Implement the migration, either as an in-place background process that runs inside Lotus, or as a dedicated external command that runs with exclusive access to the store (i.e. with Lotus stopped). The choice/feasibility will depend on the final solution design. (A sketch of the external-command variant follows below.)
Balance between fsync and no fsync at all.
✅ Memory watchdog (implement a memory watchdog #5058)
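For the migration item above, the external-command variant boils down to an exclusive, offline copy from the old monolithic store into the new segregated stores. The sketch reuses the simplified Blockstore interface from the earlier snippets, and the chain/state classification predicate is deliberately a stand-in, since deciding which store a block belongs to is exactly what the final design has to settle.

```go
package splitstore

import (
	"context"

	"github.com/ipfs/go-cid"
)

// Migrate copies every block out of the old monolithic store into the new
// chain and state stores. It assumes exclusive access to the stores (i.e.
// lotus stopped), as in the external-command variant discussed above.
//
// isChainData is a placeholder predicate: classifying blocks as chain data
// vs. state data is a design decision this sketch deliberately does not make.
func Migrate(ctx context.Context, monolith, chainStore, stateStore Blockstore,
	isChainData func(cid.Cid) bool) error {

	keys, err := monolith.AllKeysChan(ctx)
	if err != nil {
		return err
	}
	for c := range keys {
		blk, err := monolith.Get(ctx, c)
		if err != nil {
			return err
		}
		dst := stateStore
		if isChainData(c) {
			dst = chainStore
		}
		if err := dst.Put(ctx, blk); err != nil {
			return err
		}
	}
	return nil
}
```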
Caveats
Some commands allow the user to override the chain, e.g. set head/follow/mark-bad, etc. We need to discuss how those commands would affect what's being laid out here.
raulk changed the title from "chain/state store improvements: segregation, two-tier stores, retention windows, archival, and more" to "chain/state store improvements: REDESIGN (segregation, two-tier stores, retention windows, archival, and more)" on Nov 6, 2020.
Segregate the chain and state stores into two entirely different blockstore domains
For some more context, this is actually desirable/required semantics for the runtime store abstraction presented to the actors. The desired semantics are actually even tougher: they require a consistent view of state that prevents an actor from Get()ing a block unless it was Put() and is transitively reachable from the state root in the blockchain history/fork that's actually being evaluated.
This isn't something you need to immediately worry about because, as the issue notes:
given our control of the built-in actor code, we can ensure that the semantics are indistinguishable from having no views, transactions, or garbage collection
But it's something to keep in mind, and ideally something to make more possible, rather than less possible, for future implementation alongside end-user contracts.
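To illustrate the consistent-view semantics described in this comment, here is a sketch of a scoped store view that only serves Gets for blocks Put through the same view or reachable from the state root being evaluated. The reachability check is left abstract, and none of this corresponds to actual specs-actors or Lotus runtime interfaces.

```go
package vm

import (
	"context"
	"fmt"

	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
)

// blockIO is the minimal store surface this sketch needs; it mirrors the
// simplified Get/Put signatures used in the earlier sketches.
type blockIO interface {
	Get(ctx context.Context, c cid.Cid) (blocks.Block, error)
	Put(ctx context.Context, b blocks.Block) error
}

// Reachable answers whether c is transitively reachable from root. How this
// is computed (walking IPLD links, consulting a precomputed mark set, ...) is
// deliberately left open.
type Reachable func(ctx context.Context, root, c cid.Cid) (bool, error)

// ScopedStore is a sketch of the store view handed to actors while a single
// state transition is evaluated: Get only succeeds for blocks that were Put
// through this view, or that are reachable from the state root of the
// history/fork actually being evaluated.
type ScopedStore struct {
	backing   blockIO
	stateRoot cid.Cid
	reachable Reachable
	written   map[cid.Cid]struct{}
}

func NewScopedStore(backing blockIO, root cid.Cid, r Reachable) *ScopedStore {
	return &ScopedStore{
		backing:   backing,
		stateRoot: root,
		reachable: r,
		written:   make(map[cid.Cid]struct{}),
	}
}

func (s *ScopedStore) Put(ctx context.Context, b blocks.Block) error {
	if err := s.backing.Put(ctx, b); err != nil {
		return err
	}
	s.written[b.Cid()] = struct{}{}
	return nil
}

func (s *ScopedStore) Get(ctx context.Context, c cid.Cid) (blocks.Block, error) {
	if _, ok := s.written[c]; !ok {
		reachable, err := s.reachable(ctx, s.stateRoot, c)
		if err != nil {
			return nil, err
		}
		if !reachable {
			return nil, fmt.Errorf("block %s is not part of the state view being evaluated", c)
		}
	}
	return s.backing.Get(ctx, c)
}
```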
raulk changed the title from "chain/state store improvements: REDESIGN (segregation, two-tier stores, retention windows, archival, and more)" to "chain/state store improvements: REDESIGN (segregation, two-tier stores, archival, etc)" on Mar 11, 2021.
Now that the splitstore has shipped as an experiment in v1.5.1, and the memory watchdog has been active and silently keeping memory utilisation within bounds for a few releases, this epic can finally be closed. There are two offshoot threads that are tracked separately: