Analysis of status quo
Right now, chain data and state data live in a ginormous monolithic store.
This store only grows; it never shrinks. Lotus performs no active management of this store, and no GC.
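For reference, the underlying Badger store only reclaims space when value-log GC is invoked explicitly, which Lotus does not do today. Below is a minimal sketch of what a periodic GC loop against a Badger-backed store could look like; the path, interval, and discard ratio are arbitrary assumptions.

```go
package main

import (
	"log"
	"time"

	badger "github.com/dgraph-io/badger/v2"
)

// runPeriodicGC repeatedly asks Badger to rewrite value-log files whose
// discardable fraction exceeds discardRatio. Badger only reclaims space when
// this is invoked explicitly.
func runPeriodicGC(db *badger.DB, interval time.Duration, discardRatio float64) {
	for range time.Tick(interval) {
		// Keep collecting until Badger reports there is nothing left to rewrite.
		for {
			err := db.RunValueLogGC(discardRatio)
			if err == badger.ErrNoRewrite {
				break
			}
			if err != nil {
				log.Printf("value log GC failed: %s", err)
				break
			}
		}
	}
}

func main() {
	// The path is illustrative, not the actual lotus datastore location.
	db, err := badger.Open(badger.DefaultOptions("/var/lib/lotus/chain.db"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	go runPeriodicGC(db, 30*time.Minute, 0.5)
	select {} // stand-in for the node's main loop
}
```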
Lotus doesn't offer configurable retention policies. The user can't specify, for example (a hypothetical configuration sketch follows this list):
please only retain the full chain, but only state objects from the last 50000 epochs
please only retain chain and state data from the current and the previous finality ranges
please retain everything
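To make this concrete, here is the hypothetical configuration sketch referred to above. None of these types or fields exist in Lotus today; the names are invented purely for illustration.

```go
package config

// RetentionPolicy is a hypothetical per-store retention configuration. It is
// not part of the current Lotus config; it only illustrates the kind of knobs
// discussed above.
type RetentionPolicy struct {
	// RetainChain keeps the full chain (headers, messages, receipts) when true.
	RetainChain bool
	// StateEpochs keeps state objects for the most recent N epochs.
	// 0 means "retain everything".
	StateEpochs uint64
	// ChainEpochs optionally bounds chain data as well; 0 means unbounded.
	ChainEpochs uint64
}

// Example policies matching the wishes listed above.
var (
	// "retain the full chain, but only state objects from the last 50000 epochs"
	HotNode = RetentionPolicy{RetainChain: true, StateEpochs: 50000}
	// "retain chain and state data only from the current and previous finality ranges"
	MinimalNode = RetentionPolicy{ChainEpochs: 2 * 900, StateEpochs: 2 * 900}
	// "retain everything"
	ArchiveNode = RetentionPolicy{RetainChain: true}
)
```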
We do not offer integration with snapshot services. The current snapshot facility is rather makeshift and requires very active user intervention (manually download a snapshot, stop the node, back up the store, replace it, start the node, etc.). The user experience is... improvable.
Snapshots are becoming a necessary element in Filecoin. The experience needs to be integrated end-to-end from a product perspective.
That said, snapshots are NOT an object of this issue, but the improvements introduced herein should serve as stepping stones towards a more cohesive experience.
Lotus doesn't natively offer different node profiles: "archive", "full node", etc. These are not formally specified, and spontaneously emerge based on how the user operates their node.
However, the reality is that each of these profiles requires different policies for store maintenance, which we aren't able to offer because we don't model the profiles to begin with.
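The profiles could likewise be modelled explicitly and mapped onto retention policies. The sketch below reuses the hypothetical RetentionPolicy values from the previous snippet; every name in it is invented, including the third profile.

```go
package config

// NodeProfile is a hypothetical, explicitly modelled node profile. Today
// these profiles only exist informally, in how operators happen to run lotus.
type NodeProfile string

const (
	ProfileArchive  NodeProfile = "archive"   // keep the full chain and all state, forever
	ProfileFullNode NodeProfile = "full-node" // keep the full chain, prune old state
	ProfileLite     NodeProfile = "lite"      // keep only recent chain and state (illustrative)
)

// DefaultRetention maps each profile to the retention policy it implies,
// reusing the hypothetical RetentionPolicy values sketched earlier.
func DefaultRetention(p NodeProfile) RetentionPolicy {
	switch p {
	case ProfileArchive:
		return ArchiveNode
	case ProfileFullNode:
		return HotNode
	default:
		return MinimalNode
	}
}
```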
specs-actors and ADTs do not collaborate with Lotus to advise/hint which objects (e.g. HAMT nodes) have gone out of scope, or been delinked, as a result of state transition.
We might need to propagate this information out to higher layers to enable refcounting, or other forms of tracking to feed into GC.
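As a strawman, the hint could be a small callback interface that the ADT/actor layer invokes whenever it replaces or unlinks a node. Nothing below exists in specs-actors or Lotus; it only illustrates the kind of plumbing being discussed.

```go
package store

import "github.com/ipfs/go-cid"

// DelinkHinter is a hypothetical interface that a blockstore (or a GC layer
// on top of it) could implement so that higher layers (e.g. the HAMT/AMT code
// used by specs-actors) can report objects that were replaced or unlinked
// during a state transition.
type DelinkHinter interface {
	// Delinked reports that the object identified by c is no longer
	// referenced by the state tree produced at the given epoch. The store may
	// use this to decrement a refcount or mark the block as a GC candidate.
	Delinked(epoch int64, c cid.Cid)
}

// refcountingHinter is a toy implementation that keeps an in-memory refcount
// and exposes the set of objects whose count has dropped to zero.
type refcountingHinter struct {
	refs map[cid.Cid]int
}

func newRefcountingHinter() *refcountingHinter {
	return &refcountingHinter{refs: make(map[cid.Cid]int)}
}

// Linked records a new reference to c (e.g. when a node is written).
func (h *refcountingHinter) Linked(c cid.Cid) { h.refs[c]++ }

// Delinked implements DelinkHinter.
func (h *refcountingHinter) Delinked(epoch int64, c cid.Cid) {
	if h.refs[c] > 0 {
		h.refs[c]--
	}
}

// Garbage returns the CIDs that are no longer referenced and can be fed to GC.
func (h *refcountingHinter) Garbage() []cid.Cid {
	var out []cid.Cid
	for c, n := range h.refs {
		if n == 0 {
			out = append(out, c)
		}
	}
	return out
}
```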
Lotus does not prune chain nor state objects from abandoned chain branches.
Consequences of status quo
The current state store keeps growing at a relatively unpredictable rate, because growth is influenced by dynamic factors like the number of messages, the kinds of messages, state transitions, chain branching, etc.
Users can run out of disk space unexpectedly, because badger allocates files like SST tables and vlogs in blocks; these allocations are triggered by write accumulation (which leads to flushing memtables onto L0) and by LSM level compactions.
When disk space runs out, badger corrupts the store, and terrible things happen, including panics: https://discuss.dgraph.io/t/badger-panics-with-index-out-of-range/11303
This situation is unsustainable.
Proposed solutions
✅ Segregate the chain and state stores into two entirely different blockstore domains, each of which can operate independently.
✅ Further divide each blockstore domain into two tiers: an active tier and an inactive tier.
✅ Implement an archival process; for the state store, every Finality (900 tipsets), do the following (a sketch of this marking/compaction loop appears after the sub-list):
asynchronously (in the background) walk the state tree, tracking all live CIDs in a bloom filter (if we keep a key count, we could size the bloom filter for a desired false-positive probability). Let's call this BF1.
while that happens, track all new CIDs that are being written (the "delta set") during that finality range. Let's call the delta set Δ1.
when the finality range elapses, start the process for the next finality range; let's call the resulting bloom filter and delta set BF2 and Δ2.
once 2×Finality has passed, iterate through the active tier store and copy all CIDs that do not match the bloom filter to the inactive tier. Tombstone those entries in the active tier.
the first time we do this it'll be quite expensive; subsequent runs will be much lighter.
mmapped B+ trees will take this workload much better in the active set. For badger and/or LSM trees, we'll need to Flatten/compact frequently to actually remove deleted entries.
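Here is that sketch of the marking/compaction loop, assuming a bloom filter library along the lines of bits-and-blooms/bloom and a simplified blockstore interface defined inline. The state-tree walk and the write-path hook are reduced to call sites, and none of these names reflect the eventual splitstore implementation.

```go
package splitstore

import (
	"context"

	"github.com/bits-and-blooms/bloom/v3"
	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
)

// Blockstore is a deliberately simplified view of the blockstore interface
// used by Lotus; it exists only to keep this sketch self-contained.
type Blockstore interface {
	Get(ctx context.Context, c cid.Cid) (blocks.Block, error)
	Put(ctx context.Context, b blocks.Block) error
	DeleteBlock(ctx context.Context, c cid.Cid) error // "tombstone" in the active tier
	AllKeysChan(ctx context.Context) (<-chan cid.Cid, error)
}

// markSet pairs the bloom filter built by the background state walk (BF1/BF2)
// with the delta set of CIDs written during the same finality range (Δ1/Δ2).
type markSet struct {
	bf    *bloom.BloomFilter
	delta map[cid.Cid]struct{}
}

// newMarkSet sizes the bloom filter for an estimated key count and a target
// false-positive probability, as suggested above.
func newMarkSet(estimatedKeys uint, fpp float64) *markSet {
	return &markSet{
		bf:    bloom.NewWithEstimates(estimatedKeys, fpp),
		delta: make(map[cid.Cid]struct{}),
	}
}

// markLive is called by the asynchronous state-tree walk for every live CID.
func (m *markSet) markLive(c cid.Cid) { m.bf.Add(c.Bytes()) }

// trackWrite is called from the write path for every new CID during the range.
func (m *markSet) trackWrite(c cid.Cid) { m.delta[c] = struct{}{} }

// live reports whether a CID was marked live or freshly written.
func (m *markSet) live(c cid.Cid) bool {
	if _, fresh := m.delta[c]; fresh {
		return true
	}
	return m.bf.Test(c.Bytes())
}

// compact runs once 2×Finality has elapsed: every block in the active tier
// that is neither marked live nor freshly written is copied to the inactive
// tier and then tombstoned in the active tier. Bloom false positives only
// cause us to retain a few extra blocks in the active tier; they never cause
// data loss, because cold blocks are copied before being deleted.
func compact(ctx context.Context, active, inactive Blockstore, marks *markSet) error {
	keys, err := active.AllKeysChan(ctx)
	if err != nil {
		return err
	}
	for c := range keys {
		if marks.live(c) {
			continue
		}
		blk, err := active.Get(ctx, c)
		if err != nil {
			return err
		}
		if err := inactive.Put(ctx, blk); err != nil {
			return err
		}
		if err := active.DeleteBlock(ctx, c); err != nil {
			return err
		}
	}
	return nil
}
```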
✅ Implement a tiered blockstore abstraction, such that we query the active tier and then the inactive tier, serially.
When archiving into the inactive store, tag each block with the epoch at which it was last active, or use some form of record striping. This enables us to create and configure retention policies as outlined in the Analysis section, e.g. "store up to 50000 epochs in the past". We can run periodic GC by iterating over the inactive store and discarding entries/stripes beyond the window. (A combined sketch of these two items follows below.)
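That combined sketch might look roughly like this, reusing the simplified Blockstore interface from the previous snippet. The sidecar CID→epoch index is just one possible encoding of the "last active at" tag, not a settled design.

```go
package splitstore

import (
	"context"

	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
)

// TieredBlockstore queries the active tier first and falls back to the
// inactive tier, serially.
type TieredBlockstore struct {
	active, inactive Blockstore
	// lastActive records the epoch at which each archived block was last
	// live. A real implementation would persist this index (or stripe the
	// inactive store by epoch) rather than hold it in memory.
	lastActive map[cid.Cid]int64
}

func NewTieredBlockstore(active, inactive Blockstore) *TieredBlockstore {
	return &TieredBlockstore{
		active:     active,
		inactive:   inactive,
		lastActive: make(map[cid.Cid]int64),
	}
}

// Get serves from the active tier and only consults the inactive tier on a miss.
func (t *TieredBlockstore) Get(ctx context.Context, c cid.Cid) (blocks.Block, error) {
	if blk, err := t.active.Get(ctx, c); err == nil {
		return blk, nil
	}
	return t.inactive.Get(ctx, c)
}

// Archive moves a block into the inactive tier, tagging it with the epoch at
// which it was last active so retention policies can reason about it later.
func (t *TieredBlockstore) Archive(ctx context.Context, blk blocks.Block, epoch int64) error {
	if err := t.inactive.Put(ctx, blk); err != nil {
		return err
	}
	t.lastActive[blk.Cid()] = epoch
	return t.active.DeleteBlock(ctx, blk.Cid())
}

// GC enforces a retention window such as "store up to 50000 epochs in the
// past": anything archived before (head - window) is discarded from the
// inactive tier.
func (t *TieredBlockstore) GC(ctx context.Context, head, window int64) error {
	cutoff := head - window
	for c, epoch := range t.lastActive {
		if epoch >= cutoff {
			continue
		}
		if err := t.inactive.DeleteBlock(ctx, c); err != nil {
			return err
		}
		delete(t.lastActive, c)
	}
	return nil
}
```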
Implement a fallback Bitswap trapdoor to fetch objects from the network in case something goes wrong, or the user requests an operation that requires access to chain/state beyond the retention window (Optional chain Bitswap #4717 might be a start).
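One way the trapdoor could be wired is as a blockstore wrapper that serves reads locally and only reaches out to the network on a miss. The BlockFetcher interface below is defined for the sketch and merely mirrors a Bitswap-style exchange in spirit; it is not the design proposed in #4717.

```go
package splitstore

import (
	"context"

	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
)

// BlockFetcher is the minimal piece of a Bitswap-style exchange this sketch
// needs: fetch a single block from the network by CID.
type BlockFetcher interface {
	GetBlock(ctx context.Context, c cid.Cid) (blocks.Block, error)
}

// FallbackBlockstore serves reads locally and, only when the block is missing
// locally (e.g. beyond the retention window), reaches out to the network.
// Fetched blocks are written back so the next access is local.
type FallbackBlockstore struct {
	local   Blockstore   // e.g. the TieredBlockstore from the previous sketch
	network BlockFetcher // e.g. a Bitswap session scoped to chain/state data
}

func (f *FallbackBlockstore) Get(ctx context.Context, c cid.Cid) (blocks.Block, error) {
	blk, err := f.local.Get(ctx, c)
	if err == nil {
		return blk, nil
	}
	// Local miss: try the network as a last resort.
	blk, nerr := f.network.GetBlock(ctx, c)
	if nerr != nil {
		return nil, err // surface the original local error
	}
	// Best-effort write-back; a failure here should not fail the read.
	_ = f.local.Put(ctx, blk)
	return blk, nil
}
```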
✅ Implement the migration, either as an in-place background process that runs inside Lotus, or as a dedicated external command that runs with exclusive access to the store (i.e. with Lotus stopped). The choice/feasibility will depend on the final solution design. (A sketch of the external-command variant follows below.)
Balance between fsync and no fsync at all.
✅ Memory watchdog (implement a memory watchdog #5058)
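For the migration item above, the external-command variant boils down to an exclusive, offline copy from the old monolithic store into the new segregated stores. The sketch reuses the simplified Blockstore interface from the earlier snippets, and the chain/state classification predicate is deliberately a stand-in, since deciding which store a block belongs to is exactly what the final design has to settle.

```go
package splitstore

import (
	"context"

	"github.com/ipfs/go-cid"
)

// Migrate copies every block out of the old monolithic store into the new
// chain and state stores. It assumes exclusive access to the stores (i.e.
// lotus stopped), as in the external-command variant discussed above.
//
// isChainData is a placeholder predicate: classifying blocks as chain data
// vs. state data is a design decision this sketch deliberately does not make.
func Migrate(ctx context.Context, monolith, chainStore, stateStore Blockstore,
	isChainData func(cid.Cid) bool) error {

	keys, err := monolith.AllKeysChan(ctx)
	if err != nil {
		return err
	}
	for c := range keys {
		blk, err := monolith.Get(ctx, c)
		if err != nil {
			return err
		}
		dst := stateStore
		if isChainData(c) {
			dst = chainStore
		}
		if err := dst.Put(ctx, blk); err != nil {
			return err
		}
	}
	return nil
}
```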
Caveats
Some commands allow the user to override the chain, e.g. set head/follow/mark-bad, etc. We need to discuss how those commands would affect what's being laid out here.
raulk changed the title from "chain/state store improvements: segregation, two-tier stores, retention windows, archival, and more" to "chain/state store improvements: REDESIGN (segregation, two-tier stores, retention windows, archival, and more)" on Nov 6, 2020.
Segregate the chain and state stores into two entirely different blockstore domains
For some more context, this is actually desirable/required semantics for the runtime store abstraction presented to the actors. The desired semantics are actually even tougher: they require a consistent view of state that prevents an actor from Get()ing a block unless it was Put() and is transitively reachable from the state root in the blockchain history/fork that's actually being evaluated.
This isn't something you need to immediately worry about because, as the issue notes:
given our control of the built-in actor code, we can ensure that the semantics are indistinguishable from having no views, transactions, or garbage collection
But it's something to keep in mind, and ideally something to make more possible, rather than less possible, for future implementation alongside end-user contracts.
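To illustrate the consistent-view semantics described in this comment, here is a sketch of a scoped store view that only serves Gets for blocks Put through the same view or reachable from the state root being evaluated. The reachability check is left abstract, and none of this corresponds to actual specs-actors or Lotus runtime interfaces.

```go
package vm

import (
	"context"
	"fmt"

	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
)

// blockIO is the minimal store surface this sketch needs; it mirrors the
// simplified Get/Put signatures used in the earlier sketches.
type blockIO interface {
	Get(ctx context.Context, c cid.Cid) (blocks.Block, error)
	Put(ctx context.Context, b blocks.Block) error
}

// Reachable answers whether c is transitively reachable from root. How this
// is computed (walking IPLD links, consulting a precomputed mark set, ...) is
// deliberately left open.
type Reachable func(ctx context.Context, root, c cid.Cid) (bool, error)

// ScopedStore is a sketch of the store view handed to actors while a single
// state transition is evaluated: Get only succeeds for blocks that were Put
// through this view, or that are reachable from the state root of the
// history/fork actually being evaluated.
type ScopedStore struct {
	backing   blockIO
	stateRoot cid.Cid
	reachable Reachable
	written   map[cid.Cid]struct{}
}

func NewScopedStore(backing blockIO, root cid.Cid, r Reachable) *ScopedStore {
	return &ScopedStore{
		backing:   backing,
		stateRoot: root,
		reachable: r,
		written:   make(map[cid.Cid]struct{}),
	}
}

func (s *ScopedStore) Put(ctx context.Context, b blocks.Block) error {
	if err := s.backing.Put(ctx, b); err != nil {
		return err
	}
	s.written[b.Cid()] = struct{}{}
	return nil
}

func (s *ScopedStore) Get(ctx context.Context, c cid.Cid) (blocks.Block, error) {
	if _, ok := s.written[c]; !ok {
		reachable, err := s.reachable(ctx, s.stateRoot, c)
		if err != nil {
			return nil, err
		}
		if !reachable {
			return nil, fmt.Errorf("block %s is not part of the state view being evaluated", c)
		}
	}
	return s.backing.Get(ctx, c)
}
```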
raulk changed the title from "chain/state store improvements: REDESIGN (segregation, two-tier stores, retention windows, archival, and more)" to "chain/state store improvements: REDESIGN (segregation, two-tier stores, archival, etc)" on Mar 11, 2021.
Now that the splitstore has shipped as an experiment in v1.5.1, and the memory watchdog has been active and silently keeping memory utilisation within bounds for a few releases, this epic can finally be closed. There are two offshoot threads that are tracked separately: