chain/state store improvements: REDESIGN (segregation, two-tier stores, archival, etc) #4753

Closed
raulk opened this issue Nov 6, 2020 · 2 comments
Assignees
Labels
area/chain/state effort/weeks Effort: Multiple Weeks P2 P2: Should be resolved

Comments

@raulk
Member

raulk commented Nov 6, 2020

Analysis of status quo

  • Right now, chain data and state data live in a ginormous monolith store.
  • This store only grows; it never shrinks. Lotus performs no active management of this store, and no GC.
  • Lotus doesn't offer configurable retention policies. The user can't specify:
    • please only retain the full chain, but only state objects from the last 50000 epochs
    • please only retain chain and state data from this and the last finality ranges
    • please retain everything
  • We do not offer integration with snapshot services. The current snapshot facility is rather makeshift and it requires very active user intervention (manually download a snapshot, stop the node, backup store, replace, start node, etc.) The user experience is... improvable.
    • Snapshots are becoming a necessary element in Filecoin. The experience needs to be integrated end-to-end from a product perspective.
    • That said, snapshots are NOT an object of this issue, but the improvements introduced herein should serve as stepping stones towards a more cohesive experience.
  • Lotus doesn't natively offer different node profiles: "archive", "full node", etc. These are not formally specified, and spontaneously emerge based on how the user operates their node.
    • However, the reality is that each of these profiles requires different policies for store maintenance, which we aren't able to offer because we don't model the profiles to begin with.
  • specs-actors and ADTs do not collaborate with Lotus to advise/hint which objects (e.g. HAMT nodes) have gone out of scope, or been delinked, as a result of a state transition.
    • We might need to propagate this information out to higher layers to enable refcounting, or other forms of tracking to feed into GC.
  • Lotus does not prune chain nor state objects from abandoned chain branches.
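
To make the retention policies listed above concrete, here is a minimal sketch of how they could be modelled. It is illustrative only and written in Python for brevity, not Lotus code: `RetentionPolicy`, `retains`, and the three preset names are hypothetical, and treating "this and the last finality ranges" as a window of 2 x 900 epochs is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

FINALITY = 900  # epochs per finality range, as used in this issue


@dataclass(frozen=True)
class RetentionPolicy:
    """Hypothetical retention policy; a window of None means 'retain forever'."""
    chain_epochs: Optional[int]
    state_epochs: Optional[int]

    def retains(self, kind: str, epoch: int, head: int) -> bool:
        """Return True if an object of `kind` last active at `epoch` is kept at `head`."""
        if kind == "chain":
            window = self.chain_epochs
        elif kind == "state":
            window = self.state_epochs
        else:
            raise ValueError(f"unknown object kind: {kind!r}")
        return window is None or head - epoch <= window


# The three example policies from the list above (names are assumptions).
FULL_CHAIN_RECENT_STATE = RetentionPolicy(chain_epochs=None, state_epochs=50_000)
LAST_TWO_FINALITIES = RetentionPolicy(chain_epochs=2 * FINALITY, state_epochs=2 * FINALITY)
ARCHIVE = RetentionPolicy(chain_epochs=None, state_epochs=None)
```

A node profile ("archive", "full node", etc.) could then be little more than a named policy, and a GC pass would drop any object for which `retains` returns False.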

Consequences of status quo

  • The current state store keeps growing at a relatively unpredictable rate, because it is influenced by various dynamic factors such as the number of messages, kinds of messages, state transitions, chain branching, etc.

  • Users can run out of disk space unexpectedly, because badger allocates files such as SSTs and vlogs in blocks. These allocations are triggered by accumulated writes (which lead to memtables being flushed to L0) and by LSM level compactions.

  • When disk space runs out, badger corrupts the store, and terrible things happen, including panics: https://discuss.dgraph.io/t/badger-panics-with-index-out-of-range/11303

  • This situation is unsustainable.

Proposed solutions

  1. ✅ Segregate the chain and state stores into two entirely different blockstore domains, each of which can operate with:

  2. ✅ Further divide each blockstore domain in two tiers:

  3. ✅ Implement an archival process; for the state store, every Finality tipsets (900):

    • asynchronously (background) walk the state tree, tracking all live CIDs in a bloom filter (if we keep a key count, we could size the bloom filter for a specific desired FPP rate). Let's call this BF1.
    • while that happens, track all new CIDs that are being written ("delta set"), during that Finality range. Let's call the delta set Δ1.
    • when the finality range elapses, start the process for the next finality range; let's call the resulting bloom filter and delta set BF2 and Δ2.
    • once 2xFinality have passed, iterate through the active tier store and copy all CIDs that do not match the bloom filter to the inactive tier. Tombstone those entries in the active tier.
    • the first time we do this, it'll be quite expensive. Next times it'll become way lighter.
    • mmapped B+ trees will take this workload much better in the active set. For badger and/or LSM trees, we'll need to Flatten/Compact frequently to actually remove deleted entries.
    • hot/cold blockstore segregation (aka. splitstore) #4992
  4. ✅ Implement a tiered blockstore abstraction, such that we query the active tier and then the inactive tier serially.

  5. When archiving into the inactive store, tag each block with the epoch it was last active at, or use some form of record striping. This enables us to create and configure retention policies as outlined in the Analysis section, e.g. "store up to 50000 epochs in the past". We can run periodic GC by iterating over the inactive store and discarding entries/stripes beyond the window.

  6. Implement a fallback Bitswap trapdoor to fetch objects from the network in case something goes wrong, or the user requests an operation that requires access to chain/state beyond the retention window (Optional chain Bitswap #4717 might be a start).

  7. ✅ Implement the migration, either as an in-place, background process that runs inside Lotus, or as a dedicated external command that runs with exclusive access to the store (i.e. Lotus stopped). The choice/feasibility will depend on the final solution design.

  8. Balance between fsync and no fsync at all.

  9. ✅ Memory watchdog (implement a memory watchdog #5058).
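
As a rough illustration of points 3 and 4 above, the following sketch combines a bloom filter of live CIDs, a delta set of recently written CIDs, and a two-tier store that is queried serially. It is written in Python for brevity and is not the splitstore implementation: `BloomFilter`, `TieredStore`, and `archive` are hypothetical names, in-memory dicts stand in for the real blockstores, and a single filter plus a single delta set simplify the BF1/BF2 and Δ1/Δ2 scheme described above.

```python
import hashlib
import math


class BloomFilter:
    """Bloom filter sized from an expected key count and a target false-positive rate."""

    def __init__(self, expected_keys: int, fpp: float):
        n = max(1, expected_keys)
        # Standard sizing: m = -n*ln(p)/ln(2)^2 bits, k = (m/n)*ln(2) hash functions.
        self.m = max(8, math.ceil(-n * math.log(fpp) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: bytes):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


class TieredStore:
    """Two tiers: writes go to the active tier; reads try active, then inactive."""

    def __init__(self):
        self.active = {}
        self.inactive = {}

    def put(self, cid: bytes, block: bytes) -> None:
        self.active[cid] = block

    def get(self, cid: bytes) -> bytes:
        if cid in self.active:
            return self.active[cid]
        return self.inactive[cid]  # raises KeyError if the block is in neither tier


def archive(store: TieredStore, live: BloomFilter, delta: set) -> int:
    """Move blocks that are neither live nor recently written to the inactive tier."""
    cold = [cid for cid in store.active if cid not in live and cid not in delta]
    for cid in cold:
        store.inactive[cid] = store.active.pop(cid)
    return len(cold)
```

Because a bloom filter has false positives but no false negatives, this errs on the safe side: a dead block may occasionally stay in the active tier for another cycle, but a live block is never moved out.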

Caveats

  • Some commands allow the user to override the chain, e.g. set head/follow/mark-bad, etc. We need to discuss how those commands would affect what's being laid out here.
@raulk raulk changed the title from "chain/state store improvements: segregation, two-tier stores, retention windows, archival, and more" to "chain/state store improvements: REDESIGN (segregation, two-tier stores, retention windows, archival, and more)" Nov 6, 2020
@jennijuju jennijuju added the need/team-input Hint: Needs Team Input label Nov 6, 2020
@anorth
Member

anorth commented Nov 8, 2020

Segregate the chain and state stores into two entirely different blockstore domains

For some more context, this is actually desirable/required semantics for the runtime store abstraction presented to the actors. Desired semantics are actually even tougher, requiring a consistent view of state that should prevent an actor Get()ing a block that was not Put() and transitively reachable from the state root in the blockchain history/fork that's actually being evaluated.

This isn't something you need to immediately worry about because, as the issue notes:

given our control of the built-in actor code, we can ensure that the semantics are indistinguishable from having no views, transactions, or garbage collection

But it's something to keep in mind, and ideally make more possible, rather than less possible, for future implementation along with end-user contracts.

@jennijuju jennijuju added the effort/weeks Effort: Multiple Weeks label Nov 11, 2020
@jennijuju jennijuju added this to the Blockstore Improvements milestone Nov 25, 2020
@jennijuju jennijuju added the P2 P2: Should be resolved label Nov 25, 2020
@jennijuju jennijuju added need/analysis Hint: Needs Analysis and removed need/team-input Hint: Needs Team Input labels Jan 11, 2021
@raulk raulk changed the title from "chain/state store improvements: REDESIGN (segregation, two-tier stores, retention windows, archival, and more)" to "chain/state store improvements: REDESIGN (segregation, two-tier stores, archival, etc)" Mar 11, 2021
@raulk
Member Author

raulk commented Mar 11, 2021

Now that the splitstore has shipped as an experiment in v1.5.1, and the memory watchdog has been active and silently keeping memory utilisation within bounds for a few releases, this epic can finally be closed. There are two offshoot threads that are tracked separately:

@raulk raulk closed this as completed Mar 11, 2021
@TippyFlitsUK TippyFlitsUK removed the need/analysis Hint: Needs Analysis label Mar 30, 2022
5 participants