feat: Retroactive prune very old blocks from store #11064

chillyvee · 2022-01-30T10:49:57Z

Description

Validators using the default pruning config may end up with many old blocks due to pruning-keep-every = "500"

This is a proof of concept showing how a gradual pruning of much older state can be removed without needing to unsafe-reset-all

Questions

Is there any danger to pruning older blocks that are no longer necessary for a "pure" validator? (no API role)

Is extending PruningOptions to include new options from app.toml desirable?

What is the right way to log from store/rootmulti/store.go?

Are we able to somehow see the [state-sync] snapshot-interval / snapshot-keep-recent variables from store/rootmulti/store.go to avoid removing required blocks?

Is there any extensive testing that is desired when implementing this change? At the core it only adds additional heights to be pruned which is not a very special operation.

Since each chain uses a different cosmos-sdk version, is there a proper release/v0.##.x branch(es) to target first/later?

Author Checklist

I have...

Reviewers Checklist

I have...

confirmed the correct type prefix in the PR title
confirmed ! in the type prefix if API or client breaking change
confirmed all author checklist items have been addressed
reviewed state machine logic
reviewed API design and naming
reviewed documentation is accurate
reviewed tests and test coverage
manually tested (if applicable)

blushi · 2022-01-31T16:13:32Z

is this still WIP? If it is, please turn this into a draft PR, thanks.

tac0turtle · 2022-01-31T16:47:31Z

@chillyvee could we first open against master then can back port it here

alexanderbez · 2022-04-14T09:48:34Z

Are you thinking table-driven like this? https://github.com/cosmos/cosmos-sdk/blob/master/store/rootmulti/store_test.go#L472

With testCases := []struct { THIS_TABLE? }

Yes, exactly. You'll find them all throughout the SDK.

chillyvee · 2022-04-14T10:03:30Z

Let me try and if I get stuck, I'll let you know. Might have time next week. Thanks!

…e heights from pruneHist

chillyvee · 2022-04-20T17:39:42Z

Okay some progress, and an opportunity for feedback:

Added a few tests (They pass)
TestMultiStore_PruningHistoricalEarlyFinish - Tests that pruning doesn't kill early blockchains
TestMultiStore_PruningHistoricalDecrease - Test KeepRecent setting from 100->10
TestMultiStore_PruningHistoricalIncrease - Test KeepRecent setting from 10->30

Didn't do it table driven since it's really hard to understand in a table format. But Increase and Decrease as a hard test case is more obvious.

Adjusted TestMultiStore_PruningRestart - historical prune can duplicate heights to prune that are normally added to pruneHeights. If deleting a height twice is a really bad performance hit on the db layer, we can check for duplicates ahead of time before adding heights to the pruneHeights array.

There is a local function variable "debuginfo" that you can set to true to print debug to understand what is going on (kind of). I noticed there is no logging infra and wasn't sure if that was intentional. In any case, happy to remove all prints in the final commit after we agree function is okay.

Not sure if the variable Store.histPruneAmount needs to be configurable. It defines the number of historical heights to "additionally prune" in the backwards direction each "Interval".

…it for now

chillyvee · 2022-04-30T10:43:09Z

Checking in to see if there's any additional feedback to make this PR even better :)

alexanderbez · 2022-05-02T14:05:00Z

@p0mvn did some excellent work in refactoring the abstractions of pruning and state sync recently. I would love his input here too with what you're trying to do.

Are you looking for feedback on the PR as it is right now?

chillyvee · 2022-05-03T00:22:28Z

Yes @alexanderbez would be great to get feedback as PR stands now. If @p0mvn has more suggestions based on his work I’m open to it.

p0mvn · 2022-05-03T02:59:17Z

Hi everyone,

@alexanderbez, thank you, hope I can help

@chillyvee, nice to e-meet you

First of all, I can see that a lot of thought has been put into this, great work!

I looked at the implementation and skimmed the discussion. I have the following question about the current design:

how are we handling the case when retroactive historical heights are partially pruned and the node crashes?
- It seems that historical pruning progress is lost on restart after failure

Also, I would like to note that the abstractions have changed in #11496. Now, there is a pruning manager that is responsible for handling all the pruning logic. If you're planning to continue with this, could you please merge/rebase your branch against main and let it work with the new abstraction (moving all pruning handling logic to the manager)?

Although you've put some excellent work into this PR, I'm concerned about doing the additional historical pruning on top of the regular one. As I mentioned above, the pruning state seems to be undefined if a node crashes with a historical state in progress. Also, I think it introduces a level of complexity that makes it more brittle and difficult to maintain.

I see the following suggestion above by Marko :

there is another way to do this that would allow for way smaller dbs. Basically we would add two commands. One is to snapshot the current height, the second is to restore a iavl tree from a snapshot. This would in turn delete application.db and recreate it with only one version instead of deleting and pruning.

What do you think about this approach? I feel like this could be a safer way without the weirdness after a node crash and extra complexity to maintain. I understand that this approach is required to be done during start-up, and I think that's where it should be done. Implementing the retroactive pruning during Commit(...) is hard to get right with all the edge cases related to where a node may crash and other concurrency issues.

chillyvee · 2022-05-03T03:26:43Z

@p0mvn nice to meet you and thank you for the comments.

Adopting the prune manager should be fine.

Regarding a crash during prune - doesn’t seem like a new problem since a daemon could crash during prune “right now” in the current branch and just leave garbage behind anyways.

A crash pruning now-100 vs pruning (now-100 and now-10000) does not seem any different in risk.

Also we are pruning heights that we would pruned anyway with another config, so it doesn’t seem like a radical thing to do. Is it more risky to prune height 12345 tomorrow than today?

Feel free to tell me why I am wrong since I absolutely have no idea about what kind of risks old block pruning has :) I figure if a block is safe to prune now, it's safe to prune tomorrow as well.

We could keep a key/value bookmark, but again my intent is not to be fast, but rather be done “some day”. So restarting at the new height and working backwards is acceptable in this “eventual” approach. Pruning heights that don’t exist more than once is merely inefficient as far as I can tell.

I could make the prune more aggressive and time based (as much as possible in 500ms) but that makes it even harder to test. Would be better to know when the process is least busy, but again finishing the complete prune quickly is not a goal. If you like this idea, can you give feedback on what triggers should cause a stop of additional prune heights such as a prevote starting or a certain amount of gas used by cosmwasm? Might need to hook in other modules, and maybe it’s better to leave for a future revision.

To discuss the snapshot and restart approach 1) the reason for this approach is to avoid downtime 2) seeing that Juno takes about 40 minutes to start from genesis skipping invariant checks or 6-10 hours with invariant checks, I hesitate against an approach requiring a rebuild.

Is a snapshot and db swap much faster than I fear?

Let me know what you think.

faddat · 2022-05-09T13:39:01Z

@chillyvee I just wanna cheer you on re: this PR!

I think it will be tremendously useful for:

validation
relaying
rpc node provision

thank you!

chillyvee · 2022-05-09T14:47:13Z

Thanks @faddat !

We really hope this will be a help to everyone on the validator / relayer / rpc infrastructure level.

It's not a breakthrough thing, but it's an incremental "quality of life" improvement we think everyone will benefit from.

@p0mvn - When you have a chance, would love to know what you think :)

chillyvee · 2022-05-14T17:58:50Z

Checking in to see if @p0mvn or @alexanderbez has any comment

chillyvee · 2022-05-18T10:03:14Z

Hello! Checking in again for comment :)

alexanderbez

I would have to delegate review to @p0mvn right now as I'm focused on getting v0.46 out. After that's out, I'd be happy to take a look at this again if @p0mvn hasn't already.

p0mvn · 2022-05-19T15:33:15Z

Hi @chillyvee

Thanks for your patience and sorry for the delay.

You've done an incredible job, and I think that retroactive pruning is a useful feature. However, I still have some thoughts in favor of doing it during start-up / not during runtime. I'll try to address this and our previous discussions. Please let me know what you think.

Regarding a crash during prune - doesn’t seem like a new problem since a daemon could crash during prune “right now” in the current branch and just leave garbage behind anyways.

Currently, there is a flush and restore mechanism that prevents the heights from being lost even on a crash:

cosmos-sdk/pruning/manager.go

Lines 130 to 131 in f470d40

    
           // flush the updates to disk so that they are not lost if crash happens. 
        
           if err := m.db.SetSync(pruneHeightsKey, int64SliceToBytes(m.pruneHeights)); err != nil {

With the proposed design, if there is a crash, how is the following information restored?

cosmos-sdk/store/rootmulti/store.go

Lines 57 to 59 in b54cfde

    
           histPruneHeight   int64 // CVCH: -1 Unstarted; 0 Finished; +# Current Height we are pruning down from 
        
           histPruneAmount   int64 // CVCH: Positive Integer of blocks to prune @ Each pruning-interval 
        
           histPruneFinished bool  // CVCH: true at start, then falsewhen histPrune is complete

Additionally, what if you are on a pruning configuration A, you switch to a configuration B and start retroactively pruning the changes during runtime. Then, in the process of doing that, you restart the node and decide to switch from configuration B to C while the system is still reconciling the changes between A and B. How is it going to work next? Is the "in-process" retroactive pruning just lost?

This is another concern. If you rebase / merge main into this branch, you will see that the codebase looks very different. There are also several goroutines involved now that manage the pruning logic. That's another reason why I think that adding the retroactive pruning during runtime would make it hard to test thoroughly. Additionally, if a bug occurs, I don't know how one would go about debugging.

my intent is not to be fast, but rather be done “some day”. So restarting at the new height and working backwards is acceptable in this “eventual” approach

We also want to make sure that we are designing for soundness, readability, and ease of debugging. I don't know how we would achieve all these when doing retroactive pruning during runtime. Additionally, it is important to think about the cases such as swapping out configurations during retroactive pruning. If any edge case in terms of usage is possible, it will most certainly happen. Users are unpredictable.

To discuss the snapshot and restart approach 1) the reason for this approach is to avoid downtime 2) seeing that Juno takes about 40 minutes to start from genesis skipping invariant checks or 6-10 hours with invariant checks, I hesitate against an approach requiring a rebuild.

Between less downtime and more safety, I would prefer safety. Especially, in an environment where multiple chains will use this codebase and developers might not have the best knowledge about how to debug this logic.

That's why I'm advocating for doing the retroactive pruning during start-up.

If you prefer to go with the current approach, I would like to ask you to rebase/merge the main into this branch and we can go from there. However, I'm concerned about whether it is possible to get this change in an acceptable state due to the reasons outlined above.

chillyvee · 2022-05-19T17:43:01Z

That's why I'm advocating for doing the retroactive pruning during start-up.

That's fantastic feedback. Let me check if there are any points that can have a new strategy to address. Otherwise, I'll target pruning on startup.

github-actions · 2022-08-03T00:05:09Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tac0turtle · 2022-08-03T08:05:53Z

@chillyvee sorry for not coming back to this pr sooner. Do you still have interest in completing it?

chillyvee · 2022-08-03T11:10:00Z

Hi @marbar3778 hope you're doing well. I would like to continue working on this, but it won't be for another month or so. We can leave it open, or if it helps we can close it and reopen later.

Appreciate you checking in! :)

tac0turtle · 2022-08-15T13:01:07Z

gonna close this but keep the branch around

chillyvee · 2022-10-12T15:34:41Z

Attempting to take the snapshot approach. Was able to make some small changes to cosmos-sdk and tendermint to enable local snapshot restores for Juno's uni tesnet. Would you folks give me some comments/guidance on the code? :)

@tac0turtle @alexanderbez @p0mvn @faddat

#13521
tendermint/tendermint#9541

chillyvee requested review from aaronc and alexanderbez as code owners January 30, 2022 10:49

chillyvee changed the title ~~feat: Clean House - prune very old blocks from store~~ feat: (WIP) Clean House - prune very old blocks from store Jan 30, 2022

chillyvee force-pushed the cvch1 branch from 58ecdcb to 113d717 Compare January 30, 2022 10:53

tac0turtle changed the title ~~feat: (WIP) Clean House - prune very old blocks from store~~ feat: Clean House - prune very old blocks from store Jan 31, 2022

tac0turtle marked this pull request as draft January 31, 2022 16:46

chillyvee force-pushed the cvch1 branch from 113d717 to 71adf98 Compare February 1, 2022 12:51

chillyvee added 3 commits April 20, 2022 23:19

CVCH WIP extended test ok

2f7b56c

CVCH Clean Basic Tests

92e8214

CVCH Implement KeepRecent Tests for increase and decrease

a3d7b9d

chillyvee force-pushed the cvch1 branch from eccee04 to a3d7b9d Compare April 20, 2022 17:20

CVCH: Adjust original tests for PruningRestart to accomodate duplicat…

9b3dd8c

…e heights from pruneHist

CVCH - fix example that only applies to earlier cosmos-sdk, but keep …

b54cfde

…it for now

alexanderbez reviewed May 19, 2022

View reviewed changes

yihuang mentioned this pull request Jul 26, 2022

store/rootmulti: Add a cli cmd to prune old states according to current settings #12727

Closed

github-actions bot added the stale label Aug 3, 2022

github-actions bot removed the stale label Aug 4, 2022

tac0turtle closed this Aug 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Retroactive prune very old blocks from store #11064

feat: Retroactive prune very old blocks from store #11064

chillyvee commented Jan 30, 2022 •

edited

Loading

blushi commented Jan 31, 2022

tac0turtle commented Jan 31, 2022

alexanderbez commented Apr 14, 2022

chillyvee commented Apr 14, 2022

chillyvee commented Apr 20, 2022

chillyvee commented Apr 30, 2022

alexanderbez commented May 2, 2022

chillyvee commented May 3, 2022

p0mvn commented May 3, 2022

chillyvee commented May 3, 2022 •

edited

Loading

faddat commented May 9, 2022

chillyvee commented May 9, 2022

chillyvee commented May 14, 2022

chillyvee commented May 18, 2022

alexanderbez left a comment

p0mvn commented May 19, 2022 •

edited

Loading

chillyvee commented May 19, 2022

github-actions bot commented Aug 3, 2022

tac0turtle commented Aug 3, 2022

chillyvee commented Aug 3, 2022

tac0turtle commented Aug 15, 2022

chillyvee commented Oct 12, 2022 •

edited

Loading

feat: Retroactive prune very old blocks from store #11064

feat: Retroactive prune very old blocks from store #11064

Conversation

chillyvee commented Jan 30, 2022 • edited Loading

Description

Questions

Author Checklist

Reviewers Checklist

blushi commented Jan 31, 2022

tac0turtle commented Jan 31, 2022

alexanderbez commented Apr 14, 2022

chillyvee commented Apr 14, 2022

chillyvee commented Apr 20, 2022

chillyvee commented Apr 30, 2022

alexanderbez commented May 2, 2022

chillyvee commented May 3, 2022

p0mvn commented May 3, 2022

chillyvee commented May 3, 2022 • edited Loading

faddat commented May 9, 2022

chillyvee commented May 9, 2022

chillyvee commented May 14, 2022

chillyvee commented May 18, 2022

alexanderbez left a comment

Choose a reason for hiding this comment

p0mvn commented May 19, 2022 • edited Loading

chillyvee commented May 19, 2022

github-actions bot commented Aug 3, 2022

tac0turtle commented Aug 3, 2022

chillyvee commented Aug 3, 2022

tac0turtle commented Aug 15, 2022

chillyvee commented Oct 12, 2022 • edited Loading

chillyvee commented Jan 30, 2022 •

edited

Loading

chillyvee commented May 3, 2022 •

edited

Loading

p0mvn commented May 19, 2022 •

edited

Loading

chillyvee commented Oct 12, 2022 •

edited

Loading