-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: snapshot and pruning functionality #140
Conversation
669c0a4
to
439562f
Compare
a199c6f
to
bddfe10
Compare
Tested with It functions without issues. There is one caveat where, sometimes it happens so that the snapshot cannot be taken since
It is then pruned at the next This should not be a problem, just good to be aware. Example from live node:
Conclusion:
|
Checking data folder for space at height 3540717:
|
Checking data folder at height 3547769:
|
), | ||
maxAgeBlocks: 0, | ||
commitHeight: 499000, | ||
expected: 490000, | ||
expected: 489000, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has changed because now we use snapshot-interval
in place of pruning-keep-every
. Although I added sdk.NewSnapshotOptions(10000, 1))
, snapshot interval is used as another strategy for calculating the block retention height and it just takes precedence over the old strategy used by this unit test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome job! I really like this improved abstraction between pruning and snapshot management you've made.
So far I've mostly gone through the boilerplate and some of the logic. Going to keep reviewing later today
this is awesome!! Id love to get this upstreamed into the sdk |
Thanks. Will push this upstream once approved |
Just want to reference this here: cosmos#11152 As I removed the |
baseapp/baseapp.go
Outdated
rms, ok := app.cms.(*rootmulti.Store) | ||
if !ok { | ||
return errors.New("state sync snapshots require a rootmulti store") | ||
} | ||
if err := rms.GetPruning().Validate(); err != nil { | ||
return err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this still be inside of an app.snapshotManager != nil
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No because pruning options are set independently from snapshot settings. If we do not explicitly set pruning options, pruning-nothing is set by default.
Pruning can and should work independently from snapshots and vice versa.
However, if snapshot are set, then it supplies snapshot-interval
to the pruning.Manager
. As a result, pruning.Manager
knows which heights to skip until after the snapshot is taken.
Let me know if that makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I'm a bit thrown off that app.cms has to be a root multi store here. Perhaps the logic should be:
rms, ok := app.cms.(*rootmulti.Store) | |
if !ok { | |
return errors.New("state sync snapshots require a rootmulti store") | |
} | |
if err := rms.GetPruning().Validate(); err != nil { | |
return err | |
} | |
rms, ok := app.cms.(*rootmulti.Store) | |
if !ok && app.snapshotInterval > 0 { | |
return errors.New("state sync snapshots require a rootmulti store") | |
} else if ok { | |
if err := rms.GetPruning().Validate(); err != nil { | |
return err | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct, there should have been a check similar to the suggested:
if !ok && app.snapshotManager != nil {
return errors.New("state sync snapshots require a rootmulti store")
}
if err := rms.GetPruning().Validate(); err != nil {
return err
}
Updated. Thanks for catching that
type Manager struct { | ||
logger log.Logger | ||
opts *types.PruningOptions | ||
snapshotInterval uint64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think its better for pruning and snapshots to be managed separately, or should one have a reference to the other?
I'm a bit confused about the interaction interface between the two managers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pruning can work independently from snapshots if those are not enabled. In that case, snapshotInterval
is 0 and we do not skip any heights.
However, if snapshot settings are set, then the snapshot manager is also set and the snapshotInterval
within the pruningManager
is non-negative (it is set here).
In such a case, snapshotInterval
functions as the old pruning-keep-every
so that there is no race between pruning and snapshots where we get the "active reader" error in iavl. There is one important difference - when the snapshot is complete, snapshot.Manager
"tells" pruning.Manager
to prune that height away with the following call:
https://github.com/osmosis-labs/cosmos-sdk/blob/8d4d6c7d718804d65330ff24a285392a74434131/snapshots/manager.go#L139
As a result, the snapshotInterval
height is eventually pruned.
I agree that pruning and snapshot managers are coupled to each other. However, pruning and snapshots are coupled functionally so there is no way without creating the 2 components without these interactions. Functionally, the 2 components are independent - you can set pruning settings without snapshots and you can set snapshot settings without pruning.
Please let me know if I can explain anything further or if you have any suggestions on what to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To help with understanding this complexity, I can create UML and sequence diagrams if you think that would be helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh the pruning manager is being set here: https://github.com/osmosis-labs/cosmos-sdk/pull/140/files#diff-84a22ba89cc7868bbafc59e07c35ba0bffb05115962cb885bf39eef1a1a8a66eR70
So: Pruning manager always needed, Snapshot manager optional, if snapshot manager set, must update a certain param in the pruning manager? It does so on initialization of snapshot manager.
If this parameter is set, the pruning manager will not delete any snapshot height, and it is up to the snapshot manager to prune it?
I don't think we need a UML diagram, would be convenient if we can add a note to this effect in the README for the pruning and snapshot folder though. (Unless theres a better place for it to live) And on the SetSnapshotInterval function https://github.com/osmosis-labs/cosmos-sdk/pull/140/files#diff-309b9123bdfc80cc367cc95308580b5489e26406b9f5f0920ac55c5b198a5a33R112-R115
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also if this is the case, the interaction front now 'clicks' for me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh the pruning manager is being set here: https://github.com/osmosis-labs/cosmos-sdk/pull/140/files#diff-84a22ba89cc7868bbafc59e07c35ba0bffb05115962cb885bf39eef1a1a8a66eR70
Yes, that's how the pruning manager gets its snapshotInterval
member field updated if the snapshot options/manager are set
So: Pruning manager always needed, Snapshot manager optional, if snapshot manager set, must update a certain param in the pruning manager? It does so on initialization of snapshot manager.
That's correct
If this parameter is set, the pruning manager will not delete any snapshot height, and it is up to the snapshot manager to prune it?
That is mostly correct with one subtle difference - snapshot manager "tells" pruning manager to prune the snapshot height once it is done by calling this method. That is, the snapshot manager does not prune itself. Instead, it delegates that to the pruning manager.
would be convenient if we can add a note to this effect in the README for the pruning and snapshot folder though.
Will work on that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the READMEs
@p0mvn could you summarize the main differences between this and the PR in the SDK? Is this PR mainly covering snapshot manager changes? |
Definitely @alexanderbez. Thanks for linking the SDK issue. I should have looked upstream first because the PR you linked is a large subset of the work done here. On top of the work upstream, this PR lets The reason for that is Osmosis nodes attempting to prune a height that is currently under the snapshot. As a result, they would fail with error. By avoiding pruning the "snapshot heights", that issue never occurs. This problem started being particularly evident after the fast storage iavl changes. However, I'd guess that even without the IAVL changes, this is still possible if you have rigorous pruning with frequent snapshots enabled. To temporarily mitigate, we were forced to require validators to have unnecessarily large As a result of the important difference I mentioned above, we are not polluting the disk by keeping the heights forever as is the case with old Other changes that stem from the main one:
|
func (m *Manager) HandleHeightSnapshot(height int64) { | ||
if m.opts.GetPruningStrategy() == types.PruningNothing { | ||
return | ||
} | ||
m.mx.Lock() | ||
defer m.mx.Unlock() | ||
m.logger.Debug("HandleHeightSnapshot", "height", height) // TODO: change log level to Debug | ||
m.pruneSnapshotHeights.PushBack(height) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if the node halts, is it possible that the snapshot height never gets pruned? Or is this list saved to disk anywhere, and loaded from disk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we should be calling flushPruningSnapshotHeights
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are flushed here:
cosmos-sdk/store/rootmulti/store.go
Line 993 in a619355
rs.pruningManager.FlushPruningHeights(batch) |
And loaded here:
cosmos-sdk/store/rootmulti/store.go
Line 270 in a619355
if err := rs.pruningManager.LoadPruningHeights(rs.db); err != nil { |
I followed the same logic to how we flush/load commit info and regular pruning heights
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idt I get it, why don't we push this metadata to disk ASAP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to merge the PR though, and talk about this in a follow-up issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue created: #149
Needs osmosis-labs/iavl#36 to be merged to fix the data race test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM now! Thanks for adding those docs
I think we should make a follow-up issue about flushing the snapshotPruneHeights immediately once one is added, but I agree that whats done now should preserve the existing behavior
Starting burn tests at various configurations. Will run these nodes until tomorrow Tests:
|
Checking data folder size on disk with various configurations. Approx height:
|
Approximate height: 3710366
|
This is the diff between height @ValarDragon @alexanderbez could you help me understand if this is acceptable, please? For example, I see that on Is this a problem? If yes, do you have any suggestions on how to go about it? |
Also, They should be roughly the same because |
Prune everything On more thought, this might still be fine due to how level db works - we keep adding more and more levels as we compact data. I checked manually by querying the prune everything node. The versions get deleted as we would expect |
Description
Story 1
Since fast storage was implemented, we have had to encourage node operators to have a large
pruning-keep-recent
value. Otherwise, it would be possible to prune a height currently under snapshot and panic. Usingpruning-keep-every
that is equal tosnapshot-interval
was not a recommended solution. Currently,pruning-keep-every
values are never deleted.Story 2
In addition, several node operators reported that their disk size significantly grows over time despite the fact that pruning is enabled. More rigorous testing is required to eliminate the possibility of pruning errors.
Story 3
Currently,
pruning-keep-every
is used for the same purpose assnapshot-interval
.Story 4
The various combinations of snapshot and pruning are not well-understood and not rigorously tested.
In This Pull Request
Pruning & Snapshot Configurations Update
pruning-keep-every
is removed. Functionally, it is replaced bysnapshot-interval
. However, once the snapshot is complete, the snapshotted height is pruned away. There is no use case forpruning-keep-every
other thansnapshot-interval
Pruning Eventual Consistency
When snapshots are enabled, pruning would have the eventual consistency semantics. That is, some heights (the snapshot heights) are only pruned after the snapshot is complete
Rigorous Testing and Bug Fixes
This PR adds a lot of unit tests at various levels of abstraction, including configuration, pruning and snapshot components, abci.
Notes
pruning-interval
, when height X is being committed, we never prune X itself but only heights that are less than X. In other words, we never prune the current height being committed. However, we use X (the current height) to determine if now is the time to prune.=> So we never end up pruning a height during the same commit while taking a snapshot for that height
E.g.