-
Notifications
You must be signed in to change notification settings - Fork 11.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Make proper per epoch execution components #18281
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
3 Ignored Deployments
|
b7ee7cd
to
1ab2f14
Compare
1ab2f14
to
cecea98
Compare
cecea98
to
63703b0
Compare
63703b0
to
db69b5e
Compare
db69b5e
to
30b3ba4
Compare
// However, since checkpoint_service shutdown is asynchronous, we can't trust | ||
// strong reference count. However it should still be safe to swap out state | ||
// accumulator since we know we have already reached EndOfPublish | ||
let new_accumulator = Arc::new(StateAccumulator::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might make more sense to use Weak
in the places that we think shouldn't be holding old refs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, done
Arc::new(this) | ||
} | ||
} | ||
|
||
pub enum StateAccumulator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that v1 and v2 has identical struct layout, just that they behave differently. This doesn't quite justify the versioned structs, as the two structs are identical.
I think it would be cleaner to:
- Use a single struct, but keep a v2 bool flag.
- Expose a metics() function that clones the metrics, so that you don't have to put it in sui-node
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re: (2) -- the reason we need sui-node to hold the metrics is because we're making new state accumulators each epoch, and if we re-create the metrics alongside the accumulator, we will panic due to re-registering the same keys against the prometheus registry. This, btw, is the same pattern that is used for validator per epoch components.. relevant metrics are encapsulated in ValidatorComponents
and stored in sui-node
Re: (1) -- eh it seems much more readable to me to just have two separate structs and deprecate v1 when we can, rather than have giant if/else matching on a boolean (or having seperate functions depending on the bool value, which would then be structurally equivalent to having two separate structs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, accumulate_running_root
function does not really exist for v1. It seems more awkward to me to have the same object that has functions that are no-op and don't make sense to call depending on the value of a boolean flag, or have the caller decide what to call. The current design abstracts the implementation from the caller while also making the semantics clear based on the host environment.
Anyway, the v1 and v2 isn't permanent. It's just for safe rollout. Only one version will exist after all validators have upgraded and we can do cleanup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we re-create the metrics alongside the accumulator, we will panic due to re-registering the same keys against the prometheus registry
What I meant is to simply clone the metrics out of the accumulator, instead of from the sui-node. If the struct is flat, then cloning that would be much easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. Makes sense, will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
further, the bool flag is probably unnecessary - we already pass in the epoch store to all the methods which need to know which version they are, so they could just call state_accumulator_v2_enabled()
to do the dispatch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I could see that being cleaner. Let me do that in a separate PR though, it has nothing to do with this PR's objective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(The metric cloning suggestion is now done in this PR. But do the v1/v2 delegation in a separate PR)
0ad5c10
to
46c8eb1
Compare
@@ -859,7 +860,7 @@ pub struct CheckpointBuilder { | |||
notify: Arc<Notify>, | |||
notify_aggregator: Arc<Notify>, | |||
effects_store: Arc<dyn TransactionCacheRead>, | |||
accumulator: Arc<StateAccumulator>, | |||
accumulator: Weak<StateAccumulator>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this Weak? We recreate checkpoint builder every epoch, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the comments between @mystenmark and I from a previous commit, but to answer here, yes we recreate checkpoint builder every epoch, but this allows us to enforce the invariant in general that there are no dangling references to state accumulator when we tear it down and replace it. To safeguard against a bug creeping in if some other component later references state accumulator for example.
We can't make that guarantee if we have this remain an Arc
because in that case we would race with checkpoint builder shutdown to count the strong references (checkpoint service is shutdown via an async message channel)
c4caa3c
to
72de287
Compare
da2e5d8
to
02b9e25
Compare
02b9e25
to
977afbd
Compare
## Description Make state accumulator a per epoch component. This also requires that we make checkpoint executor a proper per epoch component (previously it was only instantiated once, but `run_epoch` called once per epoch) since it needs to have a reference to the fresh accumulator after reconfig. Now we actually drop it after the call to `run_epoch` returns. ## Test plan Passed against 120+ seeds: ``` ./scripts/simtest/seed-search.py simtest --test test_simulated_load_reconfig_with_crashes_and_delays ``` --- ## Release notes Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required. For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates. - [ ] Protocol: - [ ] Nodes (Validators and Full nodes): - [ ] Indexer: - [ ] JSON-RPC: - [ ] GraphQL: - [ ] CLI: - [ ] Rust SDK: --------- Co-authored-by: Mark Logan <[email protected]>
## Description Make state accumulator a per epoch component. This also requires that we make checkpoint executor a proper per epoch component (previously it was only instantiated once, but `run_epoch` called once per epoch) since it needs to have a reference to the fresh accumulator after reconfig. Now we actually drop it after the call to `run_epoch` returns. ## Test plan Passed against 120+ seeds: ``` ./scripts/simtest/seed-search.py simtest --test test_simulated_load_reconfig_with_crashes_and_delays ``` --- ## Release notes Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required. For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates. - [ ] Protocol: - [ ] Nodes (Validators and Full nodes): - [ ] Indexer: - [ ] JSON-RPC: - [ ] GraphQL: - [ ] CLI: - [ ] Rust SDK: --------- Co-authored-by: Mark Logan <[email protected]>
## Description CP #18281 ## Test plan How did you test the new or updated feature? --- ## Release notes Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required. For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates. - [ ] Protocol: - [ ] Nodes (Validators and Full nodes): - [ ] Indexer: - [ ] JSON-RPC: - [ ] GraphQL: - [ ] CLI: - [ ] Rust SDK: Co-authored-by: Mark Logan <[email protected]>
## Description Make state accumulator a per epoch component. This also requires that we make checkpoint executor a proper per epoch component (previously it was only instantiated once, but `run_epoch` called once per epoch) since it needs to have a reference to the fresh accumulator after reconfig. Now we actually drop it after the call to `run_epoch` returns. ## Test plan Passed against 120+ seeds: ``` ./scripts/simtest/seed-search.py simtest --test test_simulated_load_reconfig_with_crashes_and_delays ``` --- ## Release notes Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required. For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates. - [ ] Protocol: - [ ] Nodes (Validators and Full nodes): - [ ] Indexer: - [ ] JSON-RPC: - [ ] GraphQL: - [ ] CLI: - [ ] Rust SDK: --------- Co-authored-by: Mark Logan <[email protected]>
Description
Make state accumulator a per epoch component. This also requires that we make checkpoint executor a proper per epoch component (previously it was only instantiated once, but
run_epoch
called once per epoch) since it needs to have a reference to the fresh accumulator after reconfig. Now we actually drop it after the call torun_epoch
returns.Test plan
Passed against 120+ seeds:
Release notes
Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.
For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.