Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Make proper per epoch execution components #18281

Merged
merged 5 commits into from
Jun 25, 2024

Conversation

williampsmith
Copy link
Contributor

@williampsmith williampsmith commented Jun 16, 2024

Description

Make state accumulator a per epoch component. This also requires that we make checkpoint executor a proper per epoch component (previously it was only instantiated once, but run_epoch called once per epoch) since it needs to have a reference to the fresh accumulator after reconfig. Now we actually drop it after the call to run_epoch returns.

Test plan

Passed against 120+ seeds:

./scripts/simtest/seed-search.py simtest --test test_simulated_load_reconfig_with_crashes_and_delays

Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • Indexer:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:

Copy link

vercel bot commented Jun 16, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
sui-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jun 25, 2024 6:35pm
3 Ignored Deployments
Name Status Preview Comments Updated (UTC)
multisig-toolkit ⬜️ Ignored (Inspect) Visit Preview Jun 25, 2024 6:35pm
sui-kiosk ⬜️ Ignored (Inspect) Visit Preview Jun 25, 2024 6:35pm
sui-typescript-docs ⬜️ Ignored (Inspect) Visit Preview Jun 25, 2024 6:35pm

// However, since checkpoint_service shutdown is asynchronous, we can't trust
// strong reference count. However it should still be safe to swap out state
// accumulator since we know we have already reached EndOfPublish
let new_accumulator = Arc::new(StateAccumulator::new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might make more sense to use Weak in the places that we think shouldn't be holding old refs

Copy link
Contributor Author

@williampsmith williampsmith Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done

Arc::new(this)
}
}

pub enum StateAccumulator {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that v1 and v2 has identical struct layout, just that they behave differently. This doesn't quite justify the versioned structs, as the two structs are identical.
I think it would be cleaner to:

  1. Use a single struct, but keep a v2 bool flag.
  2. Expose a metics() function that clones the metrics, so that you don't have to put it in sui-node

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: (2) -- the reason we need sui-node to hold the metrics is because we're making new state accumulators each epoch, and if we re-create the metrics alongside the accumulator, we will panic due to re-registering the same keys against the prometheus registry. This, btw, is the same pattern that is used for validator per epoch components.. relevant metrics are encapsulated in ValidatorComponents and stored in sui-node

Re: (1) -- eh it seems much more readable to me to just have two separate structs and deprecate v1 when we can, rather than have giant if/else matching on a boolean (or having seperate functions depending on the bool value, which would then be structurally equivalent to having two separate structs)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, accumulate_running_root function does not really exist for v1. It seems more awkward to me to have the same object that has functions that are no-op and don't make sense to call depending on the value of a boolean flag, or have the caller decide what to call. The current design abstracts the implementation from the caller while also making the semantics clear based on the host environment.

Anyway, the v1 and v2 isn't permanent. It's just for safe rollout. Only one version will exist after all validators have upgraded and we can do cleanup

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we re-create the metrics alongside the accumulator, we will panic due to re-registering the same keys against the prometheus registry

What I meant is to simply clone the metrics out of the accumulator, instead of from the sui-node. If the struct is flat, then cloning that would be much easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. Makes sense, will do

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

further, the bool flag is probably unnecessary - we already pass in the epoch store to all the methods which need to know which version they are, so they could just call state_accumulator_v2_enabled() to do the dispatch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I could see that being cleaner. Let me do that in a separate PR though, it has nothing to do with this PR's objective.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The metric cloning suggestion is now done in this PR. But do the v1/v2 delegation in a separate PR)

@@ -859,7 +860,7 @@ pub struct CheckpointBuilder {
notify: Arc<Notify>,
notify_aggregator: Arc<Notify>,
effects_store: Arc<dyn TransactionCacheRead>,
accumulator: Arc<StateAccumulator>,
accumulator: Weak<StateAccumulator>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this Weak? We recreate checkpoint builder every epoch, right?

Copy link
Contributor Author

@williampsmith williampsmith Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comments between @mystenmark and I from a previous commit, but to answer here, yes we recreate checkpoint builder every epoch, but this allows us to enforce the invariant in general that there are no dangling references to state accumulator when we tear it down and replace it. To safeguard against a bug creeping in if some other component later references state accumulator for example.

We can't make that guarantee if we have this remain an Arc because in that case we would race with checkpoint builder shutdown to count the strong references (checkpoint service is shutdown via an async message channel)

@williampsmith williampsmith requested a review from lxfind June 18, 2024 19:15
@williampsmith williampsmith force-pushed the william/per-epoch-execution-components branch from 02b9e25 to 977afbd Compare June 25, 2024 18:33
@williampsmith williampsmith merged commit 4078308 into main Jun 25, 2024
47 checks passed
@williampsmith williampsmith deleted the william/per-epoch-execution-components branch June 25, 2024 19:10
williampsmith added a commit that referenced this pull request Jun 27, 2024
## Description 

Make state accumulator a per epoch component. This also requires that we
make checkpoint executor a proper per epoch component (previously it was
only instantiated once, but `run_epoch` called once per epoch) since it
needs to have a reference to the fresh accumulator after reconfig. Now
we actually drop it after the call to `run_epoch` returns.

## Test plan 

Passed against 120+ seeds:
```
./scripts/simtest/seed-search.py simtest --test test_simulated_load_reconfig_with_crashes_and_delays
```

---

## Release notes

Check each box that your changes affect. If none of the boxes relate to
your changes, release notes aren't required.

For each box you select, include information after the relevant heading
that describes the impact of your changes that a user might notice and
any actions they must take to implement updates.

- [ ] Protocol: 
- [ ] Nodes (Validators and Full nodes): 
- [ ] Indexer: 
- [ ] JSON-RPC: 
- [ ] GraphQL: 
- [ ] CLI: 
- [ ] Rust SDK:

---------

Co-authored-by: Mark Logan <[email protected]>
williampsmith added a commit that referenced this pull request Jun 27, 2024
## Description 

Make state accumulator a per epoch component. This also requires that we
make checkpoint executor a proper per epoch component (previously it was
only instantiated once, but `run_epoch` called once per epoch) since it
needs to have a reference to the fresh accumulator after reconfig. Now
we actually drop it after the call to `run_epoch` returns.

## Test plan 

Passed against 120+ seeds:
```
./scripts/simtest/seed-search.py simtest --test test_simulated_load_reconfig_with_crashes_and_delays
```

---

## Release notes

Check each box that your changes affect. If none of the boxes relate to
your changes, release notes aren't required.

For each box you select, include information after the relevant heading
that describes the impact of your changes that a user might notice and
any actions they must take to implement updates.

- [ ] Protocol: 
- [ ] Nodes (Validators and Full nodes): 
- [ ] Indexer: 
- [ ] JSON-RPC: 
- [ ] GraphQL: 
- [ ] CLI: 
- [ ] Rust SDK:

---------

Co-authored-by: Mark Logan <[email protected]>
williampsmith added a commit that referenced this pull request Jun 27, 2024
## Description 

CP #18281

## Test plan 

How did you test the new or updated feature?

---

## Release notes

Check each box that your changes affect. If none of the boxes relate to
your changes, release notes aren't required.

For each box you select, include information after the relevant heading
that describes the impact of your changes that a user might notice and
any actions they must take to implement updates.

- [ ] Protocol: 
- [ ] Nodes (Validators and Full nodes): 
- [ ] Indexer: 
- [ ] JSON-RPC: 
- [ ] GraphQL: 
- [ ] CLI: 
- [ ] Rust SDK:

Co-authored-by: Mark Logan <[email protected]>
tx-tomcat pushed a commit to tx-tomcat/sui-network that referenced this pull request Jul 29, 2024
## Description 

Make state accumulator a per epoch component. This also requires that we
make checkpoint executor a proper per epoch component (previously it was
only instantiated once, but `run_epoch` called once per epoch) since it
needs to have a reference to the fresh accumulator after reconfig. Now
we actually drop it after the call to `run_epoch` returns.

## Test plan 

Passed against 120+ seeds:
```
./scripts/simtest/seed-search.py simtest --test test_simulated_load_reconfig_with_crashes_and_delays
```

---

## Release notes

Check each box that your changes affect. If none of the boxes relate to
your changes, release notes aren't required.

For each box you select, include information after the relevant heading
that describes the impact of your changes that a user might notice and
any actions they must take to implement updates.

- [ ] Protocol: 
- [ ] Nodes (Validators and Full nodes): 
- [ ] Indexer: 
- [ ] JSON-RPC: 
- [ ] GraphQL: 
- [ ] CLI: 
- [ ] Rust SDK:

---------

Co-authored-by: Mark Logan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants