
[Merged by Bors] - Run fork choice before block proposal #3168

Closed

Conversation

michaelsproul
Member

@michaelsproul michaelsproul commented May 5, 2022

Issue Addressed

Upcoming spec change ethereum/consensus-specs#2878

Proposed Changes

  1. Run fork choice at the start of every slot, and wait for this run to complete before proposing a block.
  2. As an optimisation, also run fork choice 3/4 of the way through the slot (at 9s), dequeueing attestations for the next slot.
  3. Remove the fork choice run from the state advance timer that occurred before advancing the state.

Additional Info

Block Proposal Accuracy

This change makes us more likely to propose on top of the correct head in the presence of re-orgs with proposer boost in play. The main scenario that this change is designed to address is described in the linked spec issue.

Attestation Accuracy

This change _also_ makes us more likely to attest to the correct head. Currently in the case of a skipped slot at `slot` we only run fork choice 9s into `slot - 1`. This means the attestations from `slot - 1` aren't taken into consideration, and any boost applied to the block from `slot - 1` is not removed (it should be). In the language of the linked spec issue, this means we are liable to attest to C, even when the majority voting weight has already caused a re-org to B.

Why remove the call before the state advance?

If we've run fork choice at the start of the slot then it has already dequeued all the attestations from the previous slot, which are the only ones eligible to influence the head in the current slot. Running fork choice again is unnecessary (unless we run it for the next slot and try to pre-empt a re-org, but I don't currently think this is a great idea).

Performance

Based on Prater testing this adds about 5-25ms of runtime to block proposal times, which are 500-1000ms on average (and spike to 5s+ sometimes due to state handling issues 😢 ). I believe this is a small enough penalty to enable it by default, with the option to disable it via the new flag --fork-choice-before-proposal-timeout 0. Upcoming work on block packing and state representation will also reduce block production times in general, while removing the spikes.

Implementation

Fork choice gets invoked at the start of the slot via the per_slot_task function called from the slot timer. It then uses a condition variable to signal to block production that fork choice has been updated. This is a bit funky, but it seems to work. One downside of the timer-based approach is that it doesn't happen automatically in most of the tests. The test added by this PR has to trigger the run manually.
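The condition-variable signalling described here can be sketched with `Mutex` and `Condvar` from the standard library. This is a hypothetical illustration, not Lighthouse's actual types: the struct and method names are invented, and the timeout values are examples only.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical sketch of the signalling pattern described above (names are
// invented for illustration; this is not Lighthouse's implementation).
struct ForkChoiceSignal {
    latest_slot: Mutex<Option<u64>>,
    condvar: Condvar,
}

impl ForkChoiceSignal {
    fn new() -> Self {
        Self {
            latest_slot: Mutex::new(None),
            condvar: Condvar::new(),
        }
    }

    /// Called by the per-slot timer once fork choice has run for `slot`.
    fn notify_fork_choice_complete(&self, slot: u64) {
        *self.latest_slot.lock().unwrap() = Some(slot);
        self.condvar.notify_all();
    }

    /// Called by block production: wait until fork choice has run for `slot`,
    /// giving up after `timeout`. Returns `true` if fork choice caught up.
    fn wait_for_fork_choice(&self, slot: u64, timeout: Duration) -> bool {
        let guard = self.latest_slot.lock().unwrap();
        let (guard, wait_result) = self
            .condvar
            .wait_timeout_while(guard, timeout, |latest| *latest < Some(slot))
            .unwrap();
        !wait_result.timed_out() && *guard >= Some(slot)
    }
}

fn main() {
    let signal = Arc::new(ForkChoiceSignal::new());
    let timer = Arc::clone(&signal);
    // Simulate the slot timer finishing fork choice shortly after slot start.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        timer.notify_fork_choice_complete(42);
    });
    // Block production blocks until the signal arrives (or the timeout).
    assert!(signal.wait_for_fork_choice(42, Duration::from_millis(500)));
}
```

If the proposer times out waiting, it proceeds anyway and logs a warning, which matches the behaviour of the `--fork-choice-before-proposal-timeout` flag described above.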

@michaelsproul michaelsproul added the work-in-progress PR is a work-in-progress label May 5, 2022
@michaelsproul michaelsproul force-pushed the fork-choice-before-proposal branch from 8a8eb08 to a247b35 Compare May 6, 2022 02:06
@michaelsproul
Member Author

michaelsproul commented May 10, 2022

I've decided we need to move to the fork choice schedule that Prysm uses: always run fork choice at the start of the slot, and then wait for that run to finish (perhaps with a timeout) when proposing a block. As an optimisation we can also run fork choice preemptively in the state advance timer, at 9s into the previous slot.

@michaelsproul
Member Author

michaelsproul commented May 10, 2022

Current TODO:

  • Fix Clippy
  • Make the wait timeout dynamic with a new CLI flag
  • Double-check the full implications for merged networks: forkchoiceUpdated, payload preparation, etc.
  • Test on a merge network

@michaelsproul
Member Author

To establish that the impact of this change on merge networks is minimal, let's consider a few scenarios. We'll assume that the BN needs to propose a block in slot n + 1.

Case 1: Healthy chain, blocks every slot

[n - 1] <- [n] <- [n + 1]
  1. At the start of slot n before the block arrives we run fork choice. We issue one fcU without payload attributes. We skip issuing a second fcU with payload attributes as neither of the trigger conditions is satisfied (we're not in the last 4s of the slot, and the head slot is n - 1, two slots behind the prepare slot n + 1).
  2. The block arrives on time and we run fork choice again. We issue fcU without payload attributes, and then again with payload attributes because the head is now one slot behind the prepare slot. This yields payloadID#1.
  3. In the state advance timer we run fork choice for slot n + 1 at 9s into slot n. We issue fcU again, first without and then with payload attributes. The EL would seemingly be justified in returning a different payload ID here (payloadID#2), which we would then use to overwrite payloadID#1. If so, this is sub-optimal, as it gives the EL less time to prepare a payload. However, this issue (if it exists) also exists in the current flow, where we run fork choice before the state advance.
  4. We run fork choice at the start of slot n + 1. We do not send any payload attributes because we're too early in the slot, and the prepare_slot is now two slots ahead of the head (= n + 2).

Aside from the 2nd payload ID issue, I think this case looks good: we've added just 1 extra fcU relative to what we have currently.
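The fcU decision logic used in this walkthrough can be condensed into a small predicate. This is a hedged sketch (function and constant names are mine, not Lighthouse's) of the two trigger conditions as described in step 1: payload attributes are sent either in the last 4s of the slot, or when the head is the direct parent of the prepare slot.

```rust
// Illustrative only: the two conditions for sending an fcU *with* payload
// attributes, as paraphrased from the Case 1 walkthrough above.
const SECONDS_PER_SLOT: u64 = 12;

fn should_send_payload_attributes(
    seconds_into_slot: u64,
    head_slot: u64,
    prepare_slot: u64,
) -> bool {
    // Condition 1: we're in the last 4s of the slot.
    let in_last_third = seconds_into_slot >= SECONDS_PER_SLOT - 4;
    // Condition 2: the head is exactly one slot behind the prepare slot.
    let head_is_parent_of_prepare = head_slot + 1 == prepare_slot;
    in_last_third || head_is_parent_of_prepare
}

fn main() {
    // Step 1: start of slot n (= 10), head = n - 1, prepare slot = n + 1.
    assert!(!should_send_payload_attributes(0, 9, 11));
    // Step 2: block n arrives, head = n, prepare slot = n + 1.
    assert!(should_send_payload_attributes(4, 10, 11));
    // Step 3: 9s into slot n, i.e. in the last third of the slot.
    assert!(should_send_payload_attributes(9, 9, 11));
}
```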

Case 2: Skip slot at n

[n - 1] <-- _ <-- [n + 1]

Same as Case 1 above, except that we don't issue any fcU for the block arriving (because it doesn't). We'll issue an fcU with payload attributes at 9s into slot n, and then propose a payload based on that.

Case 3: Re-org from n to n - 1

[n - 2] <-- [n - 1] <--  _  <-- [n + 1]
      ^---------------- [n]

This is the case this PR is designed to help with.

  1. At the start of slot n we issue fcU with n - 1 as head. No payload attributes.
  2. Block n arrives and becomes head, we issue fcU with n as head, and we issue a second fcU with payload attributes for a payload building atop n.
  3. In the state advance we run fork choice taking into account the slot n attestations and we re-org to slot n - 1 based on their weight. We issue two new fcU messages for n - 1, including one with payload attributes for building atop n - 1 (because we're later than 8s in the slot). The payload ID cache is keyed by the head block hash, timestamp and other parameters which prevent a collision with the payload ID from step 2.
  4. We issue another fcU at the start of slot n + 1 with n - 1 as head. No payload attributes.
  5. We propose a block atop n - 1 at slot n + 1, using the payload prepared 3+ seconds earlier in step 3.
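The keying property relied on in step 3 can be sketched minimally as follows. The field set and names here are illustrative (the real cache includes more payload attributes): because the head block hash is part of the key, payloads prepared atop n and atop n - 1 occupy distinct entries and cannot collide.

```rust
use std::collections::HashMap;

// Illustrative payload ID cache key: the head block hash distinguishes
// payloads built atop different heads at the same timestamp.
#[derive(Clone, Debug, Hash, PartialEq, Eq)]
struct PayloadCacheKey {
    head_block_hash: [u8; 32],
    timestamp: u64,
    // ...other payload attributes (fee recipient, prev_randao, etc.)
}

fn main() {
    let mut cache: HashMap<PayloadCacheKey, u64> = HashMap::new();
    // Key for building atop block n.
    let key_n = PayloadCacheKey { head_block_hash: [0xaa; 32], timestamp: 100 };
    // Key for building atop block n - 1 after the re-org: same timestamp,
    // different head block hash, therefore a different cache entry.
    let key_n_minus_1 = PayloadCacheKey { head_block_hash: [0xbb; 32], timestamp: 100 };
    cache.insert(key_n.clone(), 1); // payloadID#1 atop n
    cache.insert(key_n_minus_1.clone(), 2); // payload ID atop n - 1
    assert_eq!(cache.get(&key_n), Some(&1));
    assert_eq!(cache.get(&key_n_minus_1), Some(&2));
}
```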

Case 4: Re-org from n - 1 to n

This is the same as Case 3 except that there's no re-org in step 3, so we re-issue fcUs for the same head at slot n. This suffers from the same payload-forgetfulness issue as Case 1, because we get two payload IDs for building atop block n.

Case 5: Late block in slot n

With proposer boost re-org:

[n - 1] <--  _  <-- [n + 1]
      ^---- [n]

Without proposer boost re-org:

[n - 1] <-- [n] <-- [n + 1]
  1. Start of slot n, issue fcU for n - 1 as usual. No payload attributes.
  2. State advance runs before the block arrives: issue fcU for n - 1 again, with payload attributes to build atop n - 1.
  3. Block n arrives and becomes head (regardless of proposer boost). Issue fcU for n with payload attributes to build atop n.
  4. Fork choice runs at the start of slot n + 1 and block n is still head, so we issue another fcU without payload attributes for n. As the proposer we then have a choice between:
    a) Re-org out the block at slot n using proposer boost ([Merged by Bors] - Enable proposer boost re-orging #2860). We fetch the payload ID for the parent block at slot n - 1 that was prepared in step 2.
    b) Don't re-org out the block at slot n. We fetch the payload ID for the parent block n that was prepared in step 3.

Both variants of this case seem fine as well.

@paulhauner paulhauner self-requested a review May 17, 2022 06:18
@michaelsproul michaelsproul added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels May 17, 2022
@michaelsproul
Member Author

In the last week there was one instance of a warning from this branch being tripped on Prater, and it was on the notoriously flaky Singapore node while it was overwhelmed doing a bunch of other stuff:

May 16 14:57:48 prater-sin-bv0 lighthouse-bn[2945737]: May 16 14:57:48.258 WARN Timed out waiting for fork choice before proposal, message: this block may be orphaned, service: beacon
May 16 14:57:55 prater-sin-bv0 lighthouse-bn[2945737]: May 16 14:57:55.170 CRIT Block was broadcast too late root: 0xb84dc63c75611188390bc4d1fd043e7e01a1f998f00ccdd712c88e8adfde0dab, slot: 3017089, delay_ms: 5225, msg: system may be overloaded, block likely to be orphaned

Even so, the timeout was effective, and the block managed to become part of the canonical chain: https://prater.beaconcha.in/block/3017089

@michaelsproul
Member Author

Based on out-of-band discussions with @paulhauner I've just pushed two commits, 06908ad and 9808754, which move the preemptive fork choice run to 500ms before the start of the next slot, rather than 3s before. This has the following advantages:

  • Allows block proposals up to 500ms before the start of the slot. Previously such block proposals would skip waiting for fork choice and log a warning. Now they'll only log a warning if they time out waiting for fork choice (i.e. they're made more than 750ms before their slot).
  • Defensively reduces the time during which fork choice is "time warped" (ahead of the wall-clock slot).

The disadvantages are:

  • Slightly increased complexity in the state advance timer.
  • If fork choice takes longer than 500ms, it delays block proposal more than an equally slow run at -3s would. This seems unlikely, and we should make sure fork choice never takes much longer than 500ms. The timeout from block proposal also caps how late the proposal can be.
  • On merge networks, we may give the EL less time to prepare a payload in case of a re-org.

In commit 06908ad I also fixed an issue that occurs when the state advance runs past the end of the slot in which it started, which would cause it to skip the following slot and wait for the next one. I'm not strongly attached to this change and would be happy to revert the offset to its previous value, which assumes the state advance always completes before the end of the slot.
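The schedule change above is just offset arithmetic over the 12s slot; a tiny sketch (constant and function names are mine) makes the before/after offsets concrete.

```rust
use std::time::Duration;

// Illustrative arithmetic only: a lead time before the *next* slot maps to
// an offset from the start of the *current* slot.
const SLOT_DURATION: Duration = Duration::from_secs(12);

fn run_offset(lead_time_before_next_slot: Duration) -> Duration {
    SLOT_DURATION - lead_time_before_next_slot
}

fn main() {
    // Old schedule: 3s before the next slot = 9s into the current slot.
    assert_eq!(run_offset(Duration::from_secs(3)), Duration::from_secs(9));
    // New schedule: 500ms before the next slot = 11.5s into the current slot.
    assert_eq!(run_offset(Duration::from_millis(500)), Duration::from_millis(11_500));
}
```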

Member

@paulhauner paulhauner left a comment


LGTM! 🚀

@michaelsproul michaelsproul added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels May 20, 2022
@paulhauner
Member

bors r+

bors bot pushed a commit that referenced this pull request May 20, 2022
@bors

bors bot commented May 20, 2022

Build failed (retrying...):

bors bot pushed a commit that referenced this pull request May 20, 2022
@michaelsproul
Member Author

bors r-

gotta fix udeps

@bors

bors bot commented May 20, 2022

Canceled.

@michaelsproul
Member Author

bors r+

bors bot pushed a commit that referenced this pull request May 20, 2022
@bors bors bot changed the title Run fork choice before block proposal [Merged by Bors] - Run fork choice before block proposal May 20, 2022
@bors bors bot closed this May 20, 2022
@michaelsproul michaelsproul deleted the fork-choice-before-proposal branch May 20, 2022 07:03
bors bot pushed a commit that referenced this pull request May 20, 2022
*This PR was adapted from @pawanjay176's work in #3197.*

## Issue Addressed

Fixes a regression in #3168

## Proposed Changes

#3168 added calls to `fork_choice` in the `BeaconChain::per_slot_task` function. This leads to a panic because `per_slot_task` is called from an async context, and fork choice in turn calls `block_on`.

This PR changes the timer to call the `per_slot_task` function in a blocking thread.

Co-authored-by: Pawan Dhananjay <[email protected]>
bors bot pushed a commit that referenced this pull request May 25, 2022
## Issue Addressed

Fixes an issue that @paulhauner found with the v2.3.0 release candidate whereby the fork choice runs introduced by #3168 tripped over each other during sync:

```
May 24 23:06:40.542 WARN Error signalling fork choice waiter     slot: 3884129, error: ForkChoiceSignalOutOfOrder { current: Slot(3884131), latest: Slot(3884129) }, service: beacon
```

This can occur because fork choice is called from the state advance _and_ the per-slot task. When one of these runs takes a long time it can end up finishing after a run from a later slot, tripping the error above. The problem is resolved by not running either of these fork choice calls during sync.

Additionally, these parallel fork choice runs were causing issues in the database:

```
May 24 07:49:05.098 WARN Found a chain that should already have been pruned, head_slot: 92925, head_block_root: 0xa76c7bf1b98e54ed4b0d8686efcfdf853484e6c2a4c67e91cbf19e5ad1f96b17, service: beacon
May 24 07:49:05.101 WARN Database migration failed               error: HotColdDBError(FreezeSlotError { current_split_slot: Slot(92608), proposed_split_slot: Slot(92576) }), service: beacon
```

In this case, two fork choice calls triggering the finalization processing were being processed out of order due to differences in their processing time, causing the background migrator to try to advance finalization _backwards_ :flushed:. Removing the parallel fork choice runs from sync effectively addresses the issue, because these runs are most likely to have different finalized checkpoints (because of the speed at which fork choice advances during sync). In theory it's still possible to process updates out of order if any other fork choice runs end up completing out of order, but this should be much less common. Fixing out of order fork choice runs in general is difficult as it requires architectural changes like serialising fork choice updates through a single thread, or locking fork choice along with the head when it is mutated (#3175).

## Proposed Changes

* Don't run per-slot fork choice during sync (if head is older than 4 slots)
* Don't run state-advance fork choice during sync (if head is older than 4 slots)
* Check for monotonic finalization updates in the background migrator. This is a good defensive check to have, and I'm not sure why we didn't have it before (we may have had it and wrongly removed it).
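The monotonic-finalization check proposed in the last bullet could look roughly like this. The types and names below are invented for illustration (the real migrator surfaces errors like `HotColdDBError::FreezeSlotError`): the migrator should refuse any update that would move the split slot backwards.

```rust
// Illustrative defensive check: finalization (the hot/cold split slot) must
// only ever advance, never move backwards.
#[derive(Debug, PartialEq)]
enum MigrationError {
    NonMonotonicFinalization { current: u64, proposed: u64 },
}

fn check_monotonic(
    current_split_slot: u64,
    proposed_split_slot: u64,
) -> Result<(), MigrationError> {
    if proposed_split_slot < current_split_slot {
        Err(MigrationError::NonMonotonicFinalization {
            current: current_split_slot,
            proposed: proposed_split_slot,
        })
    } else {
        Ok(())
    }
}

fn main() {
    // The out-of-order case from the logs: split at 92608, proposal at 92576.
    assert!(check_monotonic(92608, 92576).is_err());
    // A normal forward advance is accepted.
    assert!(check_monotonic(92608, 92640).is_ok());
}
```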