
op-node: Restore previous unsafe chain when invalid span batch #8925

Conversation

@pcw109550 (Contributor) commented Jan 10, 2024

Description

Preliminary: Unsafe Head Reorg while Span Batch Derivation

An unsafe head reorg occurs when the derivation result from a batch differs from the previously known unsafe chain. If every L2 attribute included in the derived result is valid, those blocks become the new unsafe/safe chain.

If an invalid L2 attribute is included in the derived results, the safe head will not advance. On the other hand, the unsafe head may advance into the middle of the span batch. In short, we have the following invariants:

  • Safe head updates are atomic: the safe head only advances once every L2 attribute from the batch is valid.
  • Unsafe head updates are not atomic: the unsafe head can advance even if the batch contains invalid L2 attributes (see the sketch below).
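A rough pseudocode sketch of these two invariants (Go-flavored; the types and methods here are placeholders, not the actual op-node code):

```go
type Attributes struct{}
type Block struct{ Number uint64 }

type chainState struct {
	unsafeHead Block
	safeHead   Block
}

// buildAndInsert stands in for building a block from the attributes and
// inserting it into the execution engine; ok is false when the block is invalid.
func (c *chainState) buildAndInsert(a Attributes) (Block, bool) {
	return Block{Number: c.unsafeHead.Number + 1}, true
}

func (c *chainState) processSpanBatch(batch []Attributes) {
	for _, a := range batch {
		block, ok := c.buildAndInsert(a)
		if !ok {
			return // invalid attribute: drop the rest of the batch, safe head untouched
		}
		c.unsafeHead = block // the unsafe head can end up in the middle of the batch
	}
	c.safeHead = c.unsafeHead // atomic: only after every attribute proved valid
}
```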

I have written some examples to illustrate the behavior.

Example 1: Partially Invalid Span Batch

[diagram: Example 1, initial state]

When deriving B2 as a safe block, which is a new block, the engine queue calls fork choice update to move the unsafe head to B2. After that, B3 turns out to be an invalid block, so the leftover attributes are dropped. The final chain state will take the form:

[diagram: Example 1, final state after dropping the rest of the span batch]

As we can see, we lost A2, A3, A4 and A5.

Example 2: Totally Invalid Span Batch / Totally Invalid Singular Batch

[diagram: Example 2, initial state]

When deriving B1 as a safe block, which is a new block, the engine queue immediately drops the entire batch. In this case, the unsafe head is never altered, staying at A5. The final state will take the form:

[diagram: Example 2, final state with the unsafe head unchanged at A5]

The same final state is reached when digesting an invalid singular batch. Every singular batch derivation results in a single attribute, so there is no partially invalid state. Example case:

[diagram: invalid singular batch example]

Issue: Losing Unsafe Blocks Which May Eventually Be Canonical Hurts!

During happy-path sync, the distance between the unsafe head and the safe head can be very large. In this case, if a partially invalid span batch (like Example 1) is consumed by the rollup node, we lose all of the synced unsafe blocks.

Let's focus on Example 1. We lost A2, A3, A4 and A5 because of an invalid span batch. Since the digested span batch was invalid, A2, A3, A4 and A5 are more likely than B2 to become the canonical safe chain.

Proposed Solution: Restore the Previous Unsafe Head on an Invalid Span Batch

I introduce new fields backupUnsafeHead and needFCUCallForBackupUnsafeReorg to EngineController (which is embedded in the engine queue).

If the currently known unsafe head's block number is greater than or equal to that of an attribute which is valid but differs from the known unsafe chain, backupUnsafeHead is initialized to the currently known unsafe head. It serves as a backup to be used when an invalid L2 block is detected. When an invalid L2 block is detected, needFCUCallForBackupUnsafeReorg is set to true.

[diagram: backupUnsafeHead pointing to the previous unsafe head]
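For illustration, the new state could look roughly like this. The field names mirror the PR text, but BlockRef and the helper methods are simplified stand-ins (the actual code uses eth.L2BlockRef inside the engine controller); the fragments in this description together form one illustrative file, with the package clause omitted:

```go
// BlockRef is a stand-in for eth.L2BlockRef.
type BlockRef struct {
	Hash   string
	Number uint64
}

type EngineController struct {
	unsafeHead    BlockRef
	safeHead      BlockRef
	finalizedHead BlockRef

	// Backup of the unsafe head, captured when a span batch starts reorging the unsafe chain.
	backupUnsafeHead BlockRef
	// Set once an invalid attribute is found and the backup should be restored via FCU.
	needFCUCallForBackupUnsafeReorg bool
}

// Called when a derived attribute is valid but differs from the block already
// known at that height, i.e. an unsafe reorg is about to begin.
func (e *EngineController) maybeBackupUnsafe(newBlock BlockRef) {
	if e.backupUnsafeHead == (BlockRef{}) && e.unsafeHead.Number >= newBlock.Number {
		e.backupUnsafeHead = e.unsafeHead // remember the chain we are reorging away from
	}
}

// Called when a later attribute in the same span batch turns out to be invalid.
func (e *EngineController) triggerBackupUnsafeReorg() {
	if e.backupUnsafeHead != (BlockRef{}) {
		e.needFCUCallForBackupUnsafeReorg = true // restore on the next engine step
	}
}
```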

Added a TryBackupUnsafeReorg method, similar to the InsertUnsafePayload method. It calls fork choice update (abbreviated FCU below) without payloadAttributes to restore unsafeHead to backupUnsafe. FCU can return these responses: Spec.

  • If err is returned and it is not an InputError, it may be a transient failure (network error / server fault), so retry.
  • If err is returned and it is an InputError, do not retry and forget about backupUnsafe.
  • If err is not returned, but payloadStatus is not VALID, forget about backupUnsafe.

Similar to https://github.com/ethereum-optimism/optimism/blob/develop/specs/derivation.md#l1-sync-payload-attributes-processing, but there is no reset.

If FCU returns a VALID payloadStatus, the execution engine has successfully restored (reorged) the unsafe head to backupUnsafe. The rollup node's state is only updated when FCU returns VALID. In all other cases, forget about backupUnsafe.
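A minimal sketch of this decision flow, continuing the types from the sketch above; engineAPI, ForkchoiceResult, and InputError are simplified stand-ins for the real engine-API client and error types, and the real method does not necessarily have this exact signature:

```go
import (
	"context"
	"errors"
	"fmt"
)

type ForkchoiceResult struct {
	PayloadStatus string // "VALID", "SYNCING", "INVALID", ...
}

// engineAPI is a stand-in for the execution-engine client.
type engineAPI interface {
	ForkchoiceUpdate(ctx context.Context, head, safe, finalized BlockRef) (*ForkchoiceResult, error)
}

// InputError stands in for an engine-API "invalid input" error; such errors must not be retried.
type InputError struct{ Msg string }

func (e InputError) Error() string { return e.Msg }

// TryBackupUnsafeReorg issues one FCU (without payload attributes) targeting
// backupUnsafeHead. It returns true when it performed a network call, so the
// caller knows the engine-queue step was consumed.
func (e *EngineController) TryBackupUnsafeReorg(ctx context.Context, eng engineAPI) (bool, error) {
	if !e.needFCUCallForBackupUnsafeReorg {
		return false, nil // nothing to restore
	}
	result, err := eng.ForkchoiceUpdate(ctx, e.backupUnsafeHead, e.safeHead, e.finalizedHead)
	if err != nil {
		var inputErr InputError
		if errors.As(err, &inputErr) {
			// Bad input: forget the backup and do not retry.
			e.backupUnsafeHead, e.needFCUCallForBackupUnsafeReorg = BlockRef{}, false
			return true, nil
		}
		// Transient failure (network error / engine fault): keep the state and retry next step.
		return true, fmt.Errorf("backup-unsafe FCU failed: %w", err)
	}
	if result.PayloadStatus == "VALID" {
		// The engine reorged back; adopt the backup as the unsafe head again.
		e.unsafeHead = e.backupUnsafeHead
	}
	// Whether VALID or not, the backup is no longer needed.
	e.backupUnsafeHead, e.needFCUCallForBackupUnsafeReorg = BlockRef{}, false
	return true, nil
}
```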

When TryBackupUnsafeReorg performs a network call, it fully consumes a single step of the engine queue, following the same design pattern as the TryUpdateEngine method.
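Roughly, the step integration could look like this, continuing the sketch above (the EngineQueue shape and the remaining step work are placeholders):

```go
type EngineQueue struct {
	ec  *EngineController
	eng engineAPI
}

func (eq *EngineQueue) Step(ctx context.Context) error {
	// A backup-restore FCU is a network call, so it uses up the whole step,
	// mirroring how TryUpdateEngine is handled.
	if fcuCalled, err := eq.ec.TryBackupUnsafeReorg(ctx, eq.eng); fcuCalled {
		return err
	}
	// ... otherwise continue with the usual step work:
	// TryUpdateEngine, attribute processing, etc.
	return nil
}
```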

One thing I would like to discuss: if the execution engine does not return a VALID payloadStatus, is it okay to simply progress the engine queue? Is there anything else to do, like cleanup? Maybe calling FCU with the original unsafe head (not backupUnsafe) is needed.

Tests

Added three e2e tests.

TestBackupUnsafe

Same situation as in Example 1. The unsafe head is restored back to A5, instead of staying at B2.

TestBackupUnsafeReorgForkChoiceInputError

Same situation as in Example 1, but the execution engine is mocked to return an InputError when FCU is called. In this case, there is no retry and no restoration is performed.

TestBackupUnsafeReorgForkChoiceNotInputError

Same situation as in Example 1, but the execution engine is mocked to return an error that is not an InputError when FCU is called. In this case, the call is retried and restoration eventually succeeds, reorging the unsafe head back to A5.

Additional context

This is not a consensus change; it only requires a rollup node implementation update.

Partially invalid batches are very unlikely to be posted. It is very hard to produce one unless there is a derivation/batching bug.

@pcw109550 marked this pull request as ready for review on January 10, 2024
coderabbitai bot commented Jan 10, 2024

Walkthrough

This update enhances error handling in L2 RPC failure simulation, introduces a new method for unsafe L2 block backup, and improves reorganization logic in the engine. It enables specifying errors in L2 RPC failure simulations, adds backup functionality for unsafe L2 blocks, and integrates new logic for managing backup states and reorganizations within the engine controller and queue, enhancing system resilience and debugging capabilities.

Changes

  • .../actions/l2_engine.go, .../actions/l2_engine_test.go: Updated ActL2RPCFail to accept an error parameter and modified tests to check for specific errors.
  • .../actions/l2_verifier.go: Added L2BackupUnsafe method for unsafe L2 block backup.
  • .../actions/sync_test.go: Introduced new tests for unsafe block handling and backup restoration.
  • .../rollup/derive/engine_controller.go, .../rollup/derive/engine_queue.go: Added fields and methods for managing backup unsafe heads and reorg logic, including tracking, setting, and attempting backup unsafe reorgs. Integrated these functionalities into the engine's state management and queue logic.


@tynes (Contributor) commented Jan 10, 2024

These images and some of the text would be great to have added to the specs

@trianglesphere (Contributor) left a review comment

I think that the flow is as follows:

  1. We force a span-batch-middle for the first time. This is where we should save the old unsafe head
  2. We potentially insert several more span batches
  3. The span batch fails. Then we should re-org back to the old unsafe head.

A couple pointers as this code is rapidly changing. We should put as much engine manipulation inside engine controller as possible and have a small set of external methods. It should limit the amount of internal state it is exposing.

I'm going to be putting up a couple more PRs to keep cleaning it up, but if we need to change the interface between the engine queue & the engine controller, we should.

@pcw109550 force-pushed the tip/restore-unsafe-chain-while-flip-flop-reorg branch from 9709ec0 to bd08911 on January 15, 2024
@pcw109550 (Contributor, Author)

> I think that the flow is as follows:
>
>   1. We force a span-batch-middle for the first time. This is where we should save the old unsafe head
>   2. We potentially insert several more span batches
>   3. The span batch fails. Then we should re-org back to the old unsafe head.
>
> A couple pointers as this code is rapidly changing. We should put as much engine manipulation inside engine controller as possible and have a small set of external methods. It should limit the amount of internal state it is exposing.
>
> I'm going to be putting up a couple more PRs to keep cleaning it up, but if we need to change the interface between the engine queue & the engine controller, we should.

For step 2, we potentially insert several more L2 attributes from a single span batch, not several more span batches. Other than that, the flow is right.

I have noticed that other engine queue related changes are in progress, like #8966 and #8968. Will rebase when these are merged. I agree that we should limit internal state exposure of engine controller.

@pcw109550 (Contributor, Author)

> These images and some of the text would be great to have added to the specs

The current L1-sync: payload attributes processing spec does not mention span batches. It is intended to be span-batch agnostic.

The diagrams and the text explain the current engine queue implementation of L1-sync, not the spec. So these assets may be better added to the tech docs, not the specs. What do you think? Also, are there any public tech docs that explain the current implementation?

@trianglesphere (Contributor)

@pcw109550 more merge conflicts, but I think they're getting close to the end

@pcw109550 force-pushed the tip/restore-unsafe-chain-while-flip-flop-reorg branch 2 times, most recently from cadb3d4 to 91427e7 on January 21, 2024
@pcw109550 (Contributor, Author)

@trianglesphere rebased, since the engine-queue-altering changes #8966 and #8968 are all merged. PTAL

@axelKingsley (Contributor) left a review comment

This was a very fun review to look over, thank you very much :). I really appreciated your explanations in the description, and diagrams, they were very helpful.

The concept of this PR, backing up the unsafe head to avoid getting stuck in a span-batch, makes sense. The implementation looks sound too, it becomes one of the functionalities in Step. I am not sure if it should be the first functionality checked, but it makes sense that if there's a reorg to be done, it'd probably take precedence over other work.

I left a few comments around log messages and tests, and then one larger one around the patterning of the TryBackupUnsafeReorg function itself.

@pcw109550 force-pushed the tip/restore-unsafe-chain-while-flip-flop-reorg branch 2 times, most recently from 641f7d4 to c4778d8 on January 30, 2024

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Feb 14, 2024
@pcw109550 (Contributor, Author)

@trianglesphere @axelKingsley @tynes May I please ask for review?

@github-actions github-actions bot removed the Stale label Feb 15, 2024
@trianglesphere (Contributor) left a review comment

I've got a couple small comments & then we can get this PR over the line

@pcw109550 force-pushed the tip/restore-unsafe-chain-while-flip-flop-reorg branch from c4778d8 to d86a8d4 on March 1, 2024
@pcw109550 requested a review from trianglesphere on March 1, 2024
@trianglesphere (Contributor) left a review comment

tyvm

@trianglesphere enabled auto-merge on March 7, 2024
@trianglesphere added this pull request to the merge queue on March 7, 2024
Merged via the queue into ethereum-optimism:develop with commit 9cccd6b on March 7, 2024
65 checks passed
@trianglesphere deleted the tip/restore-unsafe-chain-while-flip-flop-reorg branch on March 7, 2024