
op-node: Restore previous unsafe chain when invalid span batch #8925

Conversation

@pcw109550 (Contributor) commented Jan 10, 2024

Description

Preliminary: Unsafe Head Reorg while Span Batch Derivation

An unsafe head reorg occurs when the derivation result from a batch differs from the previously known unsafe chain. If every L2 attribute included in the derived result is valid, those blocks become the new unsafe/safe chain.

If an invalid L2 attribute is included in the derived results, the safe head will not advance. On the other hand, the unsafe head may advance into the middle of the span batch. In short, we have the following invariants:

  • Safe head updates are atomic: the safe head only advances once every L2 attribute from the batch is valid.
  • Unsafe head updates are not atomic: the unsafe head can advance even if the batch contains invalid L2 attributes (see the sketch below).
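A rough pseudocode sketch of these two invariants (Go-flavored; the types and methods here are placeholders, not the actual op-node code):

```go
type Attributes struct{}
type Block struct{ Number uint64 }

type chainState struct {
	unsafeHead Block
	safeHead   Block
}

// buildAndInsert stands in for building a block from the attributes and
// inserting it into the execution engine; ok is false when the block is invalid.
func (c *chainState) buildAndInsert(a Attributes) (Block, bool) {
	return Block{Number: c.unsafeHead.Number + 1}, true
}

func (c *chainState) processSpanBatch(batch []Attributes) {
	for _, a := range batch {
		block, ok := c.buildAndInsert(a)
		if !ok {
			return // invalid attribute: drop the rest of the batch, safe head untouched
		}
		c.unsafeHead = block // the unsafe head can end up in the middle of the batch
	}
	c.safeHead = c.unsafeHead // atomic: only after every attribute proved valid
}
```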

I have written some examples to illustrate the behavior.

Example 1: Partially Invalid Span Batch

[diagram: Example 1, initial state]

When deriving B2 as a safe block, which is a new block, the engine queue calls fork choice update to move the unsafe head to B2. After that, B3 turns out to be an invalid block, so the leftover attributes are dropped. The final chain state will take the form:

[diagram: Example 1, final state after dropping the rest of the span batch]

As we can see, we lost A2, A3, A4 and A5.

Example 2: Totally Invalid Span Batch / Totally Invalid Singular Batch

[diagram: Example 2, initial state]

When deriving B1 as a safe block, which is a new block, the engine queue immediately drops the entire batch. In this case, the unsafe head is never altered, staying at A5. The final state will take the form:

[diagram: Example 2, final state with the unsafe head unchanged at A5]

The same final state is reached when digesting an invalid singular batch. Every singular batch derivation results in a single attribute, so there is no partially invalid state. Example case:

[diagram: invalid singular batch example]

Issue: Losing Unsafe Blocks Which May Eventually Be Canonical Hurts!

During happy-path sync, the distance between the unsafe head and the safe head can be very large. In this case, if a partially invalid span batch (like Example 1) is consumed by the rollup node, we lose all of the synced unsafe blocks.

Let's focus on Example 1. We lost A2, A3, A4 and A5 because of an invalid span batch. Since the digested span batch was invalid, A2, A3, A4 and A5 are more likely than B2 to become the canonical safe chain.

Proposed Solution: Restore the Previous Unsafe Head on an Invalid Span Batch

I introduce new fields backupUnsafeHead and needFCUCallForBackupUnsafeReorg to EngineController (which is embedded in the engine queue).

If the currently known unsafe head's block number is greater than or equal to that of an attribute which is valid but differs from the known unsafe chain, backupUnsafeHead is initialized to the currently known unsafe head. It serves as a backup to be used when an invalid L2 block is detected. When an invalid L2 block is detected, needFCUCallForBackupUnsafeReorg is set to true.

[diagram: backupUnsafeHead pointing to the previous unsafe head]
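For illustration, the new state could look roughly like this. The field names mirror the PR text, but BlockRef and the helper methods are simplified stand-ins (the actual code uses eth.L2BlockRef inside the engine controller); the fragments in this description together form one illustrative file, with the package clause omitted:

```go
// BlockRef is a stand-in for eth.L2BlockRef.
type BlockRef struct {
	Hash   string
	Number uint64
}

type EngineController struct {
	unsafeHead    BlockRef
	safeHead      BlockRef
	finalizedHead BlockRef

	// Backup of the unsafe head, captured when a span batch starts reorging the unsafe chain.
	backupUnsafeHead BlockRef
	// Set once an invalid attribute is found and the backup should be restored via FCU.
	needFCUCallForBackupUnsafeReorg bool
}

// Called when a derived attribute is valid but differs from the block already
// known at that height, i.e. an unsafe reorg is about to begin.
func (e *EngineController) maybeBackupUnsafe(newBlock BlockRef) {
	if e.backupUnsafeHead == (BlockRef{}) && e.unsafeHead.Number >= newBlock.Number {
		e.backupUnsafeHead = e.unsafeHead // remember the chain we are reorging away from
	}
}

// Called when a later attribute in the same span batch turns out to be invalid.
func (e *EngineController) triggerBackupUnsafeReorg() {
	if e.backupUnsafeHead != (BlockRef{}) {
		e.needFCUCallForBackupUnsafeReorg = true // restore on the next engine step
	}
}
```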

Added a TryBackupUnsafeReorg method, similar to the InsertUnsafePayload method. It calls fork choice update (abbreviated FCU below) without payloadAttributes to restore unsafeHead to backupUnsafe. FCU can return these responses: Spec.

  • If err is returned and it is not an InputError, it may be a transient failure (network error / server fault), so retry.
  • If err is returned and it is an InputError, do not retry and forget about backupUnsafe.
  • If err is not returned, but payloadStatus is not VALID, forget about backupUnsafe.

Similar to https://github.com/ethereum-optimism/optimism/blob/develop/specs/derivation.md#l1-sync-payload-attributes-processing, but there is no reset.

If FCU returns a VALID payloadStatus, the execution engine has successfully restored (reorged) the unsafe head to backupUnsafe. The rollup node's state is only updated when FCU returns VALID. In all other cases, forget about backupUnsafe.
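A minimal sketch of this decision flow, continuing the types from the sketch above; engineAPI, ForkchoiceResult, and InputError are simplified stand-ins for the real engine-API client and error types, and the real method does not necessarily have this exact signature:

```go
import (
	"context"
	"errors"
	"fmt"
)

type ForkchoiceResult struct {
	PayloadStatus string // "VALID", "SYNCING", "INVALID", ...
}

// engineAPI is a stand-in for the execution-engine client.
type engineAPI interface {
	ForkchoiceUpdate(ctx context.Context, head, safe, finalized BlockRef) (*ForkchoiceResult, error)
}

// InputError stands in for an engine-API "invalid input" error; such errors must not be retried.
type InputError struct{ Msg string }

func (e InputError) Error() string { return e.Msg }

// TryBackupUnsafeReorg issues one FCU (without payload attributes) targeting
// backupUnsafeHead. It returns true when it performed a network call, so the
// caller knows the engine-queue step was consumed.
func (e *EngineController) TryBackupUnsafeReorg(ctx context.Context, eng engineAPI) (bool, error) {
	if !e.needFCUCallForBackupUnsafeReorg {
		return false, nil // nothing to restore
	}
	result, err := eng.ForkchoiceUpdate(ctx, e.backupUnsafeHead, e.safeHead, e.finalizedHead)
	if err != nil {
		var inputErr InputError
		if errors.As(err, &inputErr) {
			// Bad input: forget the backup and do not retry.
			e.backupUnsafeHead, e.needFCUCallForBackupUnsafeReorg = BlockRef{}, false
			return true, nil
		}
		// Transient failure (network error / engine fault): keep the state and retry next step.
		return true, fmt.Errorf("backup-unsafe FCU failed: %w", err)
	}
	if result.PayloadStatus == "VALID" {
		// The engine reorged back; adopt the backup as the unsafe head again.
		e.unsafeHead = e.backupUnsafeHead
	}
	// Whether VALID or not, the backup is no longer needed.
	e.backupUnsafeHead, e.needFCUCallForBackupUnsafeReorg = BlockRef{}, false
	return true, nil
}
```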

When TryBackupUnsafeReorg performs a network call, it fully consumes a single step of the engine queue, following the same design pattern as the TryUpdateEngine method.
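Roughly, the step integration could look like this, continuing the sketch above (the EngineQueue shape and the remaining step work are placeholders):

```go
type EngineQueue struct {
	ec  *EngineController
	eng engineAPI
}

func (eq *EngineQueue) Step(ctx context.Context) error {
	// A backup-restore FCU is a network call, so it uses up the whole step,
	// mirroring how TryUpdateEngine is handled.
	if fcuCalled, err := eq.ec.TryBackupUnsafeReorg(ctx, eq.eng); fcuCalled {
		return err
	}
	// ... otherwise continue with the usual step work:
	// TryUpdateEngine, attribute processing, etc.
	return nil
}
```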

One thing I would like to discuss: if the execution engine does not return a VALID payloadStatus, is it okay to simply progress the engine queue? Is there anything else to do, like cleanup? Maybe calling FCU with the original unsafe head (not backupUnsafe) is needed.

Tests

Added three e2e tests.

TestBackupUnsafe

Same situation as in Example 1. The unsafe head is restored back to A5, instead of staying at B2.

TestBackupUnsafeReorgForkChoiceInputError

Same situation as in Example 1, but the execution engine is mocked to return an InputError when FCU is called. In this case, there is no retry and no restoration is performed.

TestBackupUnsafeReorgForkChoiceNotInputError

Same situation as in Example 1, but the execution engine is mocked to return an error that is not an InputError when FCU is called. In this case, the call is retried and restoration eventually succeeds, reorging the unsafe head back to A5.

Additional context

This is not a consensus change; it only requires a rollup node implementation update.

Partially invalid batches are very unlikely to be posted. It is very hard to produce one unless there is a derivation/batching bug.

@pcw109550 marked this pull request as ready for review on January 10, 2024
coderabbitai bot commented Jan 10, 2024

Walkthrough

This update enhances error handling in L2 RPC failure simulation, introduces a new method for unsafe L2 block backup, and improves reorganization logic in the engine. It enables specifying errors in L2 RPC failure simulations, adds backup functionality for unsafe L2 blocks, and integrates new logic for managing backup states and reorganizations within the engine controller and queue, enhancing system resilience and debugging capabilities.

Changes

  • .../actions/l2_engine.go, .../actions/l2_engine_test.go: Updated ActL2RPCFail to accept an error parameter and modified tests to check for specific errors.
  • .../actions/l2_verifier.go: Added L2BackupUnsafe method for unsafe L2 block backup.
  • .../actions/sync_test.go: Introduced new tests for unsafe block handling and backup restoration.
  • .../rollup/derive/engine_controller.go, .../rollup/derive/engine_queue.go: Added fields and methods for managing backup unsafe heads and reorg logic, including tracking, setting, and attempting backup unsafe reorgs. Integrated these functionalities into the engine's state management and queue logic.


@tynes (Contributor) commented Jan 10, 2024

These images and some of the text would be great to have added to the specs

@trianglesphere (Contributor) left a review comment

I think that the flow is as follows:

  1. We force a span-batch-middle for the first time. This is where we should save the old unsafe head
  2. We potentially insert several more span batches
  3. The span batch fails. Then we should re-org back to the old unsafe head.

A couple pointers as this code is rapidly changing. We should put as much engine manipulation inside engine controller as possible and have a small set of external methods. It should limit the amount of internal state it is exposing.

I'm going to be putting up a couple more PRs to keep cleaning it up, but if we need to change the interface between the engine queue & the engine controller, we should.

@pcw109550 force-pushed the tip/restore-unsafe-chain-while-flip-flop-reorg branch from 9709ec0 to bd08911 on January 15, 2024
@pcw109550 (Contributor, Author)

> I think that the flow is as follows:
>
>   1. We force a span-batch-middle for the first time. This is where we should save the old unsafe head
>   2. We potentially insert several more span batches
>   3. The span batch fails. Then we should re-org back to the old unsafe head.
>
> A couple pointers as this code is rapidly changing. We should put as much engine manipulation inside engine controller as possible and have a small set of external methods. It should limit the amount of internal state it is exposing.
>
> I'm going to be putting up a couple more PRs to keep cleaning it up, but if we need to change the interface between the engine queue & the engine controller, we should.

For step 2, we potentially insert several more L2 attributes from a single span batch, not several more span batches. Other than that, the flow is right.

I have noticed that other engine queue related changes are in progress, like #8966 and #8968. Will rebase when these are merged. I agree that we should limit internal state exposure of engine controller.

@pcw109550 (Contributor, Author)

> These images and some of the text would be great to have added to the specs

The current L1-sync: payload attributes processing spec does not mention span batches. It is intended to be span-batch agnostic.

The diagrams and the text explain the current engine queue implementation of L1-sync, not the spec. So these assets may be better added to the tech docs, not the specs. What do you think? Also, are there any public tech docs that explain the current implementation?

@trianglesphere (Contributor)

@pcw109550 more merge conflicts, but I think they're getting close to the end

@pcw109550 force-pushed the tip/restore-unsafe-chain-while-flip-flop-reorg branch 2 times, most recently from cadb3d4 to 91427e7 on January 21, 2024
@pcw109550 (Contributor, Author)

@trianglesphere rebased, since the engine-queue-altering changes #8966 and #8968 are all merged. PTAL

@axelKingsley (Contributor) left a review comment

This was a very fun review to look over, thank you very much :). I really appreciated your explanations in the description, and diagrams, they were very helpful.

The concept of this PR, backing up the unsafe head to avoid getting stuck in a span-batch, makes sense. The implementation looks sound too, it becomes one of the functionalities in Step. I am not sure if it should be the first functionality checked, but it makes sense that if there's a reorg to be done, it'd probably take precedence over other work.

I left a few comments around log messages and tests, and then one larger one around the patterning of the TryBackupUnsafeReorg function itself.

@pcw109550 force-pushed the tip/restore-unsafe-chain-while-flip-flop-reorg branch 2 times, most recently from 641f7d4 to c4778d8 on January 30, 2024

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Feb 14, 2024
@pcw109550 (Contributor, Author)

@trianglesphere @axelKingsley @tynes May I please ask for review?

@github-actions github-actions bot removed the Stale label Feb 15, 2024
@trianglesphere (Contributor) left a review comment

I've got a couple small comments & then we can get this PR over the line

@pcw109550 force-pushed the tip/restore-unsafe-chain-while-flip-flop-reorg branch from c4778d8 to d86a8d4 on March 1, 2024
@pcw109550 requested a review from trianglesphere on March 1, 2024
@trianglesphere (Contributor) left a review comment

tyvm

@trianglesphere enabled auto-merge on March 7, 2024
@trianglesphere added this pull request to the merge queue on March 7, 2024
Merged via the queue into ethereum-optimism:develop with commit 9cccd6b on March 7, 2024
65 checks passed
@trianglesphere deleted the tip/restore-unsafe-chain-while-flip-flop-reorg branch on March 7, 2024