op-node: Restore previous unsafe chain when invalid span batch #8925
Conversation
Walkthrough: This update enhances error handling in L2 RPC failure simulation, introduces a new method for unsafe L2 block backup, and improves reorganization logic in the engine. It enables specifying errors in L2 RPC failure simulations, adds backup functionality for unsafe L2 blocks, and integrates new logic for managing backup states and reorganizations within the engine controller and queue, enhancing system resilience and debugging capabilities.
These images and some of the text would be great to have added to the specs.
I think that the flow is as follows:
- We force a span-batch-middle for the first time. This is where we should save the old unsafe head.
- We potentially insert several more span batches.
- The span batch fails. Then we should re-org back to the old unsafe head.

A couple of pointers, as this code is rapidly changing: we should put as much engine manipulation inside the engine controller as possible and have a small set of external methods. It should limit the amount of internal state it exposes.
I'm going to be putting up a couple more PRs to keep cleaning it up, but if we need to change the interface between the engine queue & the engine controller, we should.
(force-pushed from 9709ec0 to bd08911)
For step 2, we potentially insert several more L2 attributes from a single span batch, not several more span batches. Other than that, the flow is right. I have noticed that other engine-queue-related changes are in progress, like #8966 and #8968; I will rebase when these are merged. I agree that we should limit the internal state exposure of the engine controller.
The current L1-sync: payload attributes processing spec does not mention span batches; it is intended to be span-batch agnostic. The diagram and the text explain the current engine queue implementation of L1 sync, not the spec, so these assets may belong in the tech docs rather than the specs. What do you think? Also, are there any public tech docs which explain the current implementation?
@pcw109550 more merge conflicts, but I think they're getting close to the end.
(force-pushed from cadb3d4 to 91427e7)
@trianglesphere rebased, since the engine-queue-altering changes #8966 and #8968 are all merged. PTAL.
This was a very fun review to look over, thank you very much :). I really appreciated your explanations in the description, and the diagrams; they were very helpful.
The concept of this PR, backing up the unsafe head to avoid getting stuck in a span batch, makes sense. The implementation looks sound too; it becomes one of the functionalities in the engine queue's Step. I am not sure if it should be the first functionality checked, but it makes sense that if there's a reorg to be done, it'd probably take precedence over other work.
I left a few comments around log messages and tests, and then one larger one around the patterning of the TryBackupUnsafeReorg function itself.
(force-pushed from 641f7d4 to c4778d8)
This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.
@trianglesphere @axelKingsley @tynes May I please ask for review?
I've got a couple small comments & then we can get this PR over the line
Tests are run concurrently, so accessing a shared global object is problematic.
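A minimal Go sketch of the pattern this comment points at, with hypothetical names (testCfg, newTestCfg are not the op-e2e API): each test constructs its own config instead of mutating a package-level object, so t.Parallel() tests cannot race.

```go
package example

import "testing"

// testCfg is a hypothetical per-test configuration object. The point is that
// each test owns a private instance rather than mutating a shared
// package-level variable, which would race once tests run in parallel.
type testCfg struct {
	mockFCUError error
}

// newTestCfg returns a fresh config per call, so no two tests ever share
// mutable state.
func newTestCfg() *testCfg {
	return &testCfg{}
}

func TestVariantA(t *testing.T) {
	t.Parallel()
	cfg := newTestCfg()
	cfg.mockFCUError = nil // mutation stays local to this test
	_ = cfg
}

func TestVariantB(t *testing.T) {
	t.Parallel()
	cfg := newTestCfg() // independent copy; no cross-test interference
	_ = cfg
}
```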
(force-pushed from c4778d8 to d86a8d4)
tyvm
9cccd6b
Description
Preliminary: Unsafe Head Reorg while Span Batch Derivation
An unsafe head reorg occurs when the derivation result from a batch differs from the previously known unsafe chain. If every L2 attribute included in the batch's derived result is valid, the derived chain becomes the new unsafe/safe chain.
If an invalid L2 attribute is included in the derived results, the safe head will not advance. On the other hand, the unsafe head may advance to the middle of a span batch. In short, we have these invariants: the safe head only advances past a fully valid batch, while the unsafe head may advance attribute by attribute, even partway into a span batch.
I have written some examples to illustrate the behavior.
Example 1: Partially Invalid Span Batch
When deriving B2 safe, which is a new block, the engine queue calls fork choice update to move the unsafe head to B2. After that, B3 is an invalid block, so the leftover attributes are dropped. The final chain state will take the form shown in the diagram: as we can see, we lost A2, A3, A4, and A5.
.Example 2: Totally Invalid Span Batch / Totally Invalid Singular Batch
When deriving B1 safe, which is a new block, the engine queue immediately drops the entire batch. In this case, the unsafe head is never altered, staying at A5. The final state will take the form shown in the diagram. The same final state is reached when digesting an invalid singular batch: every singular batch derivation results in a single attribute, so there is no partially invalid state. An example case is shown in the diagram.
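A simplified Go sketch of the behavior both examples describe, with illustrative types only (Attribute, Block, and insertBlock are stand-ins, not the real op-node structures): each valid attribute advances the unsafe head, and the first invalid attribute drops the remainder of the batch.

```go
package example

import "errors"

// Illustrative placeholder types; the real op-node structures are richer.
type Attribute struct{ Valid bool }
type Block struct{ Number uint64 }

// insertBlock builds and validates the L2 block for one attribute.
func insertBlock(a Attribute, parent Block) (Block, error) {
	if !a.Valid {
		return Block{}, errors.New("invalid L2 attribute")
	}
	return Block{Number: parent.Number + 1}, nil
}

// consumeSpanBatch advances the unsafe head per valid attribute (Example 1)
// and drops the leftover attributes at the first invalid one; if the very
// first attribute is invalid, the unsafe head never moves (Example 2).
func consumeSpanBatch(attrs []Attribute, unsafeHead Block) Block {
	for _, attr := range attrs {
		next, err := insertBlock(attr, unsafeHead)
		if err != nil {
			return unsafeHead // leftover attributes are dropped
		}
		unsafeHead = next // fork choice update moves the unsafe head here
	}
	return unsafeHead
}
```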
Issue: Losing Unsafe Blocks Which May Eventually Become Canonical Hurts!
During happy-path sync, the distance between the unsafe head and the safe head may be drastically large. In this case, if a partially invalid span batch (like Example 1) is consumed by the rollup node, we will lose all synced unsafe blocks.
Let's focus more on Example 1. We lost A2, A3, A4, and A5 because of the invalid span batch. Since the digested span batch is invalid, A2, A3, A4, and A5 are more likely to become the canonical safe chain than B2.
Proposed Solution: Restore the Previous Unsafe Head on an Invalid Span Batch
I introduce two new fields, backupUnsafeHead and needFCUCallForBackupUnsafeReorg, to EngineController (which is embedded in the engine queue). If the currently known unsafe head's block number is greater than or equal to that of an attribute which is valid but differs from the previously known unsafe chain, backupUnsafeHead is initialized to the currently known unsafe head. It serves as a backup to be used when an invalid L2 block is detected: when that happens, needFCUCallForBackupUnsafeReorg is set to true.
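To make the bookkeeping concrete, a minimal Go sketch: the field names backupUnsafeHead and needFCUCallForBackupUnsafeReorg come from this PR, while L2BlockRef's shape and the two hook methods are hypothetical stand-ins.

```go
package example

// L2BlockRef is a stand-in for the real op-node block reference type.
type L2BlockRef struct{ Number uint64 }

// EngineController sketches only the two new fields from this PR; the real
// struct carries far more state.
type EngineController struct {
	unsafeHead L2BlockRef

	// backupUnsafeHead remembers the pre-reorg unsafe head; it is armed for
	// an FCU call once an invalid L2 block is detected mid-span-batch.
	backupUnsafeHead                L2BlockRef
	needFCUCallForBackupUnsafeReorg bool
}

// onConflictingAttributes saves the old unsafe head when a valid attribute is
// about to reorg an unsafe chain that is at least as long.
func (e *EngineController) onConflictingAttributes(attrBlockNum uint64) {
	if e.unsafeHead.Number >= attrBlockNum {
		e.backupUnsafeHead = e.unsafeHead
	}
}

// onInvalidPayload arms the backup reorg once an invalid L2 block shows up.
func (e *EngineController) onInvalidPayload() {
	if e.backupUnsafeHead != (L2BlockRef{}) {
		e.needFCUCallForBackupUnsafeReorg = true
	}
}
```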
Added a TryBackupUnsafeReorg method, which is similar to the InsertUnsafePayload method. The method tries to call fork choice update (let's abbreviate this as FCU) without payloadAttributes to restore the unsafe head to backupUnsafe. FCU can return these responses (see the spec):
- If the error is not an InputError, it may be a transient error (network error/server fault), so retry.
- If it is an InputError, do not retry and forget about backupUnsafe.
- If VALID, forget about backupUnsafe.
This is similar to https://github.com/ethereum-optimism/optimism/blob/develop/specs/derivation.md#l1-sync-payload-attributes-processing, but there is no reset.
If FCU returns a VALID payloadStatus, this means the execution engine successfully restored (reorged) the unsafe head using backupUnsafe. Only update the rollup node's state when FCU returns VALID; in other cases, forget about backupUnsafe (these rules are sketched in code below).
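Continuing the sketch above, a minimal rendering of the retry/forget rules; the fcu callback and errInput are stand-ins (the real method calls the engine RPC and inspects a payload status rather than a plain Go error).

```go
package example

import "errors"

// errInput stands in for the engine client's InputError type.
var errInput = errors.New("input error")

// TryBackupUnsafeReorg issues FCU (without payload attributes) pointing the
// unsafe head back at backupUnsafeHead, following the response rules above.
// The fcu parameter stands in for the real engine API call.
func (e *EngineController) TryBackupUnsafeReorg(fcu func(head L2BlockRef) error) error {
	if !e.needFCUCallForBackupUnsafeReorg {
		return nil // nothing to restore
	}
	err := fcu(e.backupUnsafeHead)
	switch {
	case err == nil:
		// VALID: the engine reorged back; adopt the backup, then forget it.
		e.unsafeHead = e.backupUnsafeHead
		e.forgetBackup()
		return nil
	case errors.Is(err, errInput):
		// InputError: not retryable; forget the backup without restoring.
		e.forgetBackup()
		return nil
	default:
		// Transient error (network/server fault): keep state and let the
		// engine queue retry on a later step.
		return err
	}
}

func (e *EngineController) forgetBackup() {
	e.backupUnsafeHead = L2BlockRef{}
	e.needFCUCallForBackupUnsafeReorg = false
}
```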
When TryBackupUnsafeReorg performs a network call, it fully consumes a single step of the engine queue. This design pattern is similar to that of the TryUpdateEngine method.
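As a standalone schematic of how such a call could be ordered within a single engine-queue step: only the method names TryBackupUnsafeReorg and TryUpdateEngine come from the PR; the interface, the pending-predicates, and the EngineQueue shape are assumptions.

```go
package example

import "context"

// engineAPI stands in for the engine controller's interface; only the method
// names TryBackupUnsafeReorg and TryUpdateEngine come from the PR.
type engineAPI interface {
	BackupReorgPending() bool
	TryBackupUnsafeReorg(ctx context.Context) error
	ForkchoiceUpdatePending() bool
	TryUpdateEngine(ctx context.Context) error
}

// EngineQueue is a stand-in wrapper around the engine controller.
type EngineQueue struct{ ec engineAPI }

// Step sketches the ordering: a pending backup reorg is serviced first, and
// each network call consumes the entire engine-queue step.
func (eq *EngineQueue) Step(ctx context.Context) error {
	if eq.ec.BackupReorgPending() {
		return eq.ec.TryBackupUnsafeReorg(ctx) // one network call, one step
	}
	if eq.ec.ForkchoiceUpdatePending() {
		return eq.ec.TryUpdateEngine(ctx) // same single-step pattern
	}
	// ... remaining derivation work (attributes processing, etc.) follows.
	return nil
}
```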
One thing that I would like to discuss: if the execution engine did not return a VALID payloadStatus, is it okay to simply progress the engine queue? Is there anything to do, like cleanup? Maybe calling FCU with the original unsafe head (not backupUnsafe) will be needed.
Tests
Added three e2e tests.
TestBackupUnsafe
Same situation as introduced in Example 1. The unsafe head will be restored back to A5 instead of B2.
TestBackupUnsafeReorgForkChoiceInputError
Same situation as introduced in Example 1. However, the execution engine is mocked to return an InputError when FCU is called. In this case, there will be no retry and no restoration will be done.
TestBackupUnsafeReorgForkChoiceNotInputError
Same situation as introduced in Example 1. However, the execution engine is mocked to return an error which is not an InputError when FCU is called. In this case, there will be retries and restoration will eventually be done, reorging the unsafe head back to A5.
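A hedged sketch of what those error-handling assertions could look like against the toy EngineController from the sketches above; fakeFCU and the test body are illustrative only, while the real tests are op-e2e tests driving a mocked execution engine.

```go
package example

import (
	"errors"
	"testing"
)

// fakeFCU returns a scripted response, standing in for a mocked execution
// engine's forkchoice-update call.
func fakeFCU(scripted error) func(L2BlockRef) error {
	return func(L2BlockRef) error { return scripted }
}

func TestBackupUnsafeReorgErrorHandling(t *testing.T) {
	t.Parallel()
	ec := &EngineController{
		backupUnsafeHead:                L2BlockRef{Number: 5}, // "A5"
		needFCUCallForBackupUnsafeReorg: true,
	}

	// Non-InputError: the error surfaces so the engine queue retries later,
	// and the backup stays armed.
	if err := ec.TryBackupUnsafeReorg(fakeFCU(errors.New("rpc timeout"))); err == nil {
		t.Fatal("expected transient error to surface for retry")
	}
	if !ec.needFCUCallForBackupUnsafeReorg {
		t.Fatal("backup should remain armed after a transient error")
	}

	// InputError: no retry; the backup is forgotten without restoring.
	if err := ec.TryBackupUnsafeReorg(fakeFCU(errInput)); err != nil {
		t.Fatal("InputError should not be retried")
	}
	if ec.needFCUCallForBackupUnsafeReorg {
		t.Fatal("backup should be forgotten after InputError")
	}
}
```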
.Additional context
This is not a consensus change; it only requires a node implementation update.
Partially invalid batches are very unlikely to be posted: it is very hard to construct a partially invalid batch unless there is a derivation/batching bug.