Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible error re withdrawals empty or missing in rlp-decoded body #26647

Closed
holiman opened this issue Feb 9, 2023 · 5 comments
Closed

Possible error re withdrawals empty or missing in rlp-decoded body #26647

holiman opened this issue Feb 9, 2023 · 5 comments
Labels

Comments

@holiman
Copy link
Contributor

holiman commented Feb 9, 2023

Received via discord from @michaelsproul. CC @fjl


I think I may have identified a small Geth bug. I'm not sure though because my Go-fu is terrible so I haven't even tried to look at the code.

Out of the box, you can't sync Lighthouse+Geth on Zhejiang at the moment. Several users reported this, and I reproduced it just now.

The error that Geth reports is

ERROR[02-09|14:28:02.610] Beacon backfilling failed                err="retrieved hash chain is invalid: missing withdrawals in block body"
WARN [02-09|14:28:12.188] Marked new chain head as invalid         hash=530ef8..150093 badnumber=42314 badhash=1c5f2c..b07418

(full error: https://gist.github.com/michaelsproul/9082b5763f7cb90044a646f10aefeb38)

My current guess is that Geth is deserializing withdrawals: nil instead of withdrawals: [] when decoding blocks from devp2p. The block hash check passes (because nil and [] are RLP-equivalent, right?) but then Geth realises that there are no withdrawals when there should be based on the timestamp. So a valid block hash gets marked invalid and the chain breaks.

Nodes following the chain since genesis won't have had this issue because they'll have got a JSON execution payload straight from the CL. The reason it happens more often with Lighthouse is that we have an optimisation that skips the newPayload message while syncing, so we don't drip feed every payload to the EL (forcing the EL to download its own payloads). This feature can be turned off with --disable-optimistic-finalized-sync, and indeed syncing Lighthouse-Geth with this flag doesn't trigger the issue.

@holiman
Copy link
Contributor Author

holiman commented Feb 9, 2023

It seems to be ok. The rlp optional in geth behaves as I had expected, either

  ],
  [],
  [],
]

two empty lists ( txs, receipts) for blocks that are lacking withdrawals (pre-shanghai), or

  [],
  [],
  [],
]

Three empty lists, if the withdrawals are present but empty. No ambiguity on the rlp level.
I'll look into the decoder a bit more

@MariusVanDerWijden
Copy link
Member

MariusVanDerWijden commented Feb 9, 2023

Go unit test for the decoder, also confirms that the rlp decoding in geth looks okay

func TestWithdrawalEncoding(t *testing.T) {
	check := func(f string, got, want interface{}) {
		if !reflect.DeepEqual(got, want) {
			t.Errorf("%s mismatch: got %v, want %v", f, got, want)
		}
	}

	header := &Header{
		WithdrawalsHash: &EmptyRootHash,
	}
	withdrawals := make([]*Withdrawal, 0)
	block := NewBlock(header, nil, nil, nil, nil).WithWithdrawals(withdrawals)

	encRLP, err := rlp.EncodeToBytes(block)
	if err != nil {
		t.Fatal(err)
	}

	var decoded Block
	if err := rlp.DecodeBytes(encRLP, &decoded); err != nil {
		t.Fatal("decode error: ", err)
	}

	check("withdrawalsHash", decoded.header.WithdrawalsHash, header.WithdrawalsHash)
	if decoded.withdrawals == nil {
		panic("asdf")
	}
}

@michaelsproul
Copy link

michaelsproul commented Feb 10, 2023

I added some logging and indeed it seems to be a header with the correct hash, and a block body with nil withdrawals:

ERROR[02-10|13:23:52.785] Beacon backfilling failed err="retrieved hash chain is invalid: missing withdrawals in block body. withdrawals hash is 56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421, header hash is 1c5f2c9cbaa2e7c7abdb6fa73e7562beea21ce6af8a0e52a5634f6283db07418"

the code change:

+++ b/core/block_validator.go
@@ -71,7 +71,7 @@ func (v *BlockValidator) ValidateBody(block *types.Block) error {
        if header.WithdrawalsHash != nil {
                // Withdrawals list must be present in body after Shanghai.
                if block.Withdrawals() == nil {
-                       return fmt.Errorf("missing withdrawals in block body")
+                       return fmt.Errorf("missing withdrawals in block body. withdrawals hash is %x, header hash is %x", header.WithdrawalsHash, header.Hash())
                }
                if hash := types.DeriveSha(block.Withdrawals(), trie.NewStackTrie(nil)); hash != *header.WithdrawalsHash {
                        return fmt.Errorf("withdrawals root hash mismatch (header value %x, calculated %x)", *header.WithdrawalsHash, hash)

I also enabled debug logs to see which client was sending the faulty block body.

Just after the error occurred I had only 1 connected peer:

ERROR[02-10|13:23:52.785] Beacon backfilling failed err="retrieved hash chain is invalid: missing withdrawals in block body but hash is 56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421, header hash is 1c5f2c9cbaa2e7c7abdb6fa73e7562beea21ce6af8a0e52a5634f6283db07418"
INFO [02-10|13:23:53.224] Looking for peers peercount=1 tried=123 static=0

And just prior to that I see a request to peer d3185b8457361940 for block bodies:

DEBUG[02-10|13:23:52.701] Fetching batch of block bodies id=d3185b8457361940 conn=dyndial count=128

And right back at node start up the peer identification is Geth/v1.11.0 (not sure how to un-truncate the name to get the commit hash, but I suspect this could be one of the Zhejiang bootnodes).

Ethereum peer connected id=d3185b8457361940 conn=dyndial name=Geth/v1.11.0-unstabl...

Maybe the issue isn't actually a bug in Geth's serialisation/deserialisation but in the handling of dodgy data from misconfigured/malicious peers. I tried to poke around to find what validation is applied to block bodies after fetching headers, and AFAICT it only happens in the insertIterator here:

return it.chain[it.index], it.validator.ValidateBody(it.chain[it.index])

At that point, it seems the downloader expects bodies to be valid and won't recover if ValidateBody returns an error. This comment seems to suggest that at least:

// Downloaded blocks are always regarded as trusted after the
// transition. Because the downloaded chain is guided by the
// consensus-layer.

I went looking for other body validation and found the logic in DeliverBodies, but was confused when my debugging statements didn't show up for the offending block in question (42314). The debug logs I added are in this commit and the debug logs from a recent run with that commit are here. Maybe this omission is significant, as it seems to imply the logic here is being skipped?

if header.WithdrawalsHash == nil {
// discard any withdrawals if we don't have a withdrawal hash set
withdrawalLists[index] = nil
} else if withdrawalListHashes[index] != *header.WithdrawalsHash {
return errInvalidBody
}

All in all I'm thoroughly confused about what's going on, but can consistently reproduce this issue syncing Lighthouse + Geth on Zhejiang. It just takes about 20-30 minutes each time after nuking Geth's DB and re-syncing it. Someone more experienced with Geth would probably have an easier time debugging it.

@MariusVanDerWijden
Copy link
Member

@michaelsproul I think Mario has found the issue to be in EthereumJS. One problem is that both [] and nil hash to the same emptyRootHash, so we compute the header hash correctly even if the withdrawals are wrong.

One fix that we can do is to manually set the withdrawals list to [] if we have an emptyRootHash in the header (and the withdrawals are nil). This would fix it on our side. Either way ethereumjs needs to fix their encoding

@holiman
Copy link
Contributor Author

holiman commented Feb 15, 2023

Fixed by #26675

@holiman holiman closed this as completed Feb 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants