Add e2hs file format spec #368

Open
wants to merge 1 commit into
base: master

njgheorghita
Contributor

This is a proposal for a new file storage format to be used in the History network. The goals of this format are to:

  • unify pre- and post-merge data into a single format to simplify bridging logic
  • include proofs, so bridges don't need to manage the overhead of generating proofs every time they perform gossip
  • make post-merge receipts available, since no existing era file format contains receipts

This format will require additional architecture to generate new e2hs files for each finalized epoch, as well as some bridge work to support this format.

Maybe the only concern I can think of is that the new era files will be based on a portal-defined type, rather than more "established" native Ethereum types. So if we want to change HeaderWithProof at some point in the future, we will have to regenerate all of the files (though, imo, after the recent union removal this type has solidified).

Any feedback on the specifics or any pushback on using such a format would be great; there might be other shortcomings to this idea that I haven't noticed.

BlockTuple := CompressedHeader | CompressedBody | CompressedReceipts
-----
Version := {type: 0x3265, data: nil}
CompressedHeader := {type: 0x03, data: snappyFramed(rlp(HeaderWithProof))}
Member

Since this is a new type it should use a unique type number

Contributor Author

That's what I thought, but then I saw that era1 and e2ss use the same version number, so I wasn't sure what to do. But I agree it should be something unique

Contributor

I think the only type that is reused is 0x03 for snappyFramed(rlp(header)), but it's the same for all 3 existing types: era, era1 and e2ss.

You define new type snappyFramed(rlp(HeaderWithProof)), so it should be different.
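
To make the type-number point concrete, here is a minimal sketch of how an e2store reader dispatches on the type field. It assumes the usual e2store record layout (2-byte type, 4-byte little-endian length, 2 reserved zero bytes, then the data). The helper names `encode_entry` and `read_entries` are made up for illustration, and the payload is a placeholder rather than real snappy-framed data.

```python
import struct

# Assumed e2store record header: type (u16), length (u32), reserved (u16), little-endian.
HEADER = struct.Struct("<HIH")

VERSION = 0x3265          # Version type from the spec draft
COMPRESSED_HEADER = 0x03  # type number proposed in the draft for CompressedHeader


def encode_entry(type_id: int, data: bytes) -> bytes:
    """Serialize one record: 8-byte header followed by the raw data."""
    return HEADER.pack(type_id, len(data), 0) + data


def read_entries(buf: bytes):
    """Yield (type_id, data) for every record in the buffer."""
    pos = 0
    while pos < len(buf):
        type_id, length, reserved = HEADER.unpack_from(buf, pos)
        if reserved != 0:
            raise ValueError("reserved bytes must be zero")
        pos += HEADER.size
        yield type_id, buf[pos:pos + length]
        pos += length


# Placeholder payload; a real file would hold snappy-framed bytes here.
blob = encode_entry(VERSION, b"") + encode_entry(COMPRESSED_HEADER, b"payload")
entries = list(read_entries(blob))
```

A reader only sees the number `0x03` and has no way to tell whether the payload is a bare header or a HeaderWithProof, which is the argument for giving the new payload its own type number.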

CompressedHeader := {type: 0x03, data: snappyFramed(rlp(HeaderWithProof))}
CompressedBody := {type: 0x04, data: snappyFramed(rlp(BlockBody))}
CompressedReceipts := {type: 0x05, data: snappyFramed(rlp(Receipts))}
Accumulator := {type: 0x06, data: hash_tree_root(List(block_hash, 8192))}
Member

KolbyML, Feb 21, 2025

I would look into the shortfalls of historical roots and summaries. Beacon slots can be empty, so I am not sure this rigid requirement would work for the other accumulators. I am assuming this accumulator section is underspecified?

Contributor Author

Tbh I'm not entirely sure what the best approach is for this field. I'm not even sure it's necessary, since all the headers will have accompanying proofs that can be directly verified. I need to double-check exactly how it's handled, but even in era files they have a complete list of block_roots even if there are missed slots, so we can probably just copy their approach for handling missed slots. But I'll dig into this
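
For reference, here is a sketch of what the `Accumulator` line in the draft would compute if taken literally: the SSZ hash_tree_root of a list of 32-byte block hashes with a limit of 8192. It assumes only the standard SSZ merkleization rules (leaves zero-padded up to the limit, then the list length mixed in) and says nothing about how missed slots should be handled. `accumulator_root` is a hypothetical name.

```python
import hashlib

LIMIT = 8192  # list limit from the draft
DEPTH = 13    # 2**13 == 8192 leaves


def _hash_pair(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(left + right).digest()


# ZERO[d] is the root of an all-zero subtree of depth d.
ZERO = [b"\x00" * 32]
for _ in range(DEPTH):
    ZERO.append(_hash_pair(ZERO[-1], ZERO[-1]))


def accumulator_root(block_hashes: list) -> bytes:
    """hash_tree_root(List[Bytes32, 8192]) under the assumptions stated above."""
    if len(block_hashes) > LIMIT:
        raise ValueError("too many block hashes")
    if any(len(h) != 32 for h in block_hashes):
        raise ValueError("block hashes must be 32 bytes")
    layer = list(block_hashes)
    for depth in range(DEPTH):
        if len(layer) % 2 == 1:
            # Pad with the root of an empty subtree instead of materializing zero leaves.
            layer.append(ZERO[depth])
        layer = [_hash_pair(layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
    root = layer[0] if layer else ZERO[DEPTH]
    # mix_in_length: hash the merkle root with the list length as a 32-byte little-endian integer.
    return _hash_pair(root, len(block_hashes).to_bytes(32, "little"))
```

Because the length is mixed in, a partially filled list (for example a truncated final epoch) produces a different root from a full one even if the padding leaves are identical.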

@KolbyML
Member

KolbyML commented Feb 21, 2025

Then my final question to everyone is: does this spec belong in this repository? Here is a link which contains links to the specs of the 4 other pre-existing e2store formats: https://eth-clients.github.io/history-endpoints/

Most links are to PRs in projects, or linked commits, so I am not sure this belongs in the portal-network-specs repo. If anything, the specs should be consolidated into a new repo for e2store formats. @arnetheduck @kdeme what do you guys think? There was also talk of better formalizing the types to avoid number reuse. I am interested in what people think on the matter.

@njgheorghita
Contributor Author

Also, just noting here that this spec will also need to handle the final, pre-merge truncated epoch, so that post-merge e2hs files align with era file boundaries

Contributor

morph-dev left a comment

In general, I'm in favor of this direction

BlockTuple := CompressedHeader | CompressedBody | CompressedReceipts
-----
Version := {type: 0x3265, data: nil}
CompressedHeader := {type: 0x03, data: snappyFramed(rlp(HeaderWithProof))}
Contributor

I would use snappyFramed(ssz(HeaderWithProof)), because we don't encode HeaderWithProof with rlp anywhere at the moment (but we do ssz-encode it). And even if you wanted to rlp-encode it, I think it wouldn't work well.

I think BlockBody and Receipts are currently encoded with both rlp and ssz, depending on the context. But if we encode HeaderWithProof with ssz, I would encode body and receipts with it as well (for consistency).


```
e2hs := Version | BlockTuple* | OtherEntry* | Accumulator | BlockIndex
BlockTuple := CompressedHeader | CompressedBody | CompressedReceipts
```
Contributor

In comparison to the era1 format, we lost total_difficulty. Is the reasoning that we don't need it now that proofs are already part of the CompressedHeader?
Would we need it in some other context?

The file format is defined as:

```
e2hs := Version | BlockTuple* | OtherEntry* | Accumulator | BlockIndex
```
Contributor

What is OtherEntry?

@morph-dev
Contributor

> Also, just noting here that this spec will also need to handle the final, pre-merge truncated epoch, so that post-merge e2hs files align with era file boundaries

Why do we have to align e2hs with era file boundaries? My understanding is that era is aligned with 8192 slots, not blocks.

So if we want one clean type, primarily for portal network usage, I would suggest that:

  • each file will have exactly 8192 blocks (not slots), potentially spanning a critical fork threshold (e.g. the merge)
  • encode everything as ssz
  • we remove accumulator and maybe some other fields if they are not needed (e.g. BlockIndex can be just starting block number)

Basically, it would be each of the HistoryContentValue types, encoded as they are in portal-spec (ssz). And potentially some other meta-data, like starting index, total_difficulty, etc.
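
A small sketch of what "exactly 8192 blocks per file" implies for file boundaries. The constant `MERGE_BLOCK` (the first post-merge mainnet block) and the helper names are assumptions added for illustration; nothing here is taken from the spec draft.

```python
BLOCKS_PER_FILE = 8192
MERGE_BLOCK = 15_537_394  # assumed: first post-merge block on mainnet


def file_index(block_number: int) -> int:
    """Index of the file that would contain this block."""
    return block_number // BLOCKS_PER_FILE


def file_range(index: int) -> tuple:
    """Inclusive (first, last) block numbers covered by a file."""
    start = index * BLOCKS_PER_FILE
    return start, start + BLOCKS_PER_FILE - 1


merge_file = file_index(MERGE_BLOCK)
first, last = file_range(merge_file)
pre_merge_blocks = MERGE_BLOCK - first
```

Under these assumptions the file containing the merge holds both pre-merge and post-merge blocks, so a reader of that file has to handle both proof types within one file rather than relying on a truncated epoch.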

@njgheorghita
Contributor Author

> Why do we have to align e2hs with era file boundaries? My understanding is that era is aligned with 8192 slots, not blocks.

Ahh, that's a fair point. The only strong argument I can see for "aligning" with era files is to deal with HEAD - 8192 - or non-ephemeral latest (aka we want to be able to generate a new e2hs file as soon as each epoch is finalized). My assumption is that we want to gossip this content out to the network asap (specifically HeaderWithProofs as the Bodies & Receipts will already be gossiped). But, taking your comment into account... There is no way to "align" with era files 100% unless we switch to a slot-based period post-merge (not a good idea imo). And, in terms of handling the "latest" available HeaderWithProofs... we can just handle these cases in our "latest" bridge (idk, maybe expand the "latest" bridge to gossip all ephemeral data & the latest finalized epoch HeaderWithProofs), rather than force this storage format to accommodate the edge cases. A clean, simple storage format does seem preferable.

I'm ok with ssz-ing everything, I think @KolbyML might have preferred rlp-ing receipts/bodies in a side chat?

I'm a little unclear in my understanding of the purpose of the BlockIndex. In geth's era1 spec:

> BlockIndex stores relative offsets to each compressed block entry.

I understand this to mean that a per-block index is needed to seek directly to a specific block without iterating over the whole file. Though, looking at our era1 tooling, we don't really take advantage of this at all, and deserialize the whole file before indexing into a specific block tuple. This seems to work fine in our use case, but maybe it's worth leaving the indices in and improving our lookup logic?
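
A rough sketch of the direct lookup that BlockIndex is meant to enable. The layout used here (starting block number, one offset per block, then the count, all 8-byte little-endian signed integers) is my reading of the era1 spec and should be treated as an assumption; `encode_block_index` and `lookup_offset` are hypothetical helper names, and what the offsets are relative to is left to the spec.

```python
import struct


def encode_block_index(starting_number: int, offsets: list) -> bytes:
    """Pack starting-number | offset* | count as little-endian int64 values."""
    return struct.pack(f"<{len(offsets) + 2}q", starting_number, *offsets, len(offsets))


def lookup_offset(index_data: bytes, block_number: int) -> int:
    """Return the stored offset for one block without touching any other entry."""
    (starting_number,) = struct.unpack_from("<q", index_data, 0)
    (count,) = struct.unpack_from("<q", index_data, len(index_data) - 8)
    position = block_number - starting_number
    if not 0 <= position < count:
        raise KeyError(block_number)
    # Skip the starting number (8 bytes), then jump straight to the wanted slot.
    (offset,) = struct.unpack_from("<q", index_data, 8 + 8 * position)
    return offset


# Made-up offsets purely for demonstration.
index_data = encode_block_index(100, [-500, -300, -100])
```

With the index at a known position at the end of the file, a lookup costs two small reads (the index slot, then the block tuple) instead of decoding every entry.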

I can't think of any reason why the Accumulator is necessary for our purposes. Each header already has an accompanying proof. It's kind of nice to have a hash in the filename, and maybe this helps guard against people downloading "fake" e2hs files, but I don't think that really holds, since anyone can use a valid hash in the filename and invalid data in the file. Maybe we can update the hash in the filename to be a hash of the entire file contents, so it can be verified after being downloaded?
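
A minimal sketch of the "hash of the entire file contents" idea. The naming scheme (`<network>-<index>-<first 8 hex chars of sha256>.e2hs`) and both helper names are hypothetical, chosen only to mirror how era-style filenames embed a short hash.

```python
import hashlib

SHORT_HASH_LEN = 8  # hex characters kept in the filename (assumed)


def e2hs_filename(network: str, index: int, contents: bytes) -> str:
    """Build a filename that embeds a short hash of the whole file."""
    digest = hashlib.sha256(contents).hexdigest()
    return f"{network}-{index:05d}-{digest[:SHORT_HASH_LEN]}.e2hs"


def verify_download(filename: str, contents: bytes) -> bool:
    """Check that the downloaded bytes match the short hash in the filename."""
    stem = filename.rsplit(".", 1)[0]
    short_hash = stem.rsplit("-", 1)[-1]
    if len(short_hash) != SHORT_HASH_LEN:
        return False
    return hashlib.sha256(contents).hexdigest().startswith(short_hash)


name = e2hs_filename("mainnet", 1896, b"abc")
```

Eight hex characters only cover 32 bits, so this catches corruption and careless mistakes rather than a determined attacker; a stronger guarantee would need the full digest published somewhere trusted.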

@KolbyML
Member

KolbyML commented Feb 25, 2025

> I'm ok with ssz-ing everything, I think @KolbyML might have preferred rlp-ing receipts/bodies in a side chat?

I just want to do whatever is less work. The nice thing about rlp is we know the format will never change. But maybe we will change our format for ssz encoding them?

I am fine either way. ssz-ing everything just seemed like more work for little to no gain. If there is a gain I think we should do it, but it wasn't apparent to me that there would be value, especially as

[image]

our ssz bodies and receipts are just wrappers around rlp. It seemed pretty pointless when we already have the infrastructure in place: why create custom types for something where there is already a good solution, i.e. why reinvent the wheel when nothing is broken?

For HeaderWithProof it makes way more sense for it to be in ssz, as we don't have code to encode/decode rlp for it. HeaderWithProof is far more ssz native, hence why I said what I did

@morph-dev
Contributor

Regarding BlockIndex, you are right. I just forgot their use case. But we should keep them in.

Regarding rlp vs ssz, I'm not strongly set one way or the other... The benefits of ssz:

  • Consistency between types
  • If they are ssz, they should be identical to HistoryContentValue, and can potentially be gossiped directly (which seems to be the main use case), even without decoding
