integrate blobs with read pipeline #49

Closed
1 of 8 tasks
Tracked by #17
hannahhoward opened this issue Apr 10, 2024 · 8 comments

hannahhoward commented Apr 10, 2024

We need to make the gateway and hoverboard support blobs stored in the system. Here is the current rough plan:

  1. We should provide an ipni/offer handler that will perform writes to bridge with the current system
    • Agree on the payload format for the ipni/offer
    • Perform writes to the claims table that content claims reads from
    • Write to the DynamoDB index of blocks that hoverboard reads from
    • Extend the content claims format to support multihash-based addressing
    • Update carpark to remove CAR assumptions (e.g. get by multihash)
    • Wire up with the gateway (I could use some help here as I got lost trying to figure out how we actually read)
    • Produce equivalency claims for roundabout so that SPs can read piece segments
    • Enable Roundabout to read Blobs based on equivalency claims

@hannahhoward hannahhoward moved this to Sprint Backlog in Storacha Project Planning Apr 10, 2024
@reidlw reidlw changed the title address hash/cid mismatch between blobs and content claims integrate blobs with read pipeline Apr 10, 2024
@alanshaw

@vasco-santos says: Can you confirm that while we assume carpark-prod-0 in freeway, freeway will simply work? As in, is this the process prior to making read interfaces use content claims by default, without assuming buckets?

Unfortunately no...not yet. It uses partition and inclusion claims to do the work that dudewhere/satnav were providing. When it knows the parts (shards) and the index info it assumes the location is the R2 bucket.

In local.freeway I tweaked it to consider location claims, so we have the code to do what we need, it's just not in freeway yet.

@hannahhoward

@alanshaw + @Gozala figuring this out, needs IPNI work too -- may be blocking. 4/15/2024


vasco-santos commented Apr 23, 2024

Synced with @alanshaw earlier today on this, in order to understand the planned path for integrating blobs with the reads pipeline.

Write side

  • client will provide the bundle with the format spec'ed out via ipni/offer
  • the ipni/offer server handler will write into the old E-IPFS DynamoDB block-level index, using a well-known domain like https://w3s.link/blob/${encodedMultihash} plus a byte range (a rough sketch of such an index entry follows this list)
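
For illustration, a minimal sketch of what one block-level index entry could look like under the scheme above; the field names, helper, and exact schema are assumptions, not the agreed ipni/offer payload format:

```ts
// Illustrative only: field names and the exact index schema are assumptions,
// not the agreed ipni/offer payload format.
import { base58btc } from 'multiformats/bases/base58'

interface BlockIndexEntry {
  block: string    // base58btc-encoded multihash of the indexed block
  location: string // well-known URL of the blob containing the block
  offset: number   // byte offset of the block within the blob
  length: number   // byte length of the block
}

function toIndexEntry (blockDigest: Uint8Array, blobDigest: Uint8Array, offset: number, length: number): BlockIndexEntry {
  const encodedMultihash = base58btc.encode(blobDigest)
  return {
    block: base58btc.encode(blockDigest),
    location: `https://w3s.link/blob/${encodedMultihash}`,
    offset,
    length
  }
}
```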

Open questions:

  1. is the well known domain going to include query parameters to encode bucket/origin info?
  2. should the well known domain be a resolvable domain?

Read side

On the reads side, there are multiple interfaces, so let's see what would be the plan for each:

  • Hoverboard
    • will read from the E-IPFS DynamoDB table to find where a given block is stored, perform an HTTP request to get the block (or use an R2 client binding to do so, to avoid double egress billing) and serve it back
  • Freeway/w3s.link
    • we will look into a completely different approach from what we do today. For context, today we have an index of available shards (CAR files) per RootCID (dudewhere), and when we receive a request for a given RootCID, we get the CARv2 side indexes of each CAR and extract all the blocks needed in order to serve the root content / path
    • we will use a different approach, more along the lines of what Hoverboard clients would do: once a RootCID is requested, the block location is found, bytes are fetched, and we traverse the DAG getting all the blocks needed.
  • Roundabout
    • currently serves content via redirects to presigned URLs on an R2 bucket to optimize egress costs.
    • Requests come from Filecoin SPs with a PieceCID. We first need to fetch the equivalency claim that maps the requested PieceCID to its equivalent blob. Once we know the blob, we need to return a presigned URL for the location of this Blob (bucket name, URL, key, etc.). A rough sketch of this flow follows this list.
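
A rough sketch of that Roundabout flow, assuming an S3-compatible client for R2; `findEquivalencyClaim` is a hypothetical helper and the key format assumes the current store/* layout:

```ts
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'
import { getSignedUrl } from '@aws-sdk/s3-request-presigner'

// Hypothetical helper that resolves an equivalency claim for a Piece.
declare function findEquivalencyClaim (pieceCid: string): Promise<string | undefined>

async function resolvePiece (pieceCid: string, s3: S3Client, bucket: string): Promise<string> {
  // 1. Map the requested PieceCID to its equivalent content via an equivalency claim.
  const contentCid = await findEquivalencyClaim(pieceCid)
  if (!contentCid) throw new Error(`no equivalency claim found for ${pieceCid}`)
  // 2. Return a presigned URL so the SP downloads the bytes directly from R2.
  const command = new GetObjectCommand({ Bucket: bucket, Key: `${contentCid}/${contentCid}.car` })
  return getSignedUrl(s3, command, { expiresIn: 3600 })
}
```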

Open questions:

  1. how much work is needed on Freeway to shift to the new planned approach?
  2. how much do we want to rely on Content Claims given future plans? Do we want to hook read interfaces completely to the dynamo table, or to content claims with materialised claims from the dynamo table?
  3. can we write a lib that abstracts this read problem in a way that both freeway and hoverboard can just rely on? As a follow-up, we can consider alternative optimizations specific to the freeway use case

Compromises

Other "clients" of blobs that are not specific read interfaces, but either application layer or Filecoin dependencies will need to be compromised on first iteration. Tickets should be created to address these issues.

upload/add

  • currently upload/add receives a RootCID and an array of CAR Links
  • given we write the client, we can ship the first iteration assuming CARs
  • 🎯 we can assume CAR Links and put CAR Links as shards, so that upload/add does not need to change (an illustrative invocation shape is sketched after this list)
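
An illustrative invocation shape for this, following the w3up upload/add fields (root, shards); the exact schema should be checked against the capability spec rather than taken from this sketch:

```ts
import { CID } from 'multiformats/cid'

// Placeholders for values supplied by the client.
declare const spaceDID: string
declare const rootCid: CID
declare const carCid: CID

const uploadAdd = {
  can: 'upload/add',
  with: spaceDID,
  nb: {
    root: rootCid,    // root CID of the uploaded DAG
    shards: [carCid]  // CAR links; first iteration keeps assuming CAR shards
  }
}
```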

filecoin/offer

assert/equal

  • equivalency claims are currently critical in the Filecoin Pipeline, more specifically in Roundabout as previously described.
  • currently Storefront is responsible for issuing equivalency claims once Filecoin Pieces are validated in the filecoin/submit context.
  • 🎯 given equivalency claim capabilities also expect Links at the moment instead of multihashes, we can assume the CAR codec for the first iteration, or make claims able to accept a Blob and check this later (a payload sketch follows this list)
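
A hedged sketch of what such an equivalency claim payload might look like; the exact assert/equals schema should be checked against the content claims spec rather than taken from this example:

```ts
import { CID } from 'multiformats/cid'

// Placeholders for CIDs parsed elsewhere.
declare const carCid: CID
declare const pieceCid: CID

const equivalencyClaim = {
  can: 'assert/equals',
  with: 'did:web:web3.storage', // service DID, assumed for the first iteration
  nb: {
    content: carCid, // CAR CID (first iteration keeps assuming the CAR codec)
    equals: pieceCid // Piece CID asserted to be equivalent to the content
  }
}
```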


Gozala commented Apr 23, 2024

I have attempted to capture all the read points in the following document https://hackmd.io/@gozala/idx-publishing/edit

I also made a visual map of the read pipeline here https://www.tldraw.com/s/v2_c_rU_WdpZ_BFEhY5VOspdE7?v=-564%2C-230%2C3028%2C2127&p=page


Gozala commented Apr 23, 2024

I would propose implementing a unified index read / write interface so that we could switch all of the read interfaces over to it. This should reduce the complexity of the system and require less contextual knowledge from contributors. In terms of execution I suggest the following plan:

  1. Create a unified write interface, most likely it could be a /space/index/add capability handler
    • I would expect handler to take a payload per spec and perform writes to
      1. DUDEWHERE index
      2. SATNAV R2 index
      3. Dynamo prod-ep-v1-blocks-cars-position index
      4. Dynamo (NEEDS VERIFICATION) prod-content-claims-claim index
    • ❗️Note that we should write block range that spans full blob so that readers can resolve it
  2. Create a unified block level read interface (Probably as a part of content claims rest API)
    • Lookup by multihash should return location commitment (or something along those lines); a sketch of this interface follows the plan below
  3. Rewire gateway so that RAW block read would use ☝️ block level read
  4. Rewire hoverboard so that reads would use ☝️ block level read
  5. Create unified DAG level read interface (Probably as a part of content claims rest API)
    • Lookup by CID should return DAG index per spec
    • Should accept option to include location commitments and avoid roundtrips (maybe multipart response could be utilized here)
  6. Rewire freeway to utilize ☝️ so that blockstore can be derived from the index
  7. Create handler for publishing equivalency claims (used by roundabout)
    • Capability is already defined, we just need to implement the handler for it
    • We could make the sub/with field be did:web:web3.storage for now
  8. Update roundabout so that it uses block level read interface to resolve CAR / blob location
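
A sketch of the block level read interface from step 2; names are illustrative, not a spec:

```ts
interface LocationCommitment {
  url: string      // where the bytes live (e.g. an R2 / HTTP location)
  offset?: number  // byte range within the located object, when the block is a slice
  length?: number
}

interface BlockLevelRead {
  /** Look up a block by multihash and return its location commitment, if known. */
  locate (multihash: Uint8Array): Promise<LocationCommitment | undefined>
}
```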


alanshaw commented Apr 24, 2024

Missing: write to SQS multihashes queue for adding into IPNI

Dynamo (NEEDS VERIFICATION) prod-content-claims-claim index

Interested to know what we'd write here (see further down in this comment for proposal)? Also, I'd encourage not writing directly to the backing store but via the ucanto handler.

❗️Note that we should write block range that spans full blob so that readers can resolve it

Can you clarify? Do you mean we should write an entry for the blob itself, as well as for all the blocks it contains? If yes, then yes I agree 😄 .

Create handler for publishing equivalency claims (used by roundabout)

We have Ucanto handlers for publishing all existing kinds of content claims.

  • The Ucanto service (writes) runs at POST https://claims.web3.storage
  • The HTTP API (reads) runs at GET https://claims.web3.storage/claims/:cid

Create a unified block level read interface (Probably as a part of content claims rest API)

I'd love this to run via the existing read interface. I put a lot of work into building it and it would be a shame to not see it being used (or adapted).

Lookup by multihash should return location commitment (or something along those lines)

The HTTP API for content claims does this already, i.e. it falls back to DynamoDB if no other claims exist. It'll need small tweaks.

Create unified DAG level read interface (Probably as a part of content claims rest API)
Lookup by CID should return DAG index per spec

Note that the spec'd DAG index doesn't really fit well with existing claims. It is more akin to a relation claim.

Proposal for what claims to publish:

Just fitting into existing claims:

  • Location claim for each shard (CAR CID => URL)
  • Partition claim for DAG root (DAG CID => CAR CIDs)
  • Inclusion claim for each shard (CAR CID => DAG Index CID)

It's kinda weird that the inclusion claims all point to the same CID but 🤷‍♂️ .

Alternatively, we could create a new claim that is an "index" claim:

assert/index

{
  content: CID /* DAG root */,
  index: CID /* w3-index CID */,
}

So then you'd publish:

  • Location claim for each shard (CAR CID => URL)
  • Partition claim for DAG root (DAG CID => CAR CIDs)
  • Index claim for DAG root (DAG CID => DAG Index CID)

So then you publish fewer claims.

Note: you still need the partition claim so you can have location claims for the shards included in the response (see just below)

Should accept option to include location commitments and avoid roundtrips (maybe multipart response could be utilized here)

It already does this, and even provides options to get related claims via the ?walk= querystring parameter (see API doc). It is not multipart, but it is a CAR response that contains all the claims that match the queried CID.
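
A minimal example of reading claims via that HTTP API; the ?walk= values shown are illustrative, see the API doc for the supported fields:

```ts
async function readClaims (cid: string): Promise<Uint8Array> {
  const res = await fetch(`https://claims.web3.storage/claims/${cid}?walk=parts,includes`)
  if (!res.ok) throw new Error(`claims lookup failed: ${res.status}`)
  // The response is a CAR containing every claim matching the queried CID
  // (plus any related claims pulled in by the walk).
  return new Uint8Array(await res.arrayBuffer())
}
```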

Rewire freeway to utilize ☝️ so that blockstore can be derived from the index

It already does this, needs minor tweaks.

Update roundabout so that it uses block level read interface to resolve CAR / blob location

If the existing content claims HTTP API is the read interface then there is nothing to do here 😄

@reidlw reidlw moved this from Sprint Backlog to In Progress in Storacha Project Planning Apr 24, 2024

Gozala commented Apr 24, 2024

Missing: write to SQS multihashes queue for adding into IPNI

Good point, I have added storacha/w3up#1406 for this

Can you clarify? Do you mean we should write an entry for the blob itself, as well as for all the blocks it contains? If yes, then yes I agree 😄 .

Yes, although we identified a problem with the current spec that we'll need to address to do it, see

storacha/w3up#1405 (comment)

We have Ucanto handlers for publishing all existing kinds of content claims.

  • The Ucanto service (writes) runs at POST https://claims.web3.storage
  • The HTTP API (reads) runs at GET https://claims.web3.storage/claims/:cid

Awesome, I did discover that it is even used by the filecoin pipeline already 😍

Create a unified block level read interface (Probably as a part of content claims rest API)

I'd love this to run via the existing read interface. I put a lot of work into building it and it would be a shame to not see it being used (or adapted).

I want the same thing 🤩

Note that the spec'd DAG index doesn't really fit well with existing claims. It is more akin to a relation claim.

Proposal for what claims to publish:

Just fitting into existing claims:

  • Location claim for each shard (CAR CID => URL)
  • Partition claim for DAG root (DAG CID => CAR CIDs)
  • Inclusion claim for each shard (CAR CID => DAG Index CID)

It's kinda weird that the inclusion claims all point to the same CID but 🤷‍♂️ .

Alternatively, we could create a new claim that is an "index" claim:

assert/index

{
  content: CID /* DAG root */,
  index: CID /* w3-index CID */,
}

So then you'd publish:

  • Location claim for each shard (CAR CID => URL)
  • Partition claim for DAG root (DAG CID => CAR CIDs)
  • Index claim for DAG root (DAG CID => DAG Index CID)

So then you publish fewer claims.

We covered this in the call, but the short version is that the sharded DAG index is supposed to cover both partition and index claims; if something is missing we can extend it as needed to cover it all.

vasco-santos added a commit to storacha/w3infra that referenced this issue Apr 29, 2024
Part of storacha/project-tracking#49

Note that currently Roundabout is used in production traffic for SPs to
download Piece bytes, and is planned to be used by w3filecoin storefront
to validate a Piece CID.

## SP reads

1. An SP request comes with a PieceCID, for which we get an equivalency
claim mapping this Piece to some content.
2. In the current world (`store/*` protocol), it will in most cases be a
CAR CID that we can get from R2 `carpark-prod-0` as `carCid/carCid.car`.
However, `store/add` does not strictly require this to be a CAR, so it
could end up being another CID that is still stored with the same key
format in the R2 bucket.
3. With the new world (`blob/*` protocol), it will be a RAW CID that we
can get from R2 `carpark-prod-0` as
`b58btc(multihash)/b58btc(multihash).blob` (both key formats are sketched
below).
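
A small sketch of the two key formats described above, assuming the multiformats CID API for access to the multihash bytes:

```ts
import { CID } from 'multiformats/cid'
import { base58btc } from 'multiformats/bases/base58'

// store/* world: CAR (or other) CIDs are keyed as `${cid}/${cid}.car`
const carKey = (cid: CID) => `${cid}/${cid}.car`

// blob/* world: RAW CIDs are keyed by base58btc multihash as
// `${b58btc(multihash)}/${b58btc(multihash)}.blob`
const blobKey = (cid: CID) => {
  const mh = base58btc.encode(cid.multihash.bytes)
  return `${mh}/${mh}.blob`
}
```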

## w3filecoin reads

1. `filecoin/offer` is performed with a given content CID
2. In the current client world, a `CarCID` is provided on `filecoin/offer`.
This CID is used to get the bytes for the content, in order to derive the
Piece for validation. In addition, an equivalency claim is issued with the
`CarCID`.
3. With the new world, we aim to have `filecoin/offer` rely on RAW CIDs,
which will be used for both reading content and issuing equivalency
claims.

## This PR

We need a transition period where we support both worlds. 

This PR enables roundabout to attempt to distinguish between a Blob and
a CAR when it gets a retrieval request. If the CID requested is a CAR
(or a Piece that equals a CAR), we can assume the old path and key
format immediately. On the other hand, if the CID requested is RAW, we
may need to give back a Blob object or a "CAR"-like stored object.

For the transition period, this PR proposes that if we have RAW content
to locate, we MUST do a HEAD request to see if a Blob exists, and if so
redirect to a presigned URL for it. Otherwise, we need to fall back to
the old key formats. As an alternative, we could decide to make the
`store/add` handler no longer accept non-CAR CIDs, even though we would
lose the ability to retrieve old things from Roundabout (which may be
fine as well 🤔 ).
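
A rough sketch of that decision logic; the key helpers are placeholders, not the actual Roundabout code:

```ts
import { S3Client, HeadObjectCommand } from '@aws-sdk/client-s3'

async function keyForRawCid (s3: S3Client, bucket: string, blobKey: string, legacyKey: string): Promise<string> {
  try {
    // 1. HEAD the new-style Blob key; if it exists, redirect to a presigned URL for it.
    await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: blobKey }))
    return blobKey
  } catch {
    // 2. Otherwise fall back to the old key format used by store/add uploads.
    return legacyKey
  }
}
```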

Please note that this is still not hooked up with content claims to
figure out which bucket to use, and still relies on the assumption of CF
R2 `carpark-prod-0`. It just uses equivalency claims to map PieceCID to
ContentCID.
@alanshaw alanshaw self-assigned this Apr 29, 2024
@alanshaw

Pending deployment:

In progress PRs:

@github-project-automation github-project-automation bot moved this from In Progress to Done in Storacha Project Planning May 30, 2024