integrate blobs with read pipeline #49

Closed
1 of 8 tasks
Tracked by #17
hannahhoward opened this issue Apr 10, 2024 · 8 comments

hannahhoward commented Apr 10, 2024

We need to make the gateway and hoverboard support blobs stored in the system. Here is the current rough plan:

  1. We should provide an ipni/offer handler that will perform writes to bridge with the current system
    • Agree on the payload format for the ipni/offer
    • Perform writes to the claims table that content claims reads from
    • Write to the DynamoDB index of blocks that hoverboard reads from
    • Extend the content claims format to support multihash-based addressing
    • Update carpark to remove CAR assumptions (e.g. get by multihash)
    • Wire up with the gateway (I could use some help here as I got lost trying to figure out how we actually read)
    • Produce equivalency claims for roundabout so that SPs can read piece segments
    • Enable Roundabout to read Blobs based on equivalency claims

@hannahhoward hannahhoward moved this to Sprint Backlog in Storacha Project Planning Apr 10, 2024
@reidlw reidlw changed the title address hash/cid mismatch between blobs and content claims integrate blobs with read pipeline Apr 10, 2024
@alanshaw

@vasco-santos says: Can you confirm that while we assume carpark-prod-0 in freeway, freeway will simply work? As in, is this the process prior to making read interfaces use content claims by default, without assuming buckets?

Unfortunately no...not yet. It uses partition and inclusion claims to do the work that dudewhere/satnav were providing. When it knows the parts (shards) and the index info it assumes the location is the R2 bucket.

In local.freeway I tweaked it to consider location claims, so we have the code to do what we need, it's just not in freeway yet.

@hannahhoward

@alanshaw + @Gozala figuring this out, needs IPNI work too -- may be blocking. 4/15/2024


vasco-santos commented Apr 23, 2024

Synced with @alanshaw earlier today on this, in order to understand the planned path for integrating blobs with the reads pipeline.

Write side

  • client will provide the bundle with the format spec'ed out via ipni/offer
  • the ipni/offer server handler will write into the old E-IPFS DynamoDB block-level index, using a well-known domain like https://w3s.link/blob/${encodedMultihash} plus a byte range (a rough sketch of such an index entry follows this list)
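
For illustration, a minimal sketch of what one block-level index entry could look like under the scheme above; the field names, helper, and exact schema are assumptions, not the agreed ipni/offer payload format:

```ts
// Illustrative only: field names and the exact index schema are assumptions,
// not the agreed ipni/offer payload format.
import { base58btc } from 'multiformats/bases/base58'

interface BlockIndexEntry {
  block: string    // base58btc-encoded multihash of the indexed block
  location: string // well-known URL of the blob containing the block
  offset: number   // byte offset of the block within the blob
  length: number   // byte length of the block
}

function toIndexEntry (blockDigest: Uint8Array, blobDigest: Uint8Array, offset: number, length: number): BlockIndexEntry {
  const encodedMultihash = base58btc.encode(blobDigest)
  return {
    block: base58btc.encode(blockDigest),
    location: `https://w3s.link/blob/${encodedMultihash}`,
    offset,
    length
  }
}
```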

Open questions:

  1. is the well known domain going to include query parameters to encode bucket/origin info?
  2. should the well known domain be a resolvable domain?

Read side

On the reads side, there are multiple interfaces, so let's see what would be the plan for each:

  • Hoverboard
    • will read from the E-IPFS DynamoDB table to find where a given block is stored, perform an HTTP request to get the block (or use an R2 client binding to do so, to avoid double egress billing) and serve it back
  • Freeway/w3s.link
    • we will look into a completely different approach from what we do today. For context, today we have an index of available shards (CAR files) per RootCID (dudewhere), and when we receive a request for a given RootCID, we get the CARv2 side indexes of each CAR and extract all the blocks needed in order to serve the root content / path
    • we will use a different approach, more along the lines of what Hoverboard clients would do: once a RootCID is requested, the block location is found, bytes are fetched, and we traverse the DAG getting all the blocks needed.
  • Roundabout
    • currently serves content via redirects to presigned URLs on an R2 bucket to optimize egress costs.
    • Requests come from Filecoin SPs with a PieceCID. We first need to fetch the equivalency claim that maps the requested PieceCID to its equivalent blob. Once we know the blob, we need to return a presigned URL for the location of this Blob (bucket name, URL, key, etc.). A rough sketch of this flow follows this list.
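
A rough sketch of that Roundabout flow, assuming an S3-compatible client for R2; `findEquivalencyClaim` is a hypothetical helper and the key format assumes the current store/* layout:

```ts
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'
import { getSignedUrl } from '@aws-sdk/s3-request-presigner'

// Hypothetical helper that resolves an equivalency claim for a Piece.
declare function findEquivalencyClaim (pieceCid: string): Promise<string | undefined>

async function resolvePiece (pieceCid: string, s3: S3Client, bucket: string): Promise<string> {
  // 1. Map the requested PieceCID to its equivalent content via an equivalency claim.
  const contentCid = await findEquivalencyClaim(pieceCid)
  if (!contentCid) throw new Error(`no equivalency claim found for ${pieceCid}`)
  // 2. Return a presigned URL so the SP downloads the bytes directly from R2.
  const command = new GetObjectCommand({ Bucket: bucket, Key: `${contentCid}/${contentCid}.car` })
  return getSignedUrl(s3, command, { expiresIn: 3600 })
}
```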

Open questions:

  1. how much work is needed on Freeway to shift to the new planned approach?
  2. how much do we want to rely on Content Claims given future plans? Do we want to hook read interfaces completely to the dynamo table, or to content claims with materialised claims from the dynamo table?
  3. can we write a lib that abstracts this read problem in a way that both freeway and hoverboard can just rely on? As a follow-up, we can consider alternative optimizations specific to the freeway use case

Compromises

Other "clients" of blobs that are not specific read interfaces, but either application layer or Filecoin dependencies will need to be compromised on first iteration. Tickets should be created to address these issues.

upload/add

  • currently upload/add receives a RootCID and an array of CAR Links
  • given we write the client, we can ship the first iteration assuming CARs
  • 🎯 we can assume CAR Links and put CAR Links as shards, so that upload/add does not need to change (an illustrative invocation shape is sketched after this list)
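
An illustrative invocation shape for this, following the w3up upload/add fields (root, shards); the exact schema should be checked against the capability spec rather than taken from this sketch:

```ts
import { CID } from 'multiformats/cid'

// Placeholders for values supplied by the client.
declare const spaceDID: string
declare const rootCid: CID
declare const carCid: CID

const uploadAdd = {
  can: 'upload/add',
  with: spaceDID,
  nb: {
    root: rootCid,    // root CID of the uploaded DAG
    shards: [carCid]  // CAR links; first iteration keeps assuming CAR shards
  }
}
```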

filecoin/offer

assert/equal

  • equivalency claims are currently critical in the Filecoin Pipeline, more specifically in Roundabout as previously described.
  • currently Storefront is responsible for issuing equivalency claims once Filecoin Pieces are validated in the filecoin/submit context.
  • 🎯 given equivalency claim capabilities also expect Links at the moment instead of multihashes, we can assume the CAR codec for the first iteration, or make claims able to accept a Blob and check this later (a payload sketch follows this list)
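
A hedged sketch of what such an equivalency claim payload might look like; the exact assert/equals schema should be checked against the content claims spec rather than taken from this example:

```ts
import { CID } from 'multiformats/cid'

// Placeholders for CIDs parsed elsewhere.
declare const carCid: CID
declare const pieceCid: CID

const equivalencyClaim = {
  can: 'assert/equals',
  with: 'did:web:web3.storage', // service DID, assumed for the first iteration
  nb: {
    content: carCid, // CAR CID (first iteration keeps assuming the CAR codec)
    equals: pieceCid // Piece CID asserted to be equivalent to the content
  }
}
```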


Gozala commented Apr 23, 2024

I have attempted to capture all the read points in the following document https://hackmd.io/@gozala/idx-publishing/edit

I also made a visual map of the read pipeline here https://www.tldraw.com/s/v2_c_rU_WdpZ_BFEhY5VOspdE7?v=-564%2C-230%2C3028%2C2127&p=page


Gozala commented Apr 23, 2024

I would propose implementing a unified index read / write interface so that we could switch all of the read interfaces over to it. This should reduce the complexity of the system and require less contextual knowledge from contributors. In terms of execution I suggest the following plan:

  1. Create a unified write interface, most likely it could be a /space/index/add capability handler
    • I would expect handler to take a payload per spec and perform writes to
      1. DUDEWHERE index
      2. SATNAV R2 index
      3. Dynamo prod-ep-v1-blocks-cars-position index
      4. Dynamo (NEEDS VERIFICATION) prod-content-claims-claim index
    • ❗️Note that we should write block range that spans full blob so that readers can resolve it
  2. Create a unified block level read interface (Probably as a part of content claims rest API)
    • Lookup by multihash should return location commitment (or something along those lines); a sketch of this interface follows the plan below
  3. Rewire gateway so that RAW block read would use ☝️ block level read
  4. Rewire hoverboard so that reads would use ☝️ block level read
  5. Create unified DAG level read interface (Probably as a part of content claims rest API)
    • Lookup by CID should return DAG index per spec
    • Should accept option to include location commitments and avoid roundtrips (maybe multipart response could be utilized here)
  6. Rewire freeway to utilize ☝️ so that blockstore can be derived from the index
  7. Create handler for publishing equivalency claims (used by roundabout)
    • Capability is already defined, we just need to implement the handler for it
    • We could make the sub/with field be did:web:web3.storage for now
  8. Update roundabout so that it uses block level read interface to resolve CAR / blob location
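
A sketch of the block level read interface from step 2; names are illustrative, not a spec:

```ts
interface LocationCommitment {
  url: string      // where the bytes live (e.g. an R2 / HTTP location)
  offset?: number  // byte range within the located object, when the block is a slice
  length?: number
}

interface BlockLevelRead {
  /** Look up a block by multihash and return its location commitment, if known. */
  locate (multihash: Uint8Array): Promise<LocationCommitment | undefined>
}
```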


alanshaw commented Apr 24, 2024

Missing: write to SQS multihashes queue for adding into IPNI

Dynamo (NEEDS VERIFICATION) prod-content-claims-claim index

Interested to know what we'd write here (see further down in this comment for proposal)? Also, I'd encourage not writing directly to the backing store but via the ucanto handler.

❗️Note that we should write block range that spans full blob so that readers can resolve it

Can you clarify? Do you mean we should write an entry for the blob itself, as well as for all the blocks it contains? If yes, then yes I agree 😄 .

Create handler for publishing equivalency claims (used by roundabout)

We have Ucanto handlers for publishing all existing kinds of content claims.

  • The Ucanto service (writes) runs at POST https://claims.web3.storage
  • The HTTP API (reads) runs at GET https://claims.web3.storage/claims/:cid

Create a unified block level read interface (Probably as a part of content claims rest API)

I'd love this to run via the existing read interface. I put a lot of work into building it and it would be a shame to not see it being used (or adapted).

Lookup by multihash should return location commitment (or something along those lines)

The HTTP API for content claims does this already, i.e. it falls back to DynamoDB if no other claims exist. It'll need small tweaks.

Create unified DAG level read interface (Probably as a part of content claims rest API)
Lookup by CID should return DAG index per spec

Note that the spec'd DAG index doesn't really fit well with existing claims. It is more akin to a relation claim.

Proposal for what claims to publish:

Just fitting into existing claims:

  • Location claim for each shard (CAR CID => URL)
  • Partition claim for DAG root (DAG CID => CAR CIDs)
  • Inclusion claim for each shard (CAR CID => DAG Index CID)

It's kinda weird that the inclusion claims all point to the same CID but 🤷‍♂️ .

Alternatively, we could create a new claim that is an "index" claim:

assert/index

{
  content: CID /* DAG root */,
  index: CID /* w3-index CID */,
}

So then you'd publish:

  • Location claim for each shard (CAR CID => URL)
  • Partition claim for DAG root (DAG CID => CAR CIDs)
  • Index claim for DAG root (DAG CID => DAG Index CID)

So then you publish fewer claims.

Note: you still need the partition claim so you can have location claims for the shards included in the response (see just below)

Should accept option to include location commitments and avoid roundtrips (maybe multipart response could be utilized here)

It already does this, and even provides options to get related claims via the ?walk= querystring parameter (see API doc). It is not multipart, but it is a CAR response that contains all the claims that match the queried CID.
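
A minimal example of reading claims via that HTTP API; the ?walk= values shown are illustrative, see the API doc for the supported fields:

```ts
async function readClaims (cid: string): Promise<Uint8Array> {
  const res = await fetch(`https://claims.web3.storage/claims/${cid}?walk=parts,includes`)
  if (!res.ok) throw new Error(`claims lookup failed: ${res.status}`)
  // The response is a CAR containing every claim matching the queried CID
  // (plus any related claims pulled in by the walk).
  return new Uint8Array(await res.arrayBuffer())
}
```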

Rewire freeway to utilize ☝️ so that blockstore can be derived from the index

It already does this, needs minor tweaks.

Update roundabout so that it uses block level read interface to resolve CAR / blob location

If the existing content claims HTTP API is the read interface then there is nothing to do here 😄

@reidlw reidlw moved this from Sprint Backlog to In Progress in Storacha Project Planning Apr 24, 2024

Gozala commented Apr 24, 2024

Missing: write to SQS multihashes queue for adding into IPNI

Good point, I have added storacha/w3up#1406 for this

Can you clarify? Do you mean we should write an entry for the blob itself, as well as for all the blocks it contains? If yes, then yes I agree 😄 .

Yes, although we identified a problem with the current spec that we'll need to address to do it, see

storacha/w3up#1405 (comment)

We have Ucanto handlers for publishing all existing kinds of content claims.

  • The Ucanto service (writes) runs at POST https://claims.web3.storage
  • The HTTP API (reads) runs at GET https://claims.web3.storage/claims/:cid

Awesome, I did discover that it is even used by the filecoin pipeline already 😍

Create a unified block level read interface (Probably as a part of content claims rest API)

I'd love this to run via the existing read interface. I put a lot of work into building it and it would be a shame to not see it being used (or adapted).

I want the same thing 🤩

Note that the spec'd DAG index doesn't really fit well with existing claims. It is more akin to a relation claim.

Proposal for what claims to publish:

Just fitting into existing claims:

  • Location claim for each shard (CAR CID => URL)
  • Partition claim for DAG root (DAG CID => CAR CIDs)
  • Inclusion claim for each shard (CAR CID => DAG Index CID)

It's kinda weird that the inclusion claims all point to the same CID but 🤷‍♂️ .

Alternatively, we could create a new claim that is an "index" claim:

assert/index

{
  content: CID /* DAG root */,
  index: CID /* w3-index CID */,
}

So then you'd publish:

  • Location claim for each shard (CAR CID => URL)
  • Partition claim for DAG root (DAG CID => CAR CIDs)
  • Index claim for DAG root (DAG CID => DAG Index CID)

So then you publish fewer claims.

We covered this in the call, but the short version is that the sharded DAG index is supposed to cover both partition and index claims; if something is missing we can extend it as needed to cover it all.

vasco-santos added a commit to storacha/w3infra that referenced this issue Apr 29, 2024
Part of storacha/project-tracking#49

Note that currently Roundabout is used in production traffic for SPs to
download Piece bytes, and is planned to be used by w3filecoin storefront
to validate a Piece CID.

## SP reads

1. An SP request comes with a PieceCID, for which we get an equivalency
claim mapping this Piece to some content.
2. In the current world (`store/*` protocol), it will in most cases be a
CAR CID that we can get from R2 `carpark-prod-0` as `carCid/carCid.car`.
However, `store/add` does not strictly require this to be a CAR, so it
could end up being another CID that is still stored with the same key
format in the R2 bucket.
3. With the new world (`blob/*` protocol), it will be a RAW CID that we
can get from R2 `carpark-prod-0` as
`b58btc(multihash)/b58btc(multihash).blob` (both key formats are sketched
below).
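
A small sketch of the two key formats described above, assuming the multiformats CID API for access to the multihash bytes:

```ts
import { CID } from 'multiformats/cid'
import { base58btc } from 'multiformats/bases/base58'

// store/* world: CAR (or other) CIDs are keyed as `${cid}/${cid}.car`
const carKey = (cid: CID) => `${cid}/${cid}.car`

// blob/* world: RAW CIDs are keyed by base58btc multihash as
// `${b58btc(multihash)}/${b58btc(multihash)}.blob`
const blobKey = (cid: CID) => {
  const mh = base58btc.encode(cid.multihash.bytes)
  return `${mh}/${mh}.blob`
}
```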

## w3filecoin reads

1. `filecoin/offer` is performed with a given content CID
2. In the current client world, a `CarCID` is provided on `filecoin/offer`.
This CID is used to get the bytes for the content, in order to derive the
Piece for validation. In addition, an equivalency claim is issued with the
`CarCID`.
3. With the new world, we aim to have `filecoin/offer` rely on RAW CIDs,
which will be used for both reading content and issuing equivalency
claims.

## This PR

We need a transition period where we support both worlds. 

This PR enables roundabout to attempt to distinguish between a Blob and
a CAR when it gets a retrieval request. If the CID requested is a CAR
(or a Piece that equals a CAR), we can assume the old path and key
format immediately. On the other hand, if the CID requested is RAW, we
may need to give back a Blob object or a "CAR"-like stored object.

For the transition period, this PR proposes that if we have RAW content
to locate, we MUST do a HEAD request to see if a Blob exists, and if so
redirect to a presigned URL for it. Otherwise, we need to fall back to
the old key formats. As an alternative, we could decide to make the
`store/add` handler no longer accept non-CAR CIDs, even though we would
lose the ability to retrieve old things from Roundabout (which may be
fine as well 🤔 ).
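
A rough sketch of that decision logic; the key helpers are placeholders, not the actual Roundabout code:

```ts
import { S3Client, HeadObjectCommand } from '@aws-sdk/client-s3'

async function keyForRawCid (s3: S3Client, bucket: string, blobKey: string, legacyKey: string): Promise<string> {
  try {
    // 1. HEAD the new-style Blob key; if it exists, redirect to a presigned URL for it.
    await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: blobKey }))
    return blobKey
  } catch {
    // 2. Otherwise fall back to the old key format used by store/add uploads.
    return legacyKey
  }
}
```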

Please note that this is still not hooked up with content claims to
figure out which bucket to use, and still relies on the assumption of CF
R2 `carpark-prod-0`. It just uses equivalency claims to map PieceCID to
ContentCID.
@alanshaw alanshaw self-assigned this Apr 29, 2024
@alanshaw

Pending deployment:

In progress PRs:

@github-project-automation github-project-automation bot moved this from In Progress to Done in Storacha Project Planning May 30, 2024