RFC: GraphSync integration #6208

Closed
hannahhoward opened this issue Apr 11, 2019 · 9 comments · Fixed by #9747

Comments

@hannahhoward
Contributor

Context

What is GraphSync?

GraphSync is a protocol to synchronize IPLD graphs across peers. The full specification can be found here: https://github.com/ipld/specs/blob/master/graphsync/graphsync.md

Where Bitswap deals with requesting individual blocks from other peers, Graphsync assumes that the data you're interested in is a node or set of nodes in an IPLD graph, and allows you to ask a remote peer to return the results of querying the graph with an IPLD Selector.

There are several potential use cases for requesting data this way, but perhaps the simplest to understand is getting a single node at a deeply nested path. Using only Bitswap, reaching a deeply nested path means several roundtrips: query for a block, receive it, follow a link, request the next block, and repeat until you reach the final link in the path. With Graphsync, you could make a query to a remote node using a path selector, have that node perform the traversal of the path locally, and have it send you all the blocks in the path at once, in a single roundtrip. (It has to send all the blocks in the path so that you can verify the results locally, starting from the block you already have.)
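To make the roundtrip difference concrete, here is a minimal toy model in Go. It does not use go-graphsync or Bitswap at all: the `block`, `store`, `resolve`, `bitswapStyle`, and `graphsyncStyle` names are hypothetical, the string IDs stand in for real CIDs, and the only thing it counts is request/response exchanges under the simplifying assumption that one block costs one exchange in the Bitswap-style case.

```go
package main

import "fmt"

// block is a toy stand-in for an IPLD node: it only holds named links.
type block struct {
	links map[string]string // path segment -> child ID (hypothetical string IDs, not real CIDs)
}

// store is a toy stand-in for the remote peer's blockstore.
type store map[string]block

// resolve walks path from root inside s and returns the IDs of every block
// on the way, root included. It stops early if a block or link is missing.
func resolve(s store, root string, path []string) []string {
	var visited []string
	cur := root
	for i := 0; ; i++ {
		b, ok := s[cur]
		if !ok {
			return visited
		}
		visited = append(visited, cur)
		if i == len(path) {
			return visited
		}
		next, ok := b.links[path[i]]
		if !ok {
			return visited
		}
		cur = next
	}
}

// bitswapStyle models the requester driving the walk itself: the blocks are
// the same, but each one is assumed to cost its own request/response exchange.
func bitswapStyle(remote store, root string, path []string) (blocks []string, roundTrips int) {
	blocks = resolve(remote, root, path)
	return blocks, len(blocks)
}

// graphsyncStyle models the remote peer walking the path locally and
// returning every block on it in a single response.
func graphsyncStyle(remote store, root string, path []string) (blocks []string, roundTrips int) {
	return resolve(remote, root, path), 1
}

// demoStore builds a small chain: root -a-> n1 -b-> n2 -c-> n3.
func demoStore() store {
	return store{
		"root": {links: map[string]string{"a": "n1"}},
		"n1":   {links: map[string]string{"b": "n2"}},
		"n2":   {links: map[string]string{"c": "n3"}},
		"n3":   {},
	}
}

func main() {
	path := []string{"a", "b", "c"}
	b, rt := bitswapStyle(demoStore(), "root", path)
	fmt.Printf("bitswap-style:   %d blocks, %d round trips\n", len(b), rt)
	g, rt := graphsyncStyle(demoStore(), "root", path)
	fmt.Printf("graphsync-style: %d blocks, %d round trips\n", len(g), rt)
}
```

Both strategies deliver the same set of blocks, which is what lets the requester verify the path locally; only the number of exchanges differs.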

Status of go-graphsync

go-graphsync is the initial implementation of GraphSync in Go. It is nearing alpha 'feature complete' status as of April 2019, and will be ready for use around the beginning of May. go-graphsync was initially written to support use cases in Filecoin. Filecoin's integration, however, is not likely to begin until circa Q3 2019.

Important Caveat: The initial implementation of go-graphsync is entirely single peer to single peer -- a graphsync request is made directly to only one peer at a time. It assumes that you've already found providers for the graph. It assumes you know the peer you are requesting from has the data you want, or that you will write the code to query multiple peers outside of go-graphsync.

Status of go-ipld-prime

go-ipld-prime is a complete rewrite of the Go implementation of the IPLD specification that the IPLD team has been working on for some time. It is relevant to GraphSync because:

  1. go-ipld-prime is the only implementation of IPLD in Go that supports IPLD selectors
  2. go-graphsync therefore relies on go-ipld-prime in its implementation and assumes that the underlying node format of the DAG is compatible with go-ipld-prime (relevant in particular because go-ipld-prime does not currently support nodes encoded in protobuf)

go-ipld-prime is a very different-looking library from go-ipld-format, and switching all or parts of IPFS to use it is a potentially large task.

IPFS Use Cases For Graphsync

This RFC is in part to identify potential use cases for GraphSync

The most obvious use case for Graphsync is working with UnixFS directories. GraphSync provides an efficient method for requesting a deeply nested path in a UnixFS directory. With augmentation, it might provide a more efficient way to transfer entire UnixFS directories, without needing as many roundtrips to traverse the directory and request more nodes.

There are potential other use cases anywhere the data that IPFS works with is in a DAG structure. We can use discussion in this issue to identify some of these use cases.

Integration Path, Questions, Challenges

UnixFS & GraphSync

As stated before, GraphSync relies on go-ipld-prime which currently does not support nodes that use an internal protobuf serialization format. UnixFS, at least in its v1 implementation, uses protobufs for serializing nodes.

Therefore, to integrate GraphSync with UnixFS, we would either need to augment go-ipld-prime with protobuf support OR we would need to first complete UnixFS v2, which is intended to be based on go-ipld-prime directly. Given that supporting protobufs in go-ipld-prime is non-trivial and potentially quite challenging, and moreover that having UnixFS v2 complete would unlock a number of potential features, it seems like it would be much easier to simply prioritize UnixFS v2 and wait on its completion to integrate GraphSync.

Independent CoreAPI GraphSync

While not as potentially useful to users of IPFS, or most importantly package managers, it would be useful for real world testing to be able to make GraphSync queries in the real IPFS network from the IPFS Core API or the command line. Enabling GraphSync in the CoreAPI would also allow people writing applications on top of IPFS to potentially experiment with how GraphSync might unlock new types of uses for IPFS.

The path to CoreAPI GraphSync integration is potentially much shorter than integration in UnixFS: it simply requires agreeing on a specification for what the CoreAPI function signatures would look like, and then implementing them. There are no obvious blockers for proceeding with CoreAPI integration once go-graphsync is feature complete.

Future Integrations / Supporting Providers/DHT Work

The ability to query into a graph from a root node might pair well with efforts to experiment with different strategies for providing only some nodes in a DAG. However, this probably will need to wait for further work on provider strategies.

@Stebalien
Member

Therefore, to integrate GraphSync with UnixFS, we would either need to augment go-ipld-prime with protobuf support OR we would need to first complete UnixFS v2, which is intended to be based on go-ipld-prime directly. Given that supporting protobufs in go-ipld-prime is non-trivial and potentially quite challenging, and moreover that having UnixFS v2 complete would unlock a number of potential features, it seems like it would be much easier to simply prioritize UnixFS v2 and wait on its completion to integrate GraphSync.

Supporting DagPB (protobuf nodes) should actually be pretty easy, unless I'm missing something. The real issue here is that UnixFSv1 stores another protobuf inside this DagPB node. From the perspective of IPLD, this inner protobuf is "just bytes". That means selectors aren't going to understand it in any meaningful way.

A good first approach may be to integrate go-ipld-prime into go-ipfs first (no selectors). That should make integrating go-graphsync easier.

Regardless, I'd like to prioritize UnixFSv2. Also, take a look at ipld/legacy-unixfs-v2#24.

@Stebalien
Member

The path to CoreAPI GraphSync integration is potentially much shorter than integration in UnixFS: it simply requires agreeing on a specification for what the CoreAPI function signatures would look like, and then implementing them. There are no obvious blockers for proceeding with CoreAPI integration once go-graphsync is feature complete.

This sounds like an improved dag API, right? ipfs dag select .... Or were you thinking about something else?

@hannahhoward
Contributor Author

@Stebalien yes the DAG API you mention above is somewhat what I had in mind.

@hannahhoward
Contributor Author

And the value of an ipfs dag select is that it potentially allows us to integrate graphsync before the move to unixfs v2 / go-ipld-prime, which is a giant task.

@musalbas

I was wondering if there has been any progress on this issue, and if there are any plans to integrate GraphSync into IPFS as a replacement to Bitswap.

We are using IPFS for block distribution, and in our protocol we use "data availability proofs" where nodes only need to download chunks from blocks and their associated Merkle proofs: celestiaorg/celestia-core#35

As each node in the proof requires a round trip to traverse, Bitswap's latency is high for this. We have done some experiments and found that the average latency to do a data availability check on a block is about 2 seconds: celestiaorg/ipld-plugin-experiments#9

@Wondertan
Member

Wondertan commented Feb 17, 2021

@musalbas, the latency you are seeing is due to the nature of the NMT, which is a binary tree rather than a wide DAG. That is, IPFS fetches only 2 links per roundtrip with the NMT, while an IPFS DAG currently maxes out at 174, which makes it somewhat usable.
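A quick back-of-the-envelope sketch of why fan-out matters here, assuming one roundtrip per tree level (each level's links are only known once the level above has arrived). The `levels` helper and the figure of 1024 leaves are illustrative assumptions, not numbers from the experiments above; 2 and 174 are the fan-outs mentioned in this comment.

```go
package main

import "fmt"

// levels returns how many link levels sit between the root and the leaves
// of a full tree with the given fan-out that holds at least `leaves` leaves.
// Under the one-roundtrip-per-level assumption, this is also the number of
// sequential roundtrips needed after the root is fetched.
// It is meant for modest inputs; very large values could overflow int.
func levels(leaves, fanout int) int {
	if leaves <= 1 || fanout < 2 {
		return 0
	}
	n := 0
	for reach := 1; reach < leaves; reach *= fanout {
		n++
	}
	return n
}

func main() {
	const leaves = 1024 // illustrative size only
	fmt.Println("binary tree (fan-out 2):", levels(leaves, 2), "levels")
	fmt.Println("wide DAG (fan-out 174): ", levels(leaves, 174), "levels")
}
```

With these assumptions, the same number of leaves sits 10 levels deep in a binary tree but only 2 levels deep at a fan-out of 174, which is the gap in sequential roundtrips the comment is pointing at.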

In LazyLedger's (LL) case, using Graphsync (GS) is crucial, and luckily it already exists within IPFS. That is not a full integration, though: IPFS must first migrate to ipld-prime in order to operate over GS, because GS relies on the new IPLD and specifically on its featureful selectors. LL's IPLD plugin therefore has to migrate as well to support GS and, importantly, selectors, so that it can fetch e.g. the whole batch of data related to some namespace from Storage nodes without any roundtrips at all. Thus, I suggest you first redesign the current NMT integration around the new IPLD and then use GS directly, without waiting for it to be fully integrated into IPFS, as Lotus currently does. Thanks to the PL team for the modular nature of their protocols :)

cc @liamsi

@musalbas

Does the current support make use of IPFS peer discovery or do we have to manually input nodes to get them to exchange data over GraphSync? Note that "storage nodes" in the LazyLedger paper is a redundant concept and should actually be replaced with "IPFS", so that data is downloaded via IPFS instead of a specific storage node. IPFS itself is treated as a black-box "storage node".

@willscott
Contributor

@musalbas There is some queued up work for bitswap/graphsync in IPFS that should help 😄

Jorropo added a commit that referenced this issue Nov 17, 2023
Updates: #9396
Closes: #6831
Closes: #6208

Currently the Graphsync server is not widely used due to a lack of compatible software.
Many years have passed, yet we are unable to find any production software making use of the graphsync server in Kubo.

Some exists in the Filecoin ecosystem, but we are not aware of any uses with Kubo.
Even in Filecoin, graphsync is no longer the only data transfer solution available, as it could have been in the past.

`go-graphsync` is also developed on many concurrent branches.
The specification for graphsync is less clear than the trustless gateway one and lacks a complete conformance test suite that any implementation can run.
It is not easily extensible either, because selectors are too limited for interesting queries without sideloading ADLs, which for now are hardcoded solutions.
Finally, Kubo is consistently one of the fastest pieces of software to update to a new go-libp2p release.
This means the burden of tracking go-libp2p changes in go-graphsync falls on us, or else Kubo cannot compile, even though almost all users do not use this feature.
We are therefore removing the graphsync server experiment.

For people who want alternatives, we would like you to try the Trustless-Gateway-over-Libp2p experiment instead; the protocol is simpler (request-response based) and lets us reuse both clients and servers with minimal injection in the network layer.
If you think this is a mistake and we should put it back, you should try to answer these points:
- Find a piece of open-source code which uses a graphsync client to download data from Kubo.
- Why is Trustless-Gateway-over-Libp2p not suitable instead?
- Why is bitswap not suitable instead?

Implementation details such as go-graphsync performance vs boxo/gateway are not very interesting to us in this discussion unless the differences are really huge (in the range of 10x~100x+ more), because the gateway code is under heavy development and we would be interested in fixing these.
Jorropo added a commit that referenced this issue Nov 18, 2023
Jorropo added a commit that referenced this issue Nov 22, 2023
Jorropo added a commit that referenced this issue Nov 22, 2023