Proposal: IPFS Content Providing #31

Merged (4 commits, Jun 23, 2021): `proposals/ipfs-content-providing.md`, +164 lines
# Improve IPFS Content Providing

Authors: @aschmahmann

Initial PR: https://github.com/protocol/web3-dev-team/pull/31

<!--
This template is for a proposal/brief/pitch for a significant project to be undertaken by a Web3 Dev project team.
The goal of project proposals is to help us decide which work to take on, which things are more valuable than other things.
-->
<!--
A proposal should contain enough detail for others to understand how this project contributes to our team’s mission of product-market fit
for our unified stack of protocols, what is included in scope of the project, where to get started if a project team were to take this on,
and any other information relevant for prioritizing this project against others.
It does not need to describe the work in much detail. Most technical design and planning would take place after a proposal is adopted.
Good project scope aims for ~3-5 engineers for 1-3 months (though feel free to suggest larger-scoped projects anyway).
Projects do not include regular day-to-day maintenance and improvement work, e.g. on testing, tooling, validation, code clarity, refactors for future capability, etc.
-->
<!--
For ease of discussion in PRs, consider breaking lines after every sentence or long phrase.
-->

## Purpose & impact
#### Background & intent

_What is the desired state of the world after this project? Why does that matter?_

<!--
Outline the status quo, including any relevant context on the problem you’re seeing that this project should solve. Wherever possible, include pains or problems that you’ve seen users experience to help motivate why solving this problem works towards top-line objectives.
-->

Currently go-ipfs users are able to utilize the public IPFS DHT to find who has advertised they have a given CID in under 1.5s in 95+% of cases. However, the process of putting those advertisements into the DHT is slow (e.g. 1 minute) and is a bottleneck for users trying to make their content discoverable. Users who have moderate amounts of content on their nodes complain that, as a result of their nodes' inability to advertise quickly enough, their content is hard to find in the DHT. Additionally, the measures users can take to reduce the number of provider records they emit, such as only reproviding the roots of graphs (see [reprovider strategies](https://github.com/ipfs/go-ipfs/blob/09178aa717689a0ef9fd2042ad355320a16ffb35/docs/config.md#reproviderstrategy)), are not generally recommended due to outstanding issues such as the inability to resume downloads of a DAG.

While R&D work on larger-scale improvements to content routing is ongoing, we can still take the opportunity now to make our existing system more usable and alleviate much of our users' existing pain with content routing.

After completion of this project, the state should be that go-ipfs users with lots of data are able to set up nodes that can put at least 100M records into the DHT per day. Additionally, users should be able to avoid advertising data that is not likely to be accessed independently (e.g. blocks that are part of a compressed file).


#### Assumptions & hypotheses
_What must be true for this project to matter?_
<!--(bullet list)-->
- The IPFS public DHT content provider subsystem is insufficient for important users
- The work is useful even though a more comprehensive solution will eventually be put forward, meaning either:
- Users are not willing to wait, or ecosystem growth is throttled, until we build a more comprehensive content routing solution
- The changes made here are either useful independent of major content routing changes, or the changes are able to inform or build towards a more comprehensive routing solution

**Comment:**
I think these projects are also useful for some byproducts they will have (worth counting):

- They will probably entail designing/implementing extensible provider records (needed for payment systems, etc.)
- They will probably entail upgrading the blockstore to a ref-counted, timestamped partial DAG store, which is integral going forward for (i) any content-routing caching algorithm and (ii) garbage collection.

**Reply (Contributor Author):** This would be nice, but I'm shrinking the scope here so we don't necessarily have to tackle these together.

#### User workflow example
_How would a developer or user use this new capability?_
<!--(short paragraph)-->

Users who use go-ipfs would be able to tell what percentage of their provider records have made it out to the network in a given interval, and would notice more of their content being discoverable via the IPFS public DHT. Additionally, users would have a number of configurable options available to them, both to modify the throughput of their provider record advertisements and to advertise fewer provider records (e.g. only advertising pin roots).
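To make the "what percentage made it out" idea concrete, here is a minimal sketch in Go of the kind of bookkeeping such reporting could expose. All names here (`ProvideStats`, `PercentComplete`) are hypothetical, not an existing go-ipfs API:

```go
package main

import (
	"fmt"
	"time"
)

// ProvideStats summarizes reprovide progress over a rolling window.
// (Hypothetical type for illustration; not part of go-ipfs.)
type ProvideStats struct {
	Window    time.Duration // reporting interval, e.g. 24h
	Total     int           // provider records the node intends to publish
	Published int           // records successfully put to the DHT in the window
	Failed    int           // records whose puts errored or timed out
}

// PercentComplete reports how much of the provide queue made it out.
func (s ProvideStats) PercentComplete() float64 {
	if s.Total == 0 {
		return 100
	}
	return 100 * float64(s.Published) / float64(s.Total)
}

func main() {
	s := ProvideStats{Window: 24 * time.Hour, Total: 1_000_000, Published: 650_000, Failed: 12_000}
	fmt.Printf("published %.1f%% of %d records in the last %s (%d failed)\n",
		s.PercentComplete(), s.Total, s.Window, s.Failed)
}
```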
**Comment (Member):**
I remember discussing this one time. It would be a huge improvement for most real-world uses (package managers, Wikipedia snapshots).

**Suggested change:** extend the example at the end to "(e.g. only advertising pin roots, or only the root of each file if unixfs)".

**Reply (Contributor Author):**

I'd like to add this in too, but it might be out of scope for this project. It's an extra feature which, while valuable, might not be as high value as the other ones here.


#### Impact
_How directly important is the outcome to web3 dev stack product-market fit?_

🔥🔥🔥 = 0-3 emoji rating

<!--
Explain why you have chosen this rating
What awesome potential impact/outcomes/results will we see if we nail this project?
-->

Probably the most visible primitive in the web3 dev stack is content addressing, which allows someone to retrieve data via its CID no matter who has it. However, while content addressing allows a user to retrieve data from **anyone**, it is still critical that there are systems in place that allow a user to find **someone** who has the data (i.e. content routing).

Executing well here would make it easier for users to utilize the IPFS public DHT, the most widely visible content routing solution in the IPFS space. This would dramatically improve both the onboarding experience for new users and the experience of existing users, likely leading to ecosystem growth.

**Comment:**
It would presumably also meet a specific ask from Pinata.

#### Leverage
_How much would nailing this project improve our knowledge and ability to execute future projects?_

🎯🎯🎯 = 0-3 emoji rating

<!-- Explain the opportunity or leverage point for our subsequent velocity/impact (e.g. by speeding up development, enabling more contributors, etc)
-->

Many of the components of this proposal increase development velocity by either exposing more precise tooling for debugging or working with users, or by directly enabling future work.

**Comment:** These projects will also likely further decouple content routing (and the complex caching algorithms it utilizes) from specific applications like bitswap and graphsync.

**Comment:** Thus enabling higher app-developer velocity.

**Reply (Contributor Author):** This might be true, but isn't necessarily the case in the MVP here.

#### Confidence
_How sure are we that this impact would be realized? Label from [this scale](https://medium.com/@nimay/inside-product-introduction-to-feature-priority-using-ice-impact-confidence-ease-and-gist-5180434e5b15)_.

<!--Explain why this rating-->
2. We don't have direct market research demonstrating that improving the resiliency of content routing will definitely lead to more people choosing IPFS or working with the stack. However, this is a pain point for many of our users (as noted on the IPFS Matrix, Discuss, and GitHub) and something we have encountered as an issue experienced by various major ecosystem members (Protocol Labs infra, Pinata, Infura, etc.).
**Comment (Contributor):**

Do we have more data on:

  1. How this pain point has impacted them (e.g., has it prevented certain use cases)?
  2. How have they worked around it?
  3. What kind of performance they're expecting?

**Reply (Contributor Author):**

1. It's been a problem for some use cases like package management (e.g. ipfs and pacman, ipfs/notes#84; IPFS and Gentoo Portage distfiles, ipfs/notes#296), and pinning services have had difficulty as well.
2. Applications can partially get around this by advertising application names (e.g. `myApp`) instead of data CIDs, but this falls apart as the number of application users grows (a sketch of the idea follows). For certain use cases ipfs-cluster could come in handy as well. Pinning services have a few different approaches, basically: 1) build a custom reprovider that tries to be a bit faster (although mostly by throwing more resources and parallelism at the problem rather than tweaking the underlying DHT client usage), and 2) run with really high connection limits so they're connected to tons of peers, and permanently connect to major gateways.
3. I'm not sure, but mostly they just want data added to go-ipfs to be made available for downloading without worrying about it and without it being crazy expensive to run.
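For reference, a minimal sketch of the "advertise an application name" workaround described in point 2, using the real go-cid and go-multihash libraries to derive a shared rendezvous CID; the function name and overall pattern here are illustrative, not a prescribed mechanism:

```go
package main

import (
	"fmt"

	cid "github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

// appRendezvousCID derives a deterministic CID from an application name,
// so every node in the app can provide and look up one shared record
// instead of advertising every data CID individually.
func appRendezvousCID(app string) (cid.Cid, error) {
	h, err := mh.Sum([]byte(app), mh.SHA2_256, -1)
	if err != nil {
		return cid.Undef, err
	}
	return cid.NewCidV1(cid.Raw, h), nil
}

func main() {
	c, err := appRendezvousCID("myApp")
	if err != nil {
		panic(err)
	}
	// Nodes would then Provide(c) and FindProviders(c) via the DHT.
	fmt.Println(c)
}
```

As noted above, this stops scaling once an app has many users, since all of them map onto the same handful of DHT records.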

**Reply (Contributor):** Thanks.

**Comment:**
If you have any questions on this, @BigLep feel free to ask :)


## Project definition
#### Brief plan of attack
**Comment (Contributor):**

Are there any new test scenarios that we'd need to develop? For example, as part of CI, should we have a test that asserts X advertisements can be made within Y seconds?

**Reply (Contributor Author):**

It'd be nice to do this in CI, especially if those tests are publicly viewable. However, it wouldn't be so bad to just check in on our metrics, since they report performance on go-ipfs master and the latest release, and they already include metrics on provide speed. However, if we want to test some of the massive providing strategies (e.g. huge routing tables + many provides) we'll likely need some more testing.

**Reply (Contributor):**

Got it. I don't know the landscape well enough to have more input. A couple more thoughts:

  1. If there is fear of regression here, then having a test that can catch that seems reasonable.
  2. If we are going to advertise that customers with massive providing strategies will see improved performance, I think we'll want to verify this in some way and should include that in the work plan.


<!--Briefly describe the milestones/steps/work needed for this project-->

- Enable downloading sub-DAGs when a user already has the root node, but is only advertising the root node
  - e.g. have Bitswap sessions know about the graph structure and walk up the graph to find providers when low on peers (a sketch of this idea follows this list)
- Add a new command to `go-ipfs` (e.g. `ipfs provide`) that at minimum allows users to see how many of their total provider records have been published (or failed) in the last 24 hours
- Add an option to go-libp2p-kad-dht for very large routing tables that are stored on disk and are periodically updated by scanning the network
- Make IPFS public DHT `put`s take <3 seconds (i.e. come close to `get` performance)
  - Some techniques available include:
    - Decreasing DHT message timeouts to more reasonable levels
    - [Not requiring](https://github.com/libp2p/go-libp2p-kad-dht/issues/532) the "followup" phase for puts
    - Not requiring responses from all 20 peers before returning to the user
    - Not requiring responses from the 3 closest peers before aborting the query (e.g. perhaps 5 of the closest 10)
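As referenced in the first bullet, a minimal sketch of the "walk up the graph" idea, assuming a generic FindProviders-style router; the interface and function names are invented for this sketch, not go-bitswap's actual API:

```go
package main

import (
	"context"
	"fmt"
)

type CID string

// Router abstracts provider lookup; in go-ipfs this role is played by
// the DHT client. (Interface invented for this sketch.)
type Router interface {
	FindProviders(ctx context.Context, c CID) ([]string, error)
}

// findProvidersViaAncestors walks from a wanted block toward the root,
// returning the first providers it finds. parent reports the parent of
// a block within the session's known graph (ok=false at the root).
func findProvidersViaAncestors(ctx context.Context, r Router, want CID, parent func(CID) (CID, bool)) ([]string, error) {
	for c, ok := want, true; ok; c, ok = parent(c) {
		provs, err := r.FindProviders(ctx, c)
		if err != nil {
			return nil, err
		}
		if len(provs) > 0 {
			// A peer advertising an ancestor likely holds the whole sub-DAG.
			return provs, nil
		}
	}
	return nil, fmt.Errorf("no providers found for %s or its ancestors", want)
}

// stubRouter pretends only the root CID has been advertised.
type stubRouter struct{ root CID }

func (s stubRouter) FindProviders(_ context.Context, c CID) ([]string, error) {
	if c == s.root {
		return []string{"QmPeer"}, nil
	}
	return nil, nil
}

func main() {
	parents := map[CID]CID{"leaf": "mid", "mid": "root"}
	parent := func(c CID) (CID, bool) { p, ok := parents[c]; return p, ok }
	provs, err := findProvidersViaAncestors(context.Background(), stubRouter{root: "root"}, "leaf", parent)
	fmt.Println(provs, err) // [QmPeer] <nil>
}
```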
**Comment (Contributor):**

Having this framed as "do these things" rather than "get to these goals" will make this easier to scope and feel more concrete.

**Reply (Contributor Author):**

Are you referring to just "Make IPFS public DHT `put`s take <3 seconds", or to more of this section? The "take <3 seconds" framing is mostly because we don't have to do all of the listed optimizations if we hit our target with just a few of them. I listed them in order from what seems easiest to what seems hardest.

I can be more precise in this section, although I don't want to overly prescribe how this could be implemented.

**Reply (Contributor):**

Right. The "puts take <3 seconds" part seems like a "how do we know we're done" rather than a "plan for work".

**Reply (Contributor Author):**

Good news: with some lessons learned from libp2p/go-libp2p-kad-dht#709, it turns out that we have a prototype that seems to do the job and already hits under 3s.

The big wins were:

- Having large routing tables that we intermittently refresh means lookups take 0 network hops.
- Changing the number of peers we wait on from a fixed 20 to a more flexible function (e.g. wait for 30% of the 20 responses, then for a few hundred ms of no new responses) dealt with the long-tail slowness issues.
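A minimal sketch of that termination rule, assuming the put logic receives one ack per responding peer on a channel that the caller closes once every outstanding request has answered or timed out; the plumbing here is an assumption, not the go-libp2p-kad-dht API:

```go
package main

import (
	"fmt"
	"time"
)

// waitForPut collects acknowledgements from the peers a provider record
// was sent to, returning once roughly 30% of them have answered and no
// further ack has arrived for a short idle window.
func waitForPut(acks <-chan struct{}, sent int) int {
	quorum := (sent*3 + 9) / 10 // ~30% of the peers contacted, rounded up
	const idle = 300 * time.Millisecond
	got := 0
	timer := time.NewTimer(idle)
	defer timer.Stop()
	for {
		select {
		case _, ok := <-acks:
			if !ok {
				return got // every peer answered or errored out
			}
			got++
			// Restart the idle window on each response.
			if !timer.Stop() {
				select {
				case <-timer.C:
				default:
				}
			}
			timer.Reset(idle)
		case <-timer.C:
			if got >= quorum {
				return got // quorum met and the long tail has gone quiet
			}
			timer.Reset(idle) // below quorum: keep waiting
		}
	}
}

func main() {
	acks := make(chan struct{})
	go func() {
		for i := 0; i < 8; i++ { // 8 of 20 peers respond promptly
			acks <- struct{}{}
		}
		// The remaining peers are slow; the idle timer fires first.
	}()
	fmt.Println("confirmations:", waitForPut(acks, 20)) // confirmations: 8
}
```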


#### What does done look like?
_What specific deliverables should be completed to consider this project done?_

The project is done when users can see how much of their provide queue is complete and can allocate resources to increase their provide throughput until satisfied, where the required resource allocation is either not prohibitively expensive or further reducing it is deemed not worth the work.
**Comment:**
Thumbs up for "continuous transparency": seeing the state of providing at all times.


#### What does success look like?
_Success means impact. How will we know we did the right thing?_

<!--
Provide success criteria. These might include particular metrics, desired changes in the types of bug reports being filed, desired changes in qualitative user feedback (measured via surveys, etc), etc.
-->

Success means that far fewer users report issues finding content; instead, things either work for them, or the issues and questions they file are about how to decrease their resource usage for providing. Things should just work for users who have 10-100k provider records and leave their nodes on continuously.

#### Counterpoints & pre-mortem
_Why might this project be lower impact than expected? How could this project fail to complete, or fail to be successful?_

- People have other issues that the DHT put performance is just masking, which means we will not immediately be able to see the impact from this project alone
- Users will not want to spend the raw bandwidth of emitting their records even if lookups are instant
**Comment (Contributor):**

n00b question: Do any customers complain about bandwidth today?

**Reply (Contributor Author):**

Not that I've heard of (although @Stebalien might have more info), but providing is pretty heavily limited so the DHT provide bandwidth is unlikely to be a problem today.

The question is about what happens next: once putting data in the DHT is fast, there will still be users who aren't really able to use it.


Some back-of-the-envelope math:

A user with 100M provider records, where each record is 100 bytes (a large overestimate; it's more like 40, but we may want to add some more data to the records), who puts each record to 20 nodes every 24 hours uses about 200 GB/day of upload bandwidth. At AWS egress prices of around $0.09/GB, that's roughly $18/day, or over $500/month.
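The arithmetic, written out:

```latex
10^{8}\,\text{records} \times 100\,\tfrac{\text{B}}{\text{record}} \times 20\,\text{replicas}
  = 2 \times 10^{11}\,\text{B/day} \approx 200\,\text{GB/day}
\qquad
200\,\tfrac{\text{GB}}{\text{day}} \times \$0.09/\text{GB} \approx \$18/\text{day} \approx \$540/\text{month}
```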

Again this is an overestimate and might be dwarfed by the egress costs of serving the actual data or other associated costs, but it's not 0.

https://archive.org/ has 538B webpages. If every one of those webpages (the vast majority of which are presumably not normally accessed) were individually addressed and advertised in the DHT daily, it would be quite expensive.

**Reply (Contributor):**

Thanks for the explanation and back-of-envelope math; makes sense. Given this info, I'm assuming most (something like 99%?) customers won't care. I assume huge-dataset customers have other special requirements/needs/setups, and we'll have other work to make their journey delightful anyway. Given the desire to make IPFS an exceptional tool for developers, the bandwidth increase seems acceptable given the benefit.

- Decreasing the query `put` time is much harder than anticipated
- Technical work required is harder than anticipated

#### Alternatives
_How might this project’s intent be realized in other ways (other than this project proposal)? What other potential solutions can address the same need?_

These alternatives are not mutually exclusive with this proposal.

1. Focus on decreasing the number of provider records
   - e.g. Add more options for data reproviding, such as only advertising files and directories for UnixFS data
**Comment (Member):**
💯 we should add this as a new Reprovider.Strategy (thinking.. pinned+files-roots)

**Comment:**

Agreed, that would be nice. Maybe only announce a file if the node has the whole file in cache?

Maybe worth a discussion whether that should be the default, for example, for browser integrations (like Brave) and ipfs-desktop. If someone just wants to share some files, I don't see a reason to announce all chunks. Hunting for nodes which have just a few single blocks of a file because of deduplication is probably not worth the effort of connecting to them.

   - might be tricky UX and plumbing, but is something we will likely need to tackle eventually
2. Focus on decreasing the frequency of reproviding records
   - e.g. Build a second routing layer where nodes are encouraged or required to have high availability (e.g. a federated routing layer or opt-in second DHT that tracks peer availability more rigorously)
   - has possibility for high payoff, although has more risk associated with it

#### Dependencies/prerequisites
<!--List any other projects that are dependencies/prerequisites for this project that is being pitched.-->

- None

#### Future opportunities
<!--What future projects/opportunities could this project enable?-->

- Making it easier to implement alternative #1 above (enabled by `ipfs provide` and being able to download sub-DAGs when only the root node is provided)
- Vastly improved lookup performance of the delegated routers that can be used in js-ipfs (enabled by allowing users to have large routing tables)

## Required resources

#### Effort estimate
<!--T-shirt size rating of the size of the project. If the project might require external collaborators/teams, please note in the roles/skills section below).
For a team of 3-5 people with the appropriate skills:
- Small, 1-2 weeks
- Medium, 3-5 weeks
- Large, 6-10 weeks
- XLarge, >10 weeks
Describe any choices and uncertainty in this scope estimate. (E.g. Uncertainty in the scope until design work is complete, low uncertainty in execution thereafter.)
-->

Large. There is some uncertainty in how much work will be required to increase `put` performance. However, all of the changes are client-side, which makes them relatively easy to test. This could be an overestimate, since some of the changes have uncertainty that is currently being estimated at the higher end (i.e. the work in go-ipfs and go-bitswap).

#### Roles / skills needed
<!--Describe the knowledge/skill-sets and team that are needed for this project (e.g. PM, docs, protocol or library expertise, design expertise, etc.). If this project could be externalized to the community or a team outside PL's direct employment, please note that here.-->

- 3-4x go-engineers
- 1-2x go-ipfs experience
- 1-2x go-libp2p (ideally go-libp2p-kad-dht) experience
- Some input and support may be required from research.