
[WIP] Work towards improving the duplicate blocks issue #8

Closed
wants to merge 12 commits into from

Conversation

whyrusleeping
Member

Bitswap currently has a bad problem of getting duplicate blocks. That is, if we ask for X, and more than one of our peers has X, we will probably get X from each of those peers, wasting bandwidth.

In this branch, we will attempt to fix the problem (or at least, improve it).

The first thing I've added here is a test that shows the problem. From there, we can try to get the duplication factor as low as possible.

@ghost ghost assigned whyrusleeping Aug 30, 2018
@ghost ghost added the status/in-progress In progress label Aug 30, 2018
@whyrusleeping whyrusleeping force-pushed the feat/reduce-dupes branch 2 times, most recently from 287e22f to ff672d6 Compare August 30, 2018 07:14
@whyrusleeping
Member Author

Alright, got a few tests (more definitely needed) that show various fetching scenarios pretty well.

I've also got a 'first candidate solution' that improves things pretty nicely in some cases, but in some cases it increases latency significantly. Before we try to micro-optimize to fix that latency bump, I want to get more test scenarios that also have different latencies.

@whyrusleeping
Member Author

Fun things to try:

  • changing the activeWantsLimit
  • changing the number of workers
  • changing the baseTickDelay
  • changing the provSearchDelay

I'm also curious to try out the 'multi-block-bitswap' branch from @taylormike

@whyrusleeping
Member Author

The first (naive) fix I've written here just splits up the wants amongst the peers in the 'active' set, which is the set of peers we've received data from in this session. This does a decent job of load balancing, but it would be better to track which peers are faster, and weight requests to them higher.

The code also currently will send out wants to all peers until we get some peers in the 'active' set. This is problematic when we request a whole bunch of highly available data all at once, as all the wants will go out before we are able to filter anything. We can probably have a limit on the number of live wants a session can have before getting any active peer information.
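As a rough illustration of that split (this is not code from the branch; the types and names below are stand-ins), a round-robin version might look like:

```go
package main

import "fmt"

// blockKey and peerID are stand-ins for cid.Cid and peer.ID, just to keep the
// sketch self-contained.
type blockKey string
type peerID string

// splitWants distributes wanted block keys round-robin across the session's
// "active" peers (peers we've already received data from), instead of
// broadcasting every want to every peer.
func splitWants(wants []blockKey, active []peerID) map[peerID][]blockKey {
	out := make(map[peerID][]blockKey, len(active))
	if len(active) == 0 {
		return out // nothing to target yet; the caller falls back to broadcasting
	}
	for i, w := range wants {
		p := active[i%len(active)]
		out[p] = append(out[p], w)
	}
	return out
}

func main() {
	assignments := splitWants(
		[]blockKey{"b1", "b2", "b3", "b4", "b5"},
		[]peerID{"peerA", "peerB"},
	)
	fmt.Println(assignments) // e.g. map[peerA:[b1 b3 b5] peerB:[b2 b4]]
}
```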

@whyrusleeping
Member Author

@Stebalien wanna take a look at the set of test scenarios here? I could use some help brainstorming more.

@whyrusleeping
Member Author

I'm now looking into what happens when the blocks we are fetching aren't clustered onto the same peers (i.e. if one peer has the first item, they don't necessarily have the next). This is currently a pretty slow use case: fetching 100 blocks randomly interspersed across a set of ten peers takes ~50 seconds.

My current idea is to keep track of the number of blocks we receive and the number of 'broadcasts' we send out, and if that ratio gets too low, bump up the number of peers we send each want to until the ratio improves. A really simple approach brought the time for this scenario down from 50 seconds to 30 seconds, which is pretty nice. Next I'm going to look into how this affects more optimal use cases (it shouldn't actually affect them at all, because the ratio should never drop that low).
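A hedged sketch of that ratio heuristic, with made-up names and thresholds (the actual values used in this branch may differ):

```go
package main

import "fmt"

// broadcastTuner widens how many peers each want is sent to when the ratio of
// blocks received to broadcasts sent gets too low. All names and thresholds
// here are illustrative only.
type broadcastTuner struct {
	blocksReceived int
	broadcasts     int
	peersPerWant   int
	maxPeers       int
}

func (t *broadcastTuner) recordBlock()     { t.blocksReceived++ }
func (t *broadcastTuner) recordBroadcast() { t.broadcasts++ }

// adjust bumps peersPerWant while the received/broadcast ratio is poor and
// shrinks it back once things improve.
func (t *broadcastTuner) adjust() {
	if t.broadcasts == 0 {
		return
	}
	ratio := float64(t.blocksReceived) / float64(t.broadcasts)
	switch {
	case ratio < 0.5 && t.peersPerWant < t.maxPeers:
		t.peersPerWant++
	case ratio > 2 && t.peersPerWant > 1:
		t.peersPerWant--
	}
}

func main() {
	t := &broadcastTuner{peersPerWant: 1, maxPeers: 4}
	for i := 0; i < 4; i++ {
		t.recordBroadcast()
	}
	t.recordBlock() // only 1 block back for 4 broadcasts: ratio 0.25
	t.adjust()
	fmt.Println(t.peersPerWant) // 2
}
```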

@whyrusleeping
Member Author

Discovering something mildly annoying... it seems the requests are falling back to broadcast even in cases where they really shouldn't (like when everyone has every block). Gonna look into this...

@whyrusleeping
Member Author

Meh, that was a bug in my new code. Still seeing an unfortunate number of broadcasts, but it's more reasonable now.

Member

@Stebalien Stebalien left a comment

Couple of nits while familiarizing myself with the code.

@@ -47,6 +48,8 @@ type impl struct {

	// inbound messages from the network are forwarded to the receiver
	receiver Receiver

	stats NetworkStats
Member

I believe this needs to go at the top of the struct as it needs to be 64-bit aligned:

The first word in a variable or in an allocated struct, array, or slice can be relied upon to be 64-bit aligned.
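Concretely, on 32-bit platforms the 64-bit sync/atomic operations require 64-bit alignment, and only the first word of an allocated struct is guaranteed to have it. A minimal sketch of the suggested field ordering (the counter field names here are assumed, not necessarily those in this PR):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// NetworkStats holds counters that are updated with sync/atomic. Field names
// are assumed for illustration.
type NetworkStats struct {
	MessagesSent  uint64
	MessagesRecvd uint64
}

// Receiver is a stand-in for the real network receiver interface.
type Receiver interface{}

type impl struct {
	// stats comes first so its uint64 fields stay 64-bit aligned for
	// sync/atomic, even on 32-bit platforms where only the first word of an
	// allocated struct is guaranteed to be aligned.
	stats NetworkStats

	// inbound messages from the network are forwarded to the receiver
	receiver Receiver
}

func main() {
	n := &impl{}
	atomic.AddUint64(&n.stats.MessagesSent, 1)
	fmt.Println(atomic.LoadUint64(&n.stats.MessagesSent)) // 1
}
```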

Member Author

The compiler should really catch that for us...

Member

It should, but it doesn't...

@@ -153,6 +174,10 @@ func (s *Session) interestedIn(c *cid.Cid) bool {
const provSearchDelay = time.Second * 10

func (s *Session) addActivePeer(p peer.ID) {
	if s.liveWantsLimit == broadcastLiveWantsLimit {
		s.liveWantsLimit = targetedLiveWantsLimit
	}
Member

Not a fan of using "constant" variables like this... how about a wantBudget() function (or something like that) that returns len(s.liveWants) - ((len(s.activePeers) > 0) ? targetedLiveWantsLimit : broadcastLiveWantsLimit).
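Since Go has no ternary operator, the suggested helper might look roughly like the following sketch. The constant and field names are taken from the snippet above, but their values and the surrounding Session definition are assumptions, and the subtraction is written as limit minus live wants so the result reads as remaining headroom:

```go
package main

import "fmt"

// Assumed stand-ins for the limits and session fields referenced in the review
// comment; the real definitions live in session.go and may differ.
const (
	broadcastLiveWantsLimit = 4
	targetedLiveWantsLimit  = 32
)

type Session struct {
	liveWants   map[string]struct{}
	activePeers map[string]struct{}
}

// wantBudget returns how many additional live wants the session may issue: a
// small budget while we're still broadcasting, a larger one once we have
// active peers to target.
func (s *Session) wantBudget() int {
	live := len(s.liveWants)
	limit := broadcastLiveWantsLimit
	if len(s.activePeers) > 0 {
		limit = targetedLiveWantsLimit
	}
	if limit <= live {
		return 0
	}
	return limit - live
}

func main() {
	s := &Session{
		liveWants:   map[string]struct{}{"b1": {}, "b2": {}},
		activePeers: map[string]struct{}{},
	}
	fmt.Println(s.wantBudget()) // 2: broadcast budget of 4 minus 2 live wants
}
```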

Member Author

yeah, totally.

@Stebalien
Copy link
Member

Summarizing our offline discussion, I'm not sure if the dupl variable is really the right approach. I'm expecting nodes to have merkle-chains from some common root to (reasonably) complete subdags (i.e., they'll traverse a path and then read out an entire subdag (maybe up to some depth)).

[diagram: peers holding a merkle-chain from a common root down to complete subdags]

With the dupl approach:

  1. If peers have random blocks, dupl should be stable.
  2. If peers have all the blocks, dupl should stay ~1.
  3. If peers have blocks according to my diagram above, dupl should grow quite large (and fluctuate) as we randomly query peers from a set where some peers simply won't have anything we're looking for.

Without remembering anything about the DAG structure, about the only solution I can think of is to somehow split peers into "active" and "inactive" groups. That is, if a peer frequently doesn't have blocks we're looking for, we drop them to the inactive group; if we start running low on active peers, we start pulling in inactive peers.

One way to do this is to simply sort peers by some frecency (frequency/recency) metric that tracks how reliably they deliver blocks. We can then ask for blocks from the top N (where N can float based on how many duplicate blocks we appear to be receiving).

This should work fine for, e.g., sequential access. However, it won't work well with, e.g., ipfs get as that does random access.
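A possible shape for such a frecency score, purely as a sketch (the names, half-life, and decay function are all assumptions):

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

type peerID string

// frecencyTracker scores peers by how often and how recently they have
// delivered blocks; older deliveries decay exponentially.
type frecencyTracker struct {
	scores   map[peerID]float64
	lastSeen map[peerID]time.Time
	halfLife time.Duration
}

func newFrecencyTracker(halfLife time.Duration) *frecencyTracker {
	return &frecencyTracker{
		scores:   make(map[peerID]float64),
		lastSeen: make(map[peerID]time.Time),
		halfLife: halfLife,
	}
}

// recordDelivery decays the peer's existing score and adds one for this block.
func (f *frecencyTracker) recordDelivery(p peerID, now time.Time) {
	if last, ok := f.lastSeen[p]; ok {
		elapsed := now.Sub(last)
		f.scores[p] *= math.Exp2(-float64(elapsed) / float64(f.halfLife))
	}
	f.scores[p]++
	f.lastSeen[p] = now
}

// topN returns the n best-scoring peers to target first.
func (f *frecencyTracker) topN(n int) []peerID {
	peers := make([]peerID, 0, len(f.scores))
	for p := range f.scores {
		peers = append(peers, p)
	}
	sort.Slice(peers, func(i, j int) bool {
		return f.scores[peers[i]] > f.scores[peers[j]]
	})
	if len(peers) > n {
		peers = peers[:n]
	}
	return peers
}

func main() {
	f := newFrecencyTracker(30 * time.Second)
	now := time.Now()
	f.recordDelivery("peerA", now)
	f.recordDelivery("peerA", now.Add(time.Second))
	f.recordDelivery("peerB", now)
	fmt.Println(f.topN(1)) // [peerA]
}
```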


Ideally, we'd use the DAG structure. Looking at this structure, I'd predict that a peer is more likely to have a node if we know they have either sibling or uncle nodes and less likely if we know they don't have sibling nodes.

  1. If a peer has many siblings, they probably performed some form of list, cat, etc. operation.
  2. If we know a peer is missing siblings, they probably performed a traversal through the DAG. Generally, I'd treat this as stronger evidence than (1). It's possible to have a few siblings by traversing a few paths through some node but it's unlikely to be missing siblings after reading/listing an entire file/directory.
  3. If we know a peer has uncles of the node in question, but we have little information on the siblings, we can infer that they're likely to have the entire tree. IMO, uncles give the least information, but it's better than nothing.

In some cases, we could probably also learn from cousins but I'm not sure if there's anything we can learn from cousins that we can't learn from uncles without assuming a balanced tree.

So, without changing the interfaces/protocol, we can probably keep a small, in-memory, shared table (cache?) mapping cids we may want (children of nodes we've fetched) to peers that likely have them (not sure how we'll keep track/update how likely they are to have them).

Given this information, I'd:

  1. Sort all of our peers (not just those in the session) by how likely they are to have the content we're looking for.
  2. Take the top non-overloaded peers (we should really be tracking load per-peer) until we hit some threshold probability of receiving the block on the first try.
  3. If we run out of peers to ask, we can then fall back on broadcasting to our session.

Basically, we'd:

  1. Use the prediction table when we have enough information to predict which peers likely have which blocks based on the DAG structure.
  2. Fall back on the session to try to find peers when the DAG structure doesn't give us enough information.
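A toy version of the shared prediction table described above might look like this; everything here (names, weights, thresholds) is assumed for illustration and is not part of this patch:

```go
package main

import (
	"fmt"
	"sync"
)

type blockKey string
type peerID string

// predictionTable is a small shared cache mapping block keys we may want
// (children of nodes we've fetched) to peers that are likely to have them,
// with a rough likelihood score bumped when DAG structure (siblings, uncles)
// suggests a peer holds the surrounding subtree.
type predictionTable struct {
	mu     sync.Mutex
	likely map[blockKey]map[peerID]float64
}

func newPredictionTable() *predictionTable {
	return &predictionTable{likely: make(map[blockKey]map[peerID]float64)}
}

// suggest records evidence (e.g. "p sent us a sibling of k") with a weight.
func (t *predictionTable) suggest(k blockKey, p peerID, weight float64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.likely[k] == nil {
		t.likely[k] = make(map[peerID]float64)
	}
	t.likely[k][p] += weight
}

// candidates returns peers whose accumulated score for k passes the threshold;
// if none do, the caller falls back to the session's broadcast path.
func (t *predictionTable) candidates(k blockKey, threshold float64) []peerID {
	t.mu.Lock()
	defer t.mu.Unlock()
	var out []peerID
	for p, score := range t.likely[k] {
		if score >= threshold {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pt := newPredictionTable()
	pt.suggest("child1", "peerA", 1.0) // peerA sent us child1's sibling
	pt.suggest("child1", "peerA", 0.5) // ...and an uncle
	pt.suggest("child1", "peerB", 0.5)
	fmt.Println(pt.candidates("child1", 1.0)) // [peerA]
}
```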

	})
	t.Run("10Nodes-OnePeerPerBlock-UnixfsFetch", func(t *testing.T) {
		subtestDistributeAndFetch(t, 10, 100, onePeerPerBlock, unixfsFileFetch)
	})
Member

We should probably have a test that looks more like reading entire files out of a large directory structure. I wonder if we should try building an actual dag.

@whyrusleeping
Copy link
Member Author

whyrusleeping commented Sep 7, 2018

Here are the results so far:

| Test | Run | Dups | MsgRecv | MsgSent | Time |
| --- | --- | --- | --- | --- | --- |
| 10Nodes-AllToAll-AllConcurrent | 1 | 788 | 888 | 1004 | 588738086 |
| 10Nodes-AllToAll-AllConcurrent | 2 | 212 | 312 | 508 | 583817218 |
| 10Nodes-AllToAll-BatchFetchBy10 | 1 | 697 | 798 | 650 | 318375172 |
| 10Nodes-AllToAll-BatchFetchBy10 | 2 | 32 | 132 | 203 | 308959055 |
| 10Nodes-AllToAll-BigBatch | 1 | 741 | 841 | 746 | 593989359 |
| 10Nodes-AllToAll-BigBatch | 2 | 32 | 132 | 211 | 576777696 |
| 10Nodes-AllToAll-OneAtATime | 1 | 791 | 891 | 1545 | 2317195336 |
| 10Nodes-AllToAll-OneAtATime | 2 | 8 | 108 | 216 | 2253939404 |
| 10Nodes-AllToAll-UnixfsFetch | 1 | 729 | 830 | 654 | 156870013 |
| 10Nodes-AllToAll-UnixfsFetch | 2 | 8 | 108 | 167 | 142264026 |
| 10Nodes-OnePeerPerBlock-BigBatch | 1 | 0 | 100 | 691 | 2566468029 |
| 10Nodes-OnePeerPerBlock-BigBatch | 2 | 0 | 100 | 818 | 13188862723 |
| 10Nodes-OnePeerPerBlock-OneAtATime | 1 | 0 | 100 | 1697 | 6751956059 |
| 10Nodes-OnePeerPerBlock-OneAtATime | 2 | 0 | 100 | 1604 | 8987871527 |
| 10Nodes-OnePeerPerBlock-UnixfsFetch | 1 | 0 | 100 | 483 | 1334532195 |
| 10Nodes-OnePeerPerBlock-UnixfsFetch | 2 | 0 | 100 | 646 | 2486414479 |
| AllToAll-BigBatch | 1 | 96 | 196 | 167 | 583172799 |
| AllToAll-BigBatch | 2 | 4 | 104 | 125 | 576459570 |
| AllToAll-OneAtATime | 1 | 100 | 200 | 380 | 2297902493 |
| AllToAll-OneAtATime | 2 | 1 | 101 | 202 | 2277471653 |
| Overlap1-OneAtATime | 1 | 0 | 100 | 245 | 2835544709 |
| Overlap1-OneAtATime | 2 | 0 | 100 | 229 | 10793166952 |
| Overlap2-BatchBy10 | 1 | 33 | 133 | 161 | 883447696 |
| Overlap2-BatchBy10 | 2 | 2 | 102 | 124 | 3226089490 |
| Overlap3-AllConcurrent | 1 | 33 | 133 | 183 | 1343921547 |
| Overlap3-AllConcurrent | 2 | 32 | 132 | 209 | 1683851304 |
| Overlap3-BatchBy10 | 1 | 33 | 133 | 142 | 867415276 |
| Overlap3-BatchBy10 | 2 | 2 | 102 | 121 | 3815798655 |
| Overlap3-BigBatch | 1 | 34 | 134 | 178 | 1339814193 |
| Overlap3-BigBatch | 2 | 2 | 102 | 160 | 6131580369 |
| Overlap3-OneAtATime | 1 | 34 | 134 | 381 | 2274897167 |
| Overlap3-OneAtATime | 2 | 33 | 133 | 396 | 4503270140 |
| Overlap3-UnixfsFetch | 1 | 31 | 131 | 170 | 724474001 |
| Overlap3-UnixfsFetch | 2 | 1 | 101 | 120 | 1318204376 |

@whyrusleeping
Member Author

Several of the tests are slower now, which is mildly disconcerting. But they have fewer duplicate blocks and lower message counts across the board, which is a win in my book.

@Stebalien
Member

@whyrusleeping we should test this against a gx install (leaving this here so we don't forget, I'll try to do this as soon as I get a non-mobile internet connection).

@bonekill

Not sure if this would work or not, ignore it if it is silly.

Would sorting peers into latency groups and then altering the behavior for low-latency peers (~<20ms) to a 'Do you have?' vs. a 'Send if you have!' strategy work?
The performance impact of an extra round trip with the low-latency peers should be manageable, while offering a much larger degree of control over incoming data.
If you get positive hits back on the query, then you can just request the block from one of the respondents. If you don't get a response back, or the low-latency group is too small to justify the extra round trip, continue with other strategies.

While this does not solve the problem, it might help mitigate it.
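To make the suggestion concrete, here is one hedged sketch of how peers could be bucketed by latency and assigned a strategy. The 'do you have?' query itself is hypothetical (bitswap has no such message in this PR), and the 20ms cut-off is just the number floated above:

```go
package main

import (
	"fmt"
	"time"
)

type peerID string

// strategy describes how we'd talk to a peer under the hypothetical
// "do you have?" extension: low-latency peers get a cheap query round trip
// first, everyone else gets the usual "send it if you have it".
type strategy int

const (
	askFirst   strategy = iota // hypothetical have-query, then request from one responder
	sendIfHave                 // current bitswap behaviour
)

// chooseStrategy groups peers by measured latency and only pays the extra
// round trip if the fast group is big enough to matter.
func chooseStrategy(latency map[peerID]time.Duration, minGroup int) map[peerID]strategy {
	out := make(map[peerID]strategy, len(latency))
	var fast []peerID
	for p, l := range latency {
		out[p] = sendIfHave
		if l < 20*time.Millisecond {
			fast = append(fast, p)
		}
	}
	if len(fast) >= minGroup {
		for _, p := range fast {
			out[p] = askFirst
		}
	}
	return out
}

func main() {
	lat := map[peerID]time.Duration{
		"peerA": 5 * time.Millisecond,
		"peerB": 8 * time.Millisecond,
		"peerC": 120 * time.Millisecond,
	}
	fmt.Println(chooseStrategy(lat, 2)) // peerA and peerB get askFirst (0), peerC stays sendIfHave (1)
}
```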

@whyrusleeping
Member Author

@bonekill Yeah, the idea of having a 'do you have' protocol extension is interesting. There is also the question of 'will you send this to me?', which I think is the more important one. Note that this is only really important in the case where peers don't have the data. In the optimal case, with many peers who all have the data, there is no reason to add that extra overhead.

@bonekill

Also, back to the original idea you are working on: you could utilize the incoming requests as well to build up the probability of nodes having data.
Ex. if peer A recently requested a block (or a related block) that you did not have at the time but now want, it's likely that peer A now possesses the data.
This helps you build the 'active' set or a probability model passively, without having to make requests of your own, provided the node has had a chance to listen for a bit.

@whyrusleeping whyrusleeping force-pushed the feat/reduce-dupes branch 2 times, most recently from 08464d3 to 5aa8341 Compare September 17, 2018 22:36
@whyrusleeping
Member Author

@bonekill Yeah, @Stebalien and I were chatting about having a generalized 'who might have X' subsystem that could be informed by many different inputs. I think that will help significantly moving forward.

In any case, I'd like to get this PR merged roughly as is. @Stebalien do you think you could give me a hand?

@Stebalien
Member

If this is the correct way to use a session, I've tried this patch when listing wikipedia on IPFS and, unfortunately, it still appeared to get 1:1 duplicate blocks and had a 2:1 upload-to-download speed ratio.

We should do some more thorough tests to figure out what's actually going on. @hannahhoward?

@Stebalien
Member

Still not seeing any improvement.

@schomatis
Contributor

@schomatis have you for sure seen the PR get used linked to the main repo?

Yes. Besides updating go-ipfs-blocksutil in the main go-ipfs repo, you'll also need to update it in the go-ipfs-exchange-offline repo you're using (linked or not). To be sure, go to go-ipfs, run gx deps --tree, and check whether somebody else is still using QmUf9x (the old 0.1.0 version).

@Stebalien
Member

Blocked on passing the new tests. The answer is probably to just skip them or relax them but it still needs to pass them.

@Stebalien
Member

So, I finally got this to work by just bubbling the changes (hopefully correctly, I didn't get any dupes). Unfortunately, the results are not encouraging. While pinning wikipedia:

  1. Initially, I noticed that the upload bandwidth was significantly reduced.
  2. After a few seconds, download bandwidth drops to match upload bandwidth.
  3. I've seen no improvements in the number of duplicate blocks received (actually, it looked slightly worse but was still about 1:1 in all tests)

As for bandwidth, I tested 4 variants:

  1. Current: ipfs master
  2. Current + Large Wantlist: increase wantlist to 64, parallel fetch concurrency to 32
  3. Patch: this patch as-is.
  4. Patch + Large Wantlist: increase broadcast size to 16, targeted want limit to 64, parallel fetch concurrency to 32.

  • Current: ~217KiB/s
  • Current + Large Wantlist: ~272KiB/s (starts much higher and drops off, probably due to upload)
  • Patch: ~47KiB/s
  • Patch + Large Wantlist: ~93KiB/s

(all of these start off slightly higher and then drop down, except 2, which starts off significantly higher)

So, maybe I've just been testing this wrong but, at the very minimum, we need decent benchmarks showing some improvement before merging this.

@hannahhoward
Contributor

@Stebalien

I tend to agree. I think the first priority is real world benchmarks, so we have a more systematic way to test whether we're making a real world difference.

On the question of merging -- I am also skeptical of a large change that adds complexity with a seemingly negative effect on transfer speed. I think once we have real world benchmarks, it's worth revisiting this PR to see if there's something we're missing, because maybe there is just a small change needed to unlock the improvements.

I'm going to start looking at real world benchmark tests today. I'd recommend we either close this PR (it's still archived and can be re-opened later) or add a WIP label to indicate it needs further work.

Also the tests in this PR are valuable on their own (I think) and I wonder if they should be split off.

@Stebalien
Member

I'm going to start looking at real world benchmark tests today. I'd recommend we either close this PR (it's still archived and can be re-opened later) or add a WIP label to indicate it needs further work.

I'll add a WIP.

Also the tests in this PR are valuable on their own (I think) and I wonder if they should be split off.

I agree. Are you volunteering 😄?

@Stebalien Stebalien changed the title Work towards improving the duplicate blocks issue [WIP] Work towards improving the duplicate blocks issue Oct 23, 2018
@schomatis
Contributor

3. I've seen no improvements in the number of duplicate blocks received (actually, it looked slightly worse but was still about 1:1 in all tests)

Yes, what I meant with my previous (not very clear) comment was that the broadcast system (through the liveWants slice) still seems to largely dominate the requests made in the bitswap session. This patch just avoids sending the new wanted blocks directly to the broadcast want list (bcwl), but those blocks will end up there anyway every few seconds:

go-bitswap/session.go

Lines 209 to 218 in a2ce7a2

	case <-s.tick.C:
		live := make([]cid.Cid, 0, len(s.liveWants))
		now := time.Now()
		for c := range s.liveWants {
			live = append(live, c)
			s.liveWants[c] = now
		}

		// Broadcast these keys to everyone we're connected to
		s.bs.wm.WantBlocks(ctx, live, nil, s.id)

How long that interval is will depend on how fast new (requested) blocks arrive, and that in turn depends heavily on the broadcast requests: the targeted requests to specific peers through the duplicate-block system are capped at small absolute values (3-10), while the broadcast scales with the number of peers, an order of magnitude (or more) larger than the targeted requests.

This is not reflected in the local tests because, with the fake 10 ms delay in the VirtualNetwork, the broadcast system doesn't have time to kick in. As a simple example, reducing the broadcast timer to a (also fake and impractical, but) comparable 10 ms delay makes the duplicate blocks increase greatly:

go-bitswap/session.go

Lines 164 to 171 in a2ce7a2

func (s *Session) resetTick() {
	if s.latTotal == 0 {
		s.tick.Reset(provSearchDelay)
	} else {
		avLat := s.latTotal / time.Duration(s.fetchcnt)
		s.tick.Reset(s.baseTickDelay + (3 * avLat))
	}
}

I think that more insightful statistics (beyond the useful duplicate block count) will be needed to understand the interaction between the two request systems, beyond black-box testing of dl/ul bandwidth.
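As one possible shape for such statistics (names assumed, not part of this patch), a session could count targeted and broadcast wants separately so the two paths can be compared directly:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sessionStats separates the two request paths described above so their
// relative contribution is visible, instead of inferring it from dl/ul
// bandwidth. Field names are illustrative only; the uint64 fields sit at the
// top of the struct to keep them 64-bit aligned for sync/atomic.
type sessionStats struct {
	targetedWants  uint64 // wants sent to specific peers via the dup-avoidance path
	broadcastWants uint64 // wants re-sent to everyone on the tick timer
	blocksReceived uint64
	dupBlocks      uint64
}

func (s *sessionStats) recordTargeted(n int)  { atomic.AddUint64(&s.targetedWants, uint64(n)) }
func (s *sessionStats) recordBroadcast(n int) { atomic.AddUint64(&s.broadcastWants, uint64(n)) }

// broadcastShare reports what fraction of all wants went out as broadcasts.
func (s *sessionStats) broadcastShare() float64 {
	t := atomic.LoadUint64(&s.targetedWants)
	b := atomic.LoadUint64(&s.broadcastWants)
	if t+b == 0 {
		return 0
	}
	return float64(b) / float64(t+b)
}

func main() {
	var st sessionStats
	st.recordTargeted(10)
	st.recordBroadcast(90)
	fmt.Printf("broadcast share: %.0f%%\n", st.broadcastShare()*100) // broadcast share: 90%
}
```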

@whyrusleeping
Member Author

@schomatis Yeah, we should try ramping up the baseTickDelay and provSearchDelay values. We should also set up a more 'real world' test case, and try to get some real numbers and reproducibility out of them.

hannahhoward added a commit that referenced this pull request Nov 14, 2018
- add a delay generator that simulates real world latencies one might encounter on the internet
- modify virtual network to accept different latencies for different peers based on using NextWaitTime on the passed delay
- modify dup_blocks_test subtestDistributeAndFetch to accept a custom delay
- Add a real world benchmark that simulates the kinds of problems one might encounter bitswapping with a long-lived session and a large swarm of peers with a real world latency distribution (which causes #8 not to function well in practice)
@eingenito

eingenito commented Dec 11, 2018

I'm going to close this as it has been incorporated into @hannahhoward's work in #26, #28, #29, and #30, and is being tracked in #27.

Of course feel free to reopen if I got that wrong.

@eingenito eingenito closed this Dec 11, 2018
@ghost ghost removed the status/in-progress In progress label Dec 11, 2018