
Limit Push Graphsync requests #243

Closed · wants to merge 7 commits

Conversation

hannahhoward
Collaborator

Goals

  • A provider should be able to limit simultaneous requests for both STORAGE and RETRIEVAL deals
  • A provider should be protected from a very active client who sends lots and lots of storage deals

Implementation

  • When Graphsync was written, it was assumed that requests are always initiated by a real human user, meaning rate limiting outgoing Graphsync requests is both unnecessary and actually counterproductive.
  • Data transfer's Push mode complicates this assumption. A data transfer push request, while validated with a validator, is initiated by a remote peer, and triggers a Graphsync request. In effect, it's an RPC call to initiate a graphsync request
  • As such, it's vulnerable to DDOS attacks from what appears to be the outgoing request side of graphsync
  • This PR employs the same mechanism we've used elsewhere -- the peertaskqueue with a set number of simultaneous workers -- to rate limit graphsync requests triggered from remote peers sending Push requests through go-data-transfer
  • In keeping with data transfers general architecture of using the Transport layer to adapt a protocol library to the needs of data transfers control layers, the queue of requests is implemented at the Graphsync transport layer, on the theory that rate limiting is going to be different depending on the protocol
  • Concretely, we introduce a RequestQueue that manages requests with the PeerTaskQueue -- this both limits the number of simultaneous requests for Graphsync AND load balances among peers, prioritizing everyone getting one request served over a single demanding peer getting lots of requests served (see the sketch after this list)
  • At the graphsync transport layer, when opening a channel, if the Graphsync request is initiated by the other peer (not us), we put it in the request queue, and it executes when there is an available worker in the queue
  • We still execute graphsync requests initiated by our peer immediately -- we wouldn't want to rate limit say initiating 100 retrievals with 100 different providers.
  • When deferring a request, we dispatch a "TransferQueued" event, similar to how we do for the side receiving an incoming Graphsync request.
  • There's a test to demonstrate the rate limiting, and the functioning of the peertaskqueue at this point is generally well proven through its use in Bitswap and go-graphsync
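
As a rough illustration of the shape described above, here is a minimal sketch of such a queue built on go-peertaskqueue with a fixed pool of workers. The Request interface, names, and wiring are assumptions for the example, not the PR's actual requestqueue package:

```go
package requestqueue

import (
	"context"

	"github.com/ipfs/go-peertaskqueue"
	"github.com/ipfs/go-peertaskqueue/peertask"
	"github.com/libp2p/go-libp2p-core/peer"
)

// Request is a hypothetical unit of deferred work (e.g. "open this graphsync request").
type Request interface {
	Execute(ctx context.Context)
}

// RequestQueue rate limits deferred requests with a fixed worker pool, while the
// underlying PeerTaskQueue load balances across the peers that queued them.
type RequestQueue struct {
	ptq        *peertaskqueue.PeerTaskQueue
	workSignal chan struct{}
}

func New() *RequestQueue {
	return &RequestQueue{
		ptq:        peertaskqueue.New(),
		workSignal: make(chan struct{}, 1),
	}
}

// PushRequest queues a request attributed to the remote peer that triggered it.
// In a real implementation the Topic would be something comparable like the
// channel ID, so that duplicate pushes for the same channel merge into one task.
func (rq *RequestQueue) PushRequest(p peer.ID, req Request) {
	rq.ptq.PushTasks(p, peertask.Task{Topic: req, Priority: 0, Work: 1})
	select {
	case rq.workSignal <- struct{}{}:
	default:
	}
}

// Start launches maxWorkers workers; at most maxWorkers requests run at once.
func (rq *RequestQueue) Start(ctx context.Context, maxWorkers int) {
	for i := 0; i < maxWorkers; i++ {
		go rq.worker(ctx)
	}
}

func (rq *RequestQueue) worker(ctx context.Context) {
	for {
		p, tasks, _ := rq.ptq.PopTasks(1)
		if len(tasks) == 0 {
			// nothing queued: sleep until a push signals new work or we shut down
			select {
			case <-ctx.Done():
				return
			case <-rq.workSignal:
			}
			continue
		}
		for _, task := range tasks {
			task.Topic.(Request).Execute(ctx)
		}
		rq.ptq.TasksDone(p, tasks...)
	}
}
```

Pushing tasks keyed by the initiating peer is what gives the round-robin fairness between peers; the fixed worker count is what caps concurrency.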

For discussion

  • Should we dispatch a TransferRequestQueued event for outgoing graphsync requests initiated locally? This would close the loop so that every single data transfer on every side got this event
  • We do not put restarts in the queue for now -- the assumption is we want these to run immediately. As currently written, this does free the queue when the original request gets cancelled. Can't tell if this is a problem.
  • There are potential additional races introduced here due to more go-routines and asynchronous behavior -- it is probably necessary to test this heavily before promoting all the way to Lotus master.

@codecov-commenter

codecov-commenter commented Aug 24, 2021

Codecov Report

Merging #243 (6427dba) into master (e69ae98) will decrease coverage by 0.01%.
The diff coverage is 71.42%.


@@            Coverage Diff             @@
##           master     #243      +/-   ##
==========================================
- Coverage   67.90%   67.88%   -0.02%     
==========================================
  Files          24       25       +1     
  Lines        3050     3117      +67     
==========================================
+ Hits         2071     2116      +45     
- Misses        624      640      +16     
- Partials      355      361       +6     
Impacted Files Coverage Δ
transport/graphsync/graphsync.go 77.58% <67.50%> (-0.34%) ⬇️
transport/graphsync/requestqueue/requestqueue.go 75.00% <75.00%> (ø)
impl/events.go 75.00% <100.00%> (-0.09%) ⬇️
network/libp2p_impl.go 68.38% <0.00%> (-2.95%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Collaborator

@aarshkshah1992 left a comment

Mostly questions.

@@ -27,6 +28,8 @@ var log = logging.Logger("dt_graphsync")
// cancelled.
const maxGSCancelWait = time.Second

const defaultMaxInProgressPushRequests = 6
Collaborator

@hannah

Do we plan to bubble this config up to Lotus so we can set it to a much higher number, just like we do for the Graphsync simultaneous transfer requests? The default seems low, just like we've seen with Graphsync.

Collaborator Author

yes that is the intent -- and probably we'll just apply the same MaxSimultaneousTransfers setting.

return t
}

// Start starts the request queue for incoming push requests that trigger outgoing
// graphsync requests
func (t *Transport) Start(ctx context.Context) {
Collaborator

So, we'll need to call this Start in Lotus after constructing a new Transport, right?

Collaborator Author

yes
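
For illustration, the wiring implied here would look roughly like this (the constructor name and arguments are assumptions; only the Start(ctx) call comes from this PR's diff):

```go
// hypothetical Lotus-side wiring; the gstransport.NewTransport arguments are assumed
tp := gstransport.NewTransport(peerID, gsExchange) // construct the transport as today
tp.Start(ctx)                                      // new: start the deferred push-request workers
// ...then hand tp to the data-transfer manager as before
```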

@@ -149,19 +171,46 @@ func (t *Transport) OpenChannel(ctx context.Context,
exts = append(exts, doNotSendExt)
}

// Start tracking the data-transfer channel
Collaborator

nit: revert.

Collaborator Author

resolved

ch := t.trackDTChannel(channelID)

// if we initiated this data transfer or the channel is open, run it immediately, otherwise
// put it in the request queue to rate limit DDOS attacks for push transfers
if channelID.Initiator == t.peerID || ch.isOpen {
Collaborator

@aarshkshah1992 Aug 25, 2021

@hannahhoward Why do we need the ch.isOpen condition here?

@dirkmc What does ch.Open mean semantically for the data-transfer? What is the difference between two existing channels where one has ch.Open=true and one has ch.Open=false?

Contributor

Once graphsync has called the outgoing requests hook, the channel is open (isOpen is true).

I think the idea here is that if the transfer is restarted, data-transfer will call OpenChannel again. So we don't want to put restarts at the back of the queue (OpenChannel will cancel the existing request and start a new one in its place)

Contributor

We may need to do a bit more refactoring here.
In the case where a transfer is restarted while the request is still in the queue we probably want to just replace the original request.

Collaborator

@dirkmc

You are right. So, if a queued transfer gets restarted, this PR will add a duplicate request to the queue, right?

Contributor

Yes, for push requests I think it will add a duplicate request to the queue.

Collaborator Author

We actually have a separate problem -- it will not, ever, even if the request is in progress, due to the way go-peertaskqueue works (merging tasks). Honestly, I need to figure this one out a bit more deeply.

@@ -906,6 +956,13 @@ func (c *dtChannel) open(ctx context.Context, chid datatransfer.ChannelID, dataS
// Mark the channel as open and save the Graphsync request key
c.isOpen = true
c.gsKey = &gsKey
if c.startPaused {
err := c.gs.PauseRequest(c.gsKey.requestID)
Collaborator

@aarshkshah1992 Aug 25, 2021

@hannahhoward - What is the need for this new code blob? Can you please add some docs to this?

Collaborator Author

if someone calls pause before the request gets to the front of the queue -- that's what it's for.

Collaborator Author

added comment

@@ -980,6 +1037,11 @@ func (c *dtChannel) pause() error {
c.lk.Lock()
defer c.lk.Unlock()

if c.gsKey == nil {
Collaborator

So, this is to support the case where PauseChannel is called before we've started the graphsync transfer for a channel that's already opened, right ?

Collaborator Author

yes

Collaborator Author

added comment
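
Tying the two checks in these threads together, here is a condensed, self-contained sketch of the pause-before-open behaviour. The types below are simplified stand-ins assumed for illustration, not the PR's dtChannel:

```go
package sketch

import "sync"

// pauser stands in for the graphsync interface's PauseRequest capability.
type pauser interface {
	PauseRequest(requestID int) error
}

// channelSketch is a simplified stand-in for dtChannel.
type channelSketch struct {
	lk          sync.Mutex
	gs          pauser
	requestID   *int // nil until the deferred graphsync request has actually been opened
	startPaused bool
}

// pause mirrors the gsKey == nil check above: if the request is still waiting in the
// queue there is nothing to pause yet, so just remember that it should start paused.
func (c *channelSketch) pause() error {
	c.lk.Lock()
	defer c.lk.Unlock()
	if c.requestID == nil {
		c.startPaused = true
		return nil
	}
	return c.gs.PauseRequest(*c.requestID)
}

// open mirrors the startPaused check above: once the queued request finally runs,
// apply any pause that was requested while it was still waiting.
func (c *channelSketch) open(requestID int) error {
	c.lk.Lock()
	defer c.lk.Unlock()
	c.requestID = &requestID
	if c.startPaused {
		c.startPaused = false
		return c.gs.PauseRequest(requestID)
	}
	return nil
}
```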

@@ -214,7 +214,7 @@ func TestRestartPush(t *testing.T) {

// WAIT FOR DATA TRANSFER TO FINISH -> SHOULD WORK NOW
// we should get 2 completes
_, _, err = waitF(10*time.Second, 2)
_, _, err = waitF(100000*time.Second, 2)
Contributor

Looks like maybe a change got added while debugging?

Collaborator Author

oops will fix


// Start tracking the data-transfer channel

// Open a graphsync request to the remote peer
req, err := ch.open(ctx, channelID, dataSender, root, stor, doNotSendCids, exts)
Contributor

Suggest passing the ch object, rather than passing channelID here and then calling getDTChannel(channelID) in runDeferredRequest

Collaborator Author

what's the value of that? Just seems like putting more data on the queue.

Contributor

The extra memory is only the size of a pointer - the value is that you don't have to call getDTChannel(channelID) because you just keep a reference to the object. Not a biggie though I think it's fine as is.

}
}

func (orq *RequestQueue) processRequestWorker(ctx context.Context) {
Contributor

In this model there are many threads that all get "woken up" each time a task is added to the queue, and then all try to pop tasks from the queue, but most of the time all but one will get an empty task list.

I wonder if it would be possible to change the model so that there are initially zero go routines running. When a task is added to the queue a new go routine is started to process it. Each time a new go routine is needed it's started, up to some maximum number of go routines. As tasks complete, if the go routine is no longer needed it stops, all the way down to zero go routines.

This would allow us to use the minimum number of required go routines, and we would no longer need the Start() method.

I'm not sure if it's possible, but it would be nice if it is possible.

Collaborator Author

It would be, maybe. I'd rather ship that as an improvement. The logic here is taken directly from go-graphsync, which makes me more comfortable about introducing a new concurrency element, in that it's at least been battle-tested there.
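
A rough, self-contained sketch of the model dirkmc describes above -- zero goroutines at rest, spawning up to a cap, shrinking back to zero when idle. All names here are hypothetical, and note it drops the per-peer fairness the PeerTaskQueue provides, which a real version would still need:

```go
package sketch

import (
	"context"
	"sync"
)

// Task is one deferred request execution.
type Task func(ctx context.Context)

// LazyWorkerPool starts with zero goroutines, spawns one per queued task up to
// maxWorkers, and lets each goroutine exit as soon as the queue is empty.
type LazyWorkerPool struct {
	mu         sync.Mutex
	pending    []Task
	maxWorkers int
	running    int
}

func NewLazyWorkerPool(maxWorkers int) *LazyWorkerPool {
	return &LazyWorkerPool{maxWorkers: maxWorkers}
}

// Submit queues a task and spawns a worker only if we are below the cap.
func (p *LazyWorkerPool) Submit(ctx context.Context, t Task) {
	p.mu.Lock()
	p.pending = append(p.pending, t)
	spawn := p.running < p.maxWorkers
	if spawn {
		p.running++
	}
	p.mu.Unlock()

	if spawn {
		go p.worker(ctx)
	}
}

// worker drains tasks until none are pending, then exits, shrinking back toward zero.
func (p *LazyWorkerPool) worker(ctx context.Context) {
	for {
		p.mu.Lock()
		if len(p.pending) == 0 {
			p.running--
			p.mu.Unlock()
			return
		}
		t := p.pending[0]
		p.pending = p.pending[1:]
		p.mu.Unlock()

		t(ctx)
	}
}
```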

@hannahhoward
Collaborator Author

hannahhoward commented Aug 25, 2021

@dirkmc @aarshkshah1992

I've refactored the handling of restart requests so that we get safer behavior out of them.

They now go in the request queue, but receive a higher priority.

Moreover, the merging behavior should ensure:

  1. a restart request gets in the queue when the active request is still running
  2. at the same time, if a restart is called on a queued request that has not started running, it will not duplicate said request (the queue will merge them as the same task but increase the priority)

One more thing remains -- should we cancel the existing request prior to queueing the restart? I think this might be important, because if we had several stalled push requests, we might never get the restart request to the top of the queue. I will look into doing this.
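
For reference, the merge-and-bump-priority behaviour being relied on here can be demonstrated against go-peertaskqueue directly (the Topic string and peer ID below are made up for the example):

```go
package main

import (
	"fmt"

	"github.com/ipfs/go-peertaskqueue"
	"github.com/ipfs/go-peertaskqueue/peertask"
	"github.com/libp2p/go-libp2p-core/peer"
)

func main() {
	ptq := peertaskqueue.New()
	p := peer.ID("remote-peer")

	// original push request for a channel, queued at normal priority
	ptq.PushTasks(p, peertask.Task{Topic: "channel-1", Priority: 0, Work: 1})

	// a restart for the same channel while it is still queued: same Topic, higher
	// priority -- the queue keeps a single pending task and bumps its priority
	// instead of adding a duplicate
	ptq.PushTasks(p, peertask.Task{Topic: "channel-1", Priority: 10, Work: 1})

	_, tasks, _ := ptq.PopTasks(100)
	fmt.Println(len(tasks)) // 1: the two pushes were merged into one task
}
```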

@hannahhoward
Collaborator Author

@dirkmc @aarshkshah1992 following @dirkmc's suggestion I did one more refactor to hide some of the nitty-gritty request details from the request queue. I also made sure we cancel in-progress requests prior to queueing deferred restarts. I'm feeling pretty good about where this is at now.

add a request queue designed to execute push-graphsync requests in a rate limited fashion
integrate request queue into graphsync transport and demonstrate push request rate limiting
put restarts into the queue, ahead of new requests, and don't duplicate for pending
make request queue more abstract by accepting an interface and do cancellations on in-progress requests right away
// Process incoming data
go t.executeGsRequest(req)
// if we initiated this data transfer or the channel is open, run it immediately, otherwise
// put it in the request queue to rate limit DDOS attacks for push transfers
Collaborator

nit: fix docs.

// if we initiated this data transfer or the channel is open, run it immediately, otherwise
// put it in the request queue to rate limit DDOS attacks for push transfers
if channelID.Initiator == t.peerID {
// Start tracking the data-transfer channel
Collaborator

nit: remove this line.


// Open a graphsync request for data to the remote peer
func (c *dtChannel) open(ctx context.Context, chid datatransfer.ChannelID, dataSender peer.ID, root ipld.Link, stor ipld.Node, doNotSendCids []cid.Cid, exts []graphsync.ExtensionData) (*gsReq, error) {
c.lk.Lock()
Collaborator

@hannahhoward - I think we should also keep the "cancel before a new open" logic in here as well by making a call to cancelInProgressRequests, just in case we've missed an edge case where a request is already running for a channel and we pull a deferred request for the same channel from the queue. We can ignore the error of that call, but it doesn't hurt to have it to keep things easy to reason about.

@@ -152,14 +179,34 @@ func (t *Transport) OpenChannel(ctx context.Context,
// Start tracking the data-transfer channel
ch := t.trackDTChannel(channelID)

// Open a graphsync request to the remote peer
req, err := ch.open(ctx, channelID, dataSender, root, stor, doNotSendCids, exts)
wasInProgress, err := ch.cancelInProgressRequests(ctx)
Contributor

It looks like the call to cancelInProgressRequests may be racy if two processes call OpenChannel at the same time - I'd suggest putting all this logic in a method on dtChannel so that it can take place under a single lock
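
One way to picture that suggestion is a single dtChannel method that cancels and reopens under one lock; the sketch below uses illustrative stand-in types, not the PR's actual code:

```go
package sketch

import (
	"context"
	"sync"
)

// channel is a stand-in for dtChannel; openFn is whatever actually issues the
// graphsync request.
type channel struct {
	lk     sync.Mutex
	cancel context.CancelFunc // cancels the in-progress request, if any
}

// reopen cancels any in-progress request and starts the new one while holding a single
// lock, so two concurrent OpenChannel calls cannot interleave between cancel and open.
func (c *channel) reopen(ctx context.Context, openFn func(ctx context.Context)) {
	c.lk.Lock()
	defer c.lk.Unlock()

	if c.cancel != nil {
		c.cancel() // cancel the previous request first
	}
	reqCtx, cancel := context.WithCancel(ctx)
	c.cancel = cancel
	go openFn(reqCtx)
}
```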


if err != nil {
return nil, err
}
c.startPaused = false
Contributor

Does it make sense to set startPaused to false here? What if there's a restart before unsealing completes?

type deferredRequest struct {
t *Transport
ch *dtChannel
channelID datatransfer.ChannelID
Contributor

nit: dtChannel already has a channelID field

// Open a graphsync request to the remote peer
req, err := dr.ch.open(dr.ctx, dr.channelID, dr.dataSender, dr.root, dr.stor, dr.doNotSendCids, dr.exts)
if err != nil {
log.Warnf("error processing request from %s: %s", dr.dataSender, err)
Contributor

Is there a way we can surface this error to the calling code? My concern is that a data transfer will fail, but the markets code will not notice so it will get stuck in a "transferring" state forever.

Member

@raulk left a comment

I think we should invest in making this throttling logic live in go-graphsync!

For sake of completeness and context, here are the big picture paths we aim to throttle:

  1. Storage deal:
    i. Client sends PUSH data-transfer request to provider.
    ii. Provider receives PUSH data-transfer request.
    iii. Provider initiates a go-graphsync request. The provider is the graphsync REQUESTOR.
    iv. Client receives the graphsync request, and acts as the RESPONDER.
  2. Retrieval:
    i. Client initiates a PULL data-transfer operation locally.
    ii. data-transfer translates this to a graphsync request. The client is the REQUESTOR.
    iii. Provider receives the graphsync request, and acts as the RESPONDER.

Fortunately, we already have throttling at the graphsync layer in place on the RESPONDER side of things. That is, in 1.iv and 2.iii (client during storage deal, provider during retrieval).

We now want to introduce throttling on the REQUESTOR side of things. This PR introduces it in step 1.ii in the data-transfer layer. But I'm convinced this is better handled at 1.iii and 2.ii in the graphsync layer (provider during storage deal, client during retrieval deal).

  1. The throttling logic ends up being split across go-data-transfer and go-graphsync, making things even harder to reason about and introducing even more complexity.
  2. IUC, this PR still leaves the retrieval completely unthrottled on the client side.
  3. This approach makes it impossible to unify throttling under a single "SimultaneousTransfers" configuration parameter.
    • The user now has to think about requestor-side and responder-side throttles individually.
    • This is not very useful, as I suspect that in most cases folks will want to apply global throttles (in fact, this was the advertised contract of SimultaneousTransfers on the Lotus side!).
    • That said, for advanced use cases it's useful to retain the ability to adjust each side individually.

@hannahhoward
Collaborator Author

hannahhoward commented Aug 26, 2021

I would personally prefer not to move to go-graphsync, at least for now.

The underlying design principle is contained in the PR description but not spelled out clearly: graphsync actions initiated by a user should not be rate limited, while those initiated by a remote peer should.

If I am a client, and I initiate 100 retrievals at the same time to 100 different miners (i.e. outgoing graphsync requests), presumably, I want them to finish as quickly as possible. I probably don't want that rate limited -- if I want to use all my CPU / RAM resources to do it, that's my choice. And, again, this isn't just for retrievals in Filecoin -- presumably Graphsync could be running in IPFS or any Libp2p context.

The Data Transfer Push is a unique use of Graphsync. In this case, the outgoing Graphsync request is NOT initiated by a user but by an automated approval of a data transfer push request from a remote peer. Another peer can monopolize your CPU/RAM resources by sending you lots of data transfer Push requests.

A client initiating Pull requests (i.e. a retrieval client) should have the ability to trigger as many requests as possible without being rate limited. On the other hand, a provider who is responding to incoming Push requests by triggering an outgoing graphsync request should be rate limited. (Arguably, this could be resolved in Filecoin with the code implemented in go-graphsync, since provider and client are always different processes with their own graphsync instances.)

Ultimately, I think Graphsync needs a Push mode -- there's really no reason we should need this extra libp2p protocol and roundtrip. At that point it would make sense to do things inside of go-graphsync, because then we can rate limit incoming Push requests while not rate limiting outgoing Pull requests. That's an undertaking. If we want to allocate time to doing this and #244 (sort of a prerequisite), I'm definitely game.

In the meantime, this is the road that stays consistent with the underlying design principle: actions initiated by a user should not be rate limited, but actions initiated by a remote peer should.

SimultaneousTransfers is a Lotus-level config and can still be used as a single value at the Lotus level -- and passed to the instantiations of graphsync & the go-data-transfer graphsync transport. In fact, that's the intent here.

@raulk
Member

raulk commented Aug 26, 2021

The main reason why I think this throttling has its home in go-graphsync is that I think of go-graphsync as an autonomous engine that receives commands to transfer data out or in; whether those commands come from the network or the user is not a concern of the library.

Being an autonomous, fully asynchronous data exchange agent means that it needs to look after its own health and progress. Part of that is observing throttling limits imposed by its environment.

I also don't think a low-level composable library like go-graphsync should make assumptions about who and how it's going to be driven or called (by programmatic behaviour or user action).

If we place this logic outside of go-graphsync, we are forcing every user to implement the throttling logic outside (I can also imagine go-ipfs or any other "user-driven" application wanting to apply some limit to gracefully handle an avalanche of user requests), so essentially we are obliging all downstream users to re-implement this concern.

> SimultaneousTransfers is a Lotus-level config and can still be used as a single value at the Lotus level -- and passed to the instantiations of graphsync & the go-data-transfer graphsync transport. In fact, that's the intent here.

What I mean to say is that the original contract of SimultaneousTransfers is that it limited all transfers (inbound or outbound). How do you suggest we apply that single, global value to two different knobs (one in graphsync, one in data-transfer)?
