
It's difficult to understand if a very long timeout in fetch could stall the extraction worker's progress #23

Closed
TimDaub opened this issue Jun 29, 2022 · 5 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed), question (Further information is requested)

Comments


TimDaub commented Jun 29, 2022

  • Consider this: we implement better-queue with a configurable concurrency parameter.
  • But in some cases a fetch timeout can take up to 300 seconds (as implemented in Chrome): https://www.benmvp.com/blog/quickie-fetch-timeout/
  • So if we set concurrency to e.g. 200, the following effect could occur:
    • We crawl with 200 concurrent requests, but occasionally a request takes 300 seconds to resolve, so it occupies one spot in the queue for 300 seconds.
    • If this happens often enough, we could suddenly have 200 requests all taking 300 seconds to resolve, so in practice we're no longer making requests to healthy endpoints and the extraction worker stalls.
  • For now this is merely a suspicion; e.g. fixing #22 ("It's hard to debug how many requests per second extraction worker does") would help us understand whether this really happens.
  • E.g. we could allow the user to configure a MAX_TIMEOUT that removes a request from the queue after some time (see the sketch below the list).
  • Later, we could even calculate a "healthy request time" metric that would allow us to remove unhealthy-looking requests from the queue.

/cc @il3ven
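A minimal sketch of what such a MAX_TIMEOUT could look like, assuming an AbortController-based wrapper around fetch; the MAX_TIMEOUT name, its value, and the fetchWithTimeout helper are placeholders for illustration, not part of the current implementation:

```js
// Hypothetical sketch: abort a fetch after MAX_TIMEOUT milliseconds instead of
// waiting for the runtime's own timeout (which can be as long as ~300s).
const MAX_TIMEOUT = 5000; // placeholder value; would be user-configurable

async function fetchWithTimeout(url, options = {}) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), MAX_TIMEOUT);
  try {
    return await fetch(url, { ...options, signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}
```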


TimDaub commented Jun 29, 2022

Debugging journal:

  • I've now enabled debugging by logging queue.getStats(), which is actually super helpful here.
  • It's exactly as I expected: the average task completion time goes up a lot over the course of requesting data.
  • At first it's just a few milliseconds, and then (just before the first fetch timeouts happen) it averages 45s per request (which is huge).
2022-06-29T09:44:40.703Z neume-network-extraction-worker:worker {"successRate":0.9954193093727978,"peak":29703,"average":45504.534531360114,"total":5676}
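For reference, stats like the line above can be produced along these lines; the queue setup and interval are illustrative, but getStats() (returning successRate, peak, average, total) is better-queue's own API:

```js
import Queue from "better-queue";

// Illustrative queue; the real extraction worker wires this up differently.
const queue = new Queue((task, cb) => {
  /* ... process the task ... */
  cb(null);
});

// Periodically log successRate, peak, average and total task counts.
setInterval(() => {
  console.log(JSON.stringify(queue.getStats()));
}, 10_000);
```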


TimDaub commented Jun 29, 2022

Found that there's potentially a problem with better-queue's maxTimeout: diamondio/better-queue#81
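For context, better-queue exposes concurrent and maxTimeout options; a sketch of how they would be set is below (whether maxTimeout actually fails the in-flight task reliably is exactly what the linked issue questions):

```js
import Queue from "better-queue";

const queue = new Queue(
  (task, cb) => {
    fetch(task.url)
      .then((res) => cb(null, res.status))
      .catch(cb);
  },
  {
    concurrent: 200,   // number of tasks processed in parallel
    maxTimeout: 10_000 // ms before better-queue considers the task failed
  }
);
```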


TimDaub commented Jun 30, 2022


il3ven commented Jul 3, 2022

I did an experiment. I pushed 6 tasks to the queue. The second task should take a very long time. I found that the second task did not stall the queue if the concurrency was greater than 1.

The above makes sense. We can imagine it like this: with concurrency equal to two we have two workers that can execute our tasks in parallel. If one of the workers gets blocked by a long task, the other worker can keep executing tasks.

Screen.Recording.2022-07-03.at.11.33.20.PM.mov
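Roughly, the experiment in the recording looks like this (a sketch with made-up delays, not the exact script that was recorded):

```js
import Queue from "better-queue";

const queue = new Queue(
  (task, cb) => {
    // Task 2 simulates a hanging request; the others resolve quickly.
    const delay = task.id === 2 ? 60_000 : 100;
    setTimeout(() => {
      console.log(`finished task ${task.id}`);
      cb(null);
    }, delay);
  },
  { concurrent: 2 } // with concurrency > 1 the slow task doesn't stall the rest
);

for (let i = 1; i <= 6; i++) queue.push({ id: i });
```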


TimDaub commented Jul 4, 2022

If one of the worker gets blocked due to a long task the other worker can keep on executing the tasks.

Yes, but I'm outlining the problem where we potentially have a concurrency of e.g. 200 parallel workers, and over time, even though the non-problematic tasks aren't blocking the queue, there are more than 200 problematic tasks in total that can clog it up. Think about it this way: if we have 20,000 tasks to execute and 200 of them take e.g. 5 minutes to clear, and those 200 bad tasks are spread over the 20,000 good tasks, there's a good chance the queue gets clogged and isn't running at full concurrency all the time (see the rough estimate below). Hence, allowing timeouts to be configured so that uneconomic tasks are ended more efficiently can be a good thing.
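A rough back-of-envelope for that scenario, assuming workers pick tasks at random from the backlog and using the numbers above (20,000 tasks, 200 of them hanging for ~300s, healthy tasks taking ~100ms, concurrency 200):

```js
// Rough steady-state estimate: the share of worker time spent on slow tasks is
// p * Tslow / (p * Tslow + (1 - p) * Tfast), where p is the fraction of slow tasks.
const concurrency = 200;
const p = 200 / 20000;  // 1% of tasks hang
const Tslow = 300;      // seconds a hanging fetch holds its slot
const Tfast = 0.1;      // seconds a healthy task takes
const slowShare = (p * Tslow) / (p * Tslow + (1 - p) * Tfast);
console.log(`~${Math.round(concurrency * slowShare)} of ${concurrency} slots busy with hanging tasks`);
// -> ~194 of 200 slots, i.e. almost no capacity left for healthy endpoints
```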

@TimDaub TimDaub closed this as completed in 66196ea Jul 4, 2022