
It's difficult to understand if a very long timeout in fetch could stall the extraction worker's progress #23

Closed
TimDaub opened this issue Jun 29, 2022 · 5 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed), question (Further information is requested)

Comments


TimDaub commented Jun 29, 2022

  • Consider this: we implement better-queue with a configurable concurrency parameter.
  • But in some cases a fetch timeout can take up to 300 seconds (as implemented in Chrome): https://www.benmvp.com/blog/quickie-fetch-timeout/
  • So if we set concurrency to e.g. 200, the following effect could occur:
    • We crawl with 200 concurrent requests, but occasionally a request takes 300 seconds to resolve, so it occupies one spot in the queue for 300 seconds.
    • If this happens often enough, we could suddenly have 200 requests all taking 300 seconds to resolve, so in practice we're no longer making requests to healthy endpoints and the extraction worker stalls.
  • For now this is merely a suspicion; e.g. fixing #22 ("It's hard to debug how many requests per second extraction worker does") would help us understand whether this really happens.
  • E.g. we could allow the user to configure a MAX_TIMEOUT that removes a request from the queue after some time (see the sketch below the list).
  • Later, we could even calculate a "healthy request time" metric that would allow us to remove unhealthy-looking requests from the queue.

/cc @il3ven
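A minimal sketch of what such a MAX_TIMEOUT could look like, assuming an AbortController-based wrapper around fetch; the MAX_TIMEOUT name, its value, and the fetchWithTimeout helper are placeholders for illustration, not part of the current implementation:

```js
// Hypothetical sketch: abort a fetch after MAX_TIMEOUT milliseconds instead of
// waiting for the runtime's own timeout (which can be as long as ~300s).
const MAX_TIMEOUT = 5000; // placeholder value; would be user-configurable

async function fetchWithTimeout(url, options = {}) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), MAX_TIMEOUT);
  try {
    return await fetch(url, { ...options, signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}
```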


TimDaub commented Jun 29, 2022

Debugging journal:

  • I've now enabled debugging by logging queue.getStats(), which is actually super helpful here.
  • It's exactly as I expected: the average task completion time goes up a lot over the course of requesting data.
  • At first it's just a few milliseconds, and then (just before the first fetch timeouts happen) it averages 45s per request (which is huge).
2022-06-29T09:44:40.703Z neume-network-extraction-worker:worker {"successRate":0.9954193093727978,"peak":29703,"average":45504.534531360114,"total":5676}
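For reference, stats like the line above can be produced along these lines; the queue setup and interval are illustrative, but getStats() (returning successRate, peak, average, total) is better-queue's own API:

```js
import Queue from "better-queue";

// Illustrative queue; the real extraction worker wires this up differently.
const queue = new Queue((task, cb) => {
  /* ... process the task ... */
  cb(null);
});

// Periodically log successRate, peak, average and total task counts.
setInterval(() => {
  console.log(JSON.stringify(queue.getStats()));
}, 10_000);
```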


TimDaub commented Jun 29, 2022

Found that there's potentially a problem with better-queue's maxTimeout: diamondio/better-queue#81
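For context, better-queue exposes concurrent and maxTimeout options; a sketch of how they would be set is below (whether maxTimeout actually fails the in-flight task reliably is exactly what the linked issue questions):

```js
import Queue from "better-queue";

const queue = new Queue(
  (task, cb) => {
    fetch(task.url)
      .then((res) => cb(null, res.status))
      .catch(cb);
  },
  {
    concurrent: 200,   // number of tasks processed in parallel
    maxTimeout: 10_000 // ms before better-queue considers the task failed
  }
);
```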


TimDaub commented Jun 30, 2022


il3ven commented Jul 3, 2022

I did an experiment. I pushed 6 tasks to the queue. The second task should take a very long time. I found that the second task did not stall the queue if the concurrency was greater than 1.

The above makes sense. We can imagine it like this: with concurrency equal to two we have two workers that can execute our tasks in parallel. If one of the workers gets blocked by a long task, the other worker can keep executing tasks.

Screen.Recording.2022-07-03.at.11.33.20.PM.mov
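Roughly, the experiment in the recording looks like this (a sketch with made-up delays, not the exact script that was recorded):

```js
import Queue from "better-queue";

const queue = new Queue(
  (task, cb) => {
    // Task 2 simulates a hanging request; the others resolve quickly.
    const delay = task.id === 2 ? 60_000 : 100;
    setTimeout(() => {
      console.log(`finished task ${task.id}`);
      cb(null);
    }, delay);
  },
  { concurrent: 2 } // with concurrency > 1 the slow task doesn't stall the rest
);

for (let i = 1; i <= 6; i++) queue.push({ id: i });
```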


TimDaub commented Jul 4, 2022

If one of the worker gets blocked due to a long task the other worker can keep on executing the tasks.

Yes, but I'm outlining the problem where we potentially have a concurrency of e.g. 200 parallel workers, and over time, even though the non-problematic tasks aren't blocking the queue, there are more than 200 problematic tasks in total that can clog it up. Think about it this way: if we have 20,000 tasks to execute and 200 of them take e.g. 5 minutes to clear, and those 200 bad tasks are spread over the 20,000 good tasks, there's a good chance the queue gets clogged and isn't running at full concurrency all the time (see the rough estimate below). Hence, allowing timeouts to be configured so that uneconomic tasks are ended more efficiently can be a good thing.
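A rough back-of-envelope for that scenario, assuming workers pick tasks at random from the backlog and using the numbers above (20,000 tasks, 200 of them hanging for ~300s, healthy tasks taking ~100ms, concurrency 200):

```js
// Rough steady-state estimate: the share of worker time spent on slow tasks is
// p * Tslow / (p * Tslow + (1 - p) * Tfast), where p is the fraction of slow tasks.
const concurrency = 200;
const p = 200 / 20000;  // 1% of tasks hang
const Tslow = 300;      // seconds a hanging fetch holds its slot
const Tfast = 0.1;      // seconds a healthy task takes
const slowShare = (p * Tslow) / (p * Tslow + (1 - p) * Tfast);
console.log(`~${Math.round(concurrency * slowShare)} of ${concurrency} slots busy with hanging tasks`);
// -> ~194 of 200 slots, i.e. almost no capacity left for healthy endpoints
```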

@TimDaub TimDaub closed this as completed in 66196ea Jul 4, 2022