Reindex freezes in the batcher near end of processing. #2255
Comments
Initially, I thought that it was just down to the long (45-minute) timeout, and assumed it would all clear up eventually, but then I came back after a 1.5-hour meeting to discover that the batcher queue had not moved in over two hours. I also then realised that the batch size is 100,000, so it should have hit that limit. https://wellcome.slack.com/archives/C3TQSF63C/p1668692622476229
I still suspect the timeout to be the problem, partly because it was resolved by scaling down. Normal-running mode has much less compute behind it, and therefore sets the timeouts and batch sizes to something much smaller. I believe this timeout is too long anyway: it processes a batch roughly every 4.5 minutes during a reindex (until this freeze), rather than every 45 minutes. I wonder if, in reindex mode, the batcher is running 5 (or more) instances in parallel, so that, on average, each of them pulled 86,000 from the queue and then sat around for 45 minutes, before putting them all back and pulling another 86,000. Given 5 retries, this could take a long time before truly failing, or before one of them is fortunate enough to pull an extra 14k and run.
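As a rough back-of-the-envelope check on that guess (a sketch only: the 5 instances, the 86,000 messages each, and the 45-minute flush timeout are the figures speculated about in this comment, not confirmed configuration values):

```python
# Sketch of the hypothesis above: several batchers each pull a partial batch,
# none of them reaches the flush threshold, and each sits on its messages
# until the flush timeout expires. All numbers are the guesses made above.
INSTANCES = 5
PER_INSTANCE_PULL = 86_000    # messages each batcher is assumed to be holding
MAX_BATCH_SIZE = 100_000      # a batcher flushes when it accumulates this many
FLUSH_TIMEOUT_MINUTES = 45    # ...or when this timer fires, whichever is first

print(INSTANCES * PER_INSTANCE_PULL)        # 430000 messages tied up in total
print(PER_INSTANCE_PULL < MAX_BATCH_SIZE)   # True: nobody fills a batch
print(MAX_BATCH_SIZE - PER_INSTANCE_PULL)   # 14000: the "extra 14k" needed to run
```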
I'm pretty certain of that now, and it's even more interesting. According to this note, the "SQS in flight limit is 120k", and according to this setting, we are running up to 12 batchers in parallel (I think). So I suspect that as long as the batchers keep up with the input, everything is fine: the pattern is that each batcher pulls its 100,000, flushes, and pulls again. But eventually it hits a scenario where the 120,000 in-flight limit is reached before any one batcher reaches 100,000. If the pull rate is spread evenly across all the batchers, then none of them can ever fill a batch and it gets stuck (sketched below). However, I would still expect:
Queue visibility is five minutes more than the flush interval, which should be long enough for a batch to go through.
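A minimal sketch of that suspected deadlock, assuming the figures quoted above (a 120,000 SQS in-flight limit, a 100,000 batch size, up to 12 batchers) and a perfectly even spread of in-flight messages:

```python
# If the in-flight cap is reached while messages are spread evenly across the
# batchers, no single batcher can ever accumulate a full batch, so every one
# of them holds its share until its flush timeout expires.
IN_FLIGHT_LIMIT = 120_000
MAX_BATCH_SIZE = 100_000
BATCHERS = 12

per_batcher = IN_FLIGHT_LIMIT // BATCHERS          # 10000 messages each
print(per_batcher, per_batcher < MAX_BATCH_SIZE)   # 10000 True: everyone is stuck
```

In practice the spread won't be perfectly even, but the condition is the same: the cap is reached before any single batcher reaches 100,000.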
Some discussion here. There is a related problem here: #2256. The point of the batcher is to minimise the number of times any given Work gets indexed, which is why it pulls so many messages off the queue at once.
Given this example from above, I think it is only ever processing one load of 100,000 at a time, because it can't pull enough from the queue to fill two loads. Even with only two instances, the second instance would always just be sitting around holding the (120k - n) records left over after however many the first one needed to reach 100k, until eventually it wins and the roles are reversed.
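Putting numbers on that two-instance case (same assumed figures; n is whatever the first instance pulled to fill its batch, taken here as the full 100,000 for simplicity):

```python
# Two instances sharing a 120,000 in-flight allowance: whichever reaches
# 100,000 first flushes and starts again; the other holds the leftover until
# the roles reverse, so only one full load is ever in progress at a time.
IN_FLIGHT_LIMIT = 120_000
MAX_BATCH_SIZE = 100_000

n = MAX_BATCH_SIZE                # what the "winner" pulled (simplified)
leftover = IN_FLIGHT_LIMIT - n    # 20000 held by the "loser"
print(leftover, leftover < MAX_BATCH_SIZE)   # 20000 True: it just sits and waits
```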
I ran a reindex with the timeout set to 5 minutes and only one batcher, and no "freeze" occurred. However, I think the batching was a bit too fragmented (over to #2256).
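A hypothetical helper (not code from this repo) that captures the condition this thread converges on: if the SQS in-flight cap, spread across all parallel batchers, leaves each of them short of a full batch, the only thing that unsticks them is the flush timeout. The single-batcher configuration used for this retest does not meet that condition:

```python
def can_starve(instances: int, max_batch_size: int, in_flight_limit: int = 120_000) -> bool:
    """True if an even spread of the in-flight allowance leaves every batcher below a full batch."""
    return in_flight_limit // instances < max_batch_size

print(can_starve(instances=12, max_batch_size=100_000))  # True  - the reindex setup described above
print(can_starve(instances=1, max_batch_size=100_000))   # False - the single-batcher retest
```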
Yesterday (2022-11-17), I ran a catalogue reindex.
The batcher input queue appeared to stick at about 430,000 for over two hours. Around 2.5 million records had been processed at this point.
I scaled the pipeline down to normal-running mode, and it cleared the blockage (however, the now-smaller ingestor then began to fail).