Reindex freezes in the batcher near end of processing. #2255

Closed · paul-butcher opened this issue Nov 18, 2022 · 7 comments

@paul-butcher (Contributor)

Yesterday (2022-11-17), I ran a catalogue reindex.

The batcher input queue appeared to be stuck at about 430,000 messages for over two hours. Around 2.5 million records had been processed at this point.

I scaled the pipeline down to normal-running mode, and that cleared the blockage (however, the now-smaller ingestor then began to fail).

@paul-butcher (Contributor, Author) commented Nov 18, 2022

Initially, I thought it was just down to the long (45-minute) timeout and assumed it would all clear up eventually, but I came back after a 1.5-hour meeting to discover that the batcher queue had not moved in over two hours. I also then realised that the batch size is 100,000, so it should have hit that limit.

https://wellcome.slack.com/archives/C3TQSF63C/p1668692622476229

@paul-butcher (Contributor, Author) commented Nov 18, 2022

I still suspect the timeout is the problem, partly because the freeze was resolved by scaling down.

Normal-running mode has much less compute behind it, and therefore sets the timeouts and batch sizes to something much smaller.

I believe this timeout is too long anyway: during a reindex (until this freeze), a batch was processed roughly every 4.5 minutes, rather than every 45 minutes.

I wonder if, in reindex mode, the batcher is running 5 (or more) instances in parallel, so that, on average, each of them pulled 86,000 from the queue and then sat around for 45 minutes before putting them all back and pulling another 86,000. Given 5 retries, this could take a long time before the messages truly fail, or before one instance is fortunate enough to pull an extra 14,000 and run.
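
A quick check of the arithmetic behind this hypothesis, as a hedged sketch. The 430,000, 5-instance and 100,000 figures are the ones quoted in this thread; the later comments refine the model with the SQS in-flight cap.

```python
# Back-of-envelope check: 430k stuck messages spread over 5 batcher instances
# never gives any single instance the 100k it needs to flush a batch early,
# so each one just waits out the 45-minute flush timeout.
stuck_messages = 430_000
instances = 5
batch_size = 100_000

per_instance = stuck_messages // instances
print(per_instance)                # 86000
print(per_instance >= batch_size)  # False: nobody can flush early
print(batch_size - per_instance)   # 14000: the "extra 14k" one lucky instance would need
```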

@paul-butcher (Contributor, Author) commented Nov 18, 2022

I'm pretty certain of that now. And it's even more interesting!

According to this note, the "SQS in flight limit is 120k".

According to this setting, we are running up to 12 batchers in parallel (I think).

So I suspect that as long as the batcher keeps up with the input, everything is fine. The pattern is something like this:

batcher1: pull 100,000
batcher2: pull 20,000
batcher1: finish 100,000
batcher2: pull 80,000
batcher1: pull 40,000
batcher2: finish 100,000

and so on.

But eventually it hits a scenario where the 120,000 in-flight limit is reached before any one batcher reaches 100,000. If the pull rate is spread evenly across all the batchers, then we get something like this, and it gets stuck:

batcher 1: pull 100,000
batchers 2-12: pull ca. 1,800 each
batcher 1: finish
batcher 1: pull ca. 8,000
batchers 2-12: pull ca. 2,600 each

However, I would still expect that, 45 minutes later, all 120,000 in-flight messages would finish and batchers 1-12 would then pull ca. 10,000 each.
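
To make the stuck state concrete, here is a minimal simulation sketch. This is not the pipeline's actual code: the 120k in-flight cap, 100k batch size and 12 instances are the figures quoted above, and the even-spread pull pattern is an assumption.

```python
# Minimal sketch of the deadlocked state: once the SQS in-flight cap is spread
# roughly evenly across the batcher instances, no single instance can reach the
# 100k batch threshold, so every instance waits for the 45-minute flush interval.
IN_FLIGHT_CAP = 120_000   # SQS stops handing out messages once this many are in flight
BATCH_SIZE = 100_000      # a batcher only flushes early once it holds this many
N_BATCHERS = 12           # reindex-scale instance count quoted above

def even_spread(queue_depth: int) -> list[int]:
    """Hand each batcher an even share of the available in-flight allowance."""
    available = min(queue_depth, IN_FLIGHT_CAP)
    share, spare = divmod(available, N_BATCHERS)
    return [share + (1 if i < spare else 0) for i in range(N_BATCHERS)]

held = even_spread(queue_depth=430_000)
print(held[0])                  # 10000 messages per batcher
print(max(held) >= BATCH_SIZE)  # False: no early flush, everyone sits on its messages
```

As the comment above notes, even in this state the 45-minute flush should eventually clear the held messages, so this sketch only shows why no batcher can flush early, not why the queue stayed stuck for over two hours.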

@pollecuttn pollecuttn moved this to Backlog in Digital platform Nov 18, 2022
@pollecuttn pollecuttn moved this from Backlog to In Progress in Digital platform Nov 18, 2022
@paul-butcher (Contributor, Author)

Queue visibility is five minutes more than the flush interval, which should be long enough for a batch to go through.
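
A minimal sketch of that constraint. The 45-minute flush interval is the figure discussed above; the variable names are illustrative, not the pipeline's actual config keys.

```python
# The visibility timeout must outlast the flush interval, otherwise SQS makes the
# messages visible again (and redelivers them) while a batcher is still holding
# them for its current batch.
from datetime import timedelta

flush_interval = timedelta(minutes=45)
visibility_timeout = flush_interval + timedelta(minutes=5)

print(visibility_timeout)                   # 0:50:00
print(visibility_timeout > flush_interval)  # True: a full flush cycle fits inside it
```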

@paul-butcher (Contributor, Author)

Some discussion here:
https://wellcome.slack.com/archives/C3TQSF63C/p1668767687778159?thread_ts=1668766836.854159&cid=C3TQSF63C

There is a related problem here: #2256

The point of the batcher is to minimise the number of times any given Work gets indexed, which is why it pulls so many messages off the queue at once.
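
A minimal illustration (not the real batcher code) of why pulling lots of messages at once reduces indexing work: duplicate notifications for the same Work collapse into one entry per batch, so the downstream stage indexes each Work at most once per flush. The identifiers here are made up purely for illustration.

```python
# Toy deduplication: the more notifications a batch covers, the more duplicates
# for the same Work collapse into a single downstream index operation.
def dedupe(notifications: list[str]) -> list[str]:
    """Collapse repeated Work identifiers, keeping first-seen order."""
    return list(dict.fromkeys(notifications))

notifications = ["work/a", "work/b", "work/a", "work/c", "work/b", "work/a"]
batch = dedupe(notifications)
print(batch)  # ['work/a', 'work/b', 'work/c']
print(f"{len(notifications)} notifications -> {len(batch)} index operations")
```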

@paul-butcher (Contributor, Author) commented Nov 18, 2022

Given this example from above,

batcher1: pull 100,000
batcher2: pull 20,000
batcher1: finish 100,000
batcher2: pull 80,000
batcher1: pull 40,000
batcher2: finish 100,000

I think it is only ever processing one load of 100,000 at a time, because it can't pull enough from the queue to fill two loads.

Even with only two instances, the second instance would always just be sitting around holding the (120k - n) records left over after the first one took however many it needed to reach 100k, until eventually it wins and the roles are reversed.
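
The arithmetic behind this, as a small hedged check using the figures from this thread:

```python
# With a 120k in-flight cap and a 100k batch size, two full batches can never be
# in flight at once, so a second instance only ever holds the leftovers.
IN_FLIGHT_CAP = 120_000
BATCH_SIZE = 100_000

print(2 * BATCH_SIZE <= IN_FLIGHT_CAP)  # False: only one full load fits in flight
print(IN_FLIGHT_CAP - BATCH_SIZE)       # 20000: the most the "idle" instance can hold
```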

@paul-butcher (Contributor, Author)

Having run a reindex with the timeout set to 5 minutes and only one batcher, I saw no "freeze". However, I think the batching was a bit too fragmented (over to #2256).

@paul-butcher paul-butcher moved this from In Progress to Ready for review in Digital platform Nov 22, 2022
@paul-butcher paul-butcher moved this from Ready for review to Done in Digital platform Nov 22, 2022
@pollecuttn pollecuttn moved this from Done to Archive in Digital platform Nov 23, 2022