Reindex freezes in the batcher near end of processing. #2255

Closed · paul-butcher opened this issue Nov 18, 2022 · 7 comments

@paul-butcher (Contributor)

Yesterday (2022-11-17), I ran a catalogue reindex.

The batcher input queue appeared to be stuck at about 430,000 messages for over two hours. Around 2.5 million records had been processed at this point.

I scaled the pipeline down to normal-running mode, and that cleared the blockage (however, the now-smaller ingestor then began to fail).

@paul-butcher (Contributor, Author) commented Nov 18, 2022

Initially, I thought it was just down to the long (45-minute) timeout and assumed it would all clear up eventually, but I came back after a 1.5-hour meeting to discover that the batcher queue had not moved in over two hours. I also then realised that the batch size is 100,000, so it should have hit that limit.

https://wellcome.slack.com/archives/C3TQSF63C/p1668692622476229

@paul-butcher (Contributor, Author) commented Nov 18, 2022

I still suspect the timeout is the problem, partly because the freeze was resolved by scaling down.

Normal-running mode has much less compute behind it, and therefore sets the timeouts and batch sizes to something much smaller.

I believe this timeout is too long anyway: during a reindex (until this freeze), a batch was processed roughly every 4.5 minutes, rather than every 45 minutes.

I wonder if, in reindex mode, the batcher is running 5 (or more) instances in parallel, so that, on average, each of them pulled 86,000 from the queue and then sat around for 45 minutes before putting them all back and pulling another 86,000. Given 5 retries, this could take a long time before the messages truly fail, or before one instance is fortunate enough to pull an extra 14,000 and run.
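
A quick check of the arithmetic behind this hypothesis, as a hedged sketch. The 430,000, 5-instance and 100,000 figures are the ones quoted in this thread; the later comments refine the model with the SQS in-flight cap.

```python
# Back-of-envelope check: 430k stuck messages spread over 5 batcher instances
# never gives any single instance the 100k it needs to flush a batch early,
# so each one just waits out the 45-minute flush timeout.
stuck_messages = 430_000
instances = 5
batch_size = 100_000

per_instance = stuck_messages // instances
print(per_instance)                # 86000
print(per_instance >= batch_size)  # False: nobody can flush early
print(batch_size - per_instance)   # 14000: the "extra 14k" one lucky instance would need
```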

@paul-butcher (Contributor, Author) commented Nov 18, 2022

I'm pretty certain of that now. And it's even more interesting!

According to this note, the "SQS in flight limit is 120k".

According to this setting, we are running up to 12 batchers in parallel (I think).

So I suspect that as long as the batcher keeps up with the input, everything is fine. The pattern is something like this:

batcher1: pull 100,000
batcher2: pull 20,000
batcher1: finish 100,000
batcher2: pull 80,000
batcher1: pull 40,000
batcher2: finish 100,000

and so on.

But eventually it hits a scenario where the 120,000 in-flight limit is reached before any one batcher reaches 100,000. If the pull rate is spread evenly across all the batchers, then we get something like this, and it gets stuck:

batcher 1: pull 100,000
batchers 2-12: pull ca. 1,800 each
batcher 1: finish
batcher 1: pull ca. 8,000
batchers 2-12: pull ca. 2,600 each

However, I would still expect that, 45 minutes later, all 120,000 in-flight messages would finish and batchers 1-12 would then pull ca. 10,000 each.
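
To make the stuck state concrete, here is a minimal simulation sketch. This is not the pipeline's actual code: the 120k in-flight cap, 100k batch size and 12 instances are the figures quoted above, and the even-spread pull pattern is an assumption.

```python
# Minimal sketch of the deadlocked state: once the SQS in-flight cap is spread
# roughly evenly across the batcher instances, no single instance can reach the
# 100k batch threshold, so every instance waits for the 45-minute flush interval.
IN_FLIGHT_CAP = 120_000   # SQS stops handing out messages once this many are in flight
BATCH_SIZE = 100_000      # a batcher only flushes early once it holds this many
N_BATCHERS = 12           # reindex-scale instance count quoted above

def even_spread(queue_depth: int) -> list[int]:
    """Hand each batcher an even share of the available in-flight allowance."""
    available = min(queue_depth, IN_FLIGHT_CAP)
    share, spare = divmod(available, N_BATCHERS)
    return [share + (1 if i < spare else 0) for i in range(N_BATCHERS)]

held = even_spread(queue_depth=430_000)
print(held[0])                  # 10000 messages per batcher
print(max(held) >= BATCH_SIZE)  # False: no early flush, everyone sits on its messages
```

As the comment above notes, even in this state the 45-minute flush should eventually clear the held messages, so this sketch only shows why no batcher can flush early, not why the queue stayed stuck for over two hours.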

@pollecuttn pollecuttn moved this to Backlog in Digital platform Nov 18, 2022
@pollecuttn pollecuttn moved this from Backlog to In Progress in Digital platform Nov 18, 2022
@paul-butcher (Contributor, Author)

Queue visibility is five minutes more than the flush interval, which should be long enough for a batch to go through.
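
A minimal sketch of that constraint. The 45-minute flush interval is the figure discussed above; the variable names are illustrative, not the pipeline's actual config keys.

```python
# The visibility timeout must outlast the flush interval, otherwise SQS makes the
# messages visible again (and redelivers them) while a batcher is still holding
# them for its current batch.
from datetime import timedelta

flush_interval = timedelta(minutes=45)
visibility_timeout = flush_interval + timedelta(minutes=5)

print(visibility_timeout)                   # 0:50:00
print(visibility_timeout > flush_interval)  # True: a full flush cycle fits inside it
```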

@paul-butcher (Contributor, Author)

Some discussion here:
https://wellcome.slack.com/archives/C3TQSF63C/p1668767687778159?thread_ts=1668766836.854159&cid=C3TQSF63C

There is a related problem here: #2256

The point of the batcher is to minimise the number of times any given Work gets indexed, which is why it pulls so many messages off the queue at once.
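
A minimal illustration (not the real batcher code) of why pulling lots of messages at once reduces indexing work: duplicate notifications for the same Work collapse into one entry per batch, so the downstream stage indexes each Work at most once per flush. The identifiers here are made up purely for illustration.

```python
# Toy deduplication: the more notifications a batch covers, the more duplicates
# for the same Work collapse into a single downstream index operation.
def dedupe(notifications: list[str]) -> list[str]:
    """Collapse repeated Work identifiers, keeping first-seen order."""
    return list(dict.fromkeys(notifications))

notifications = ["work/a", "work/b", "work/a", "work/c", "work/b", "work/a"]
batch = dedupe(notifications)
print(batch)  # ['work/a', 'work/b', 'work/c']
print(f"{len(notifications)} notifications -> {len(batch)} index operations")
```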

@paul-butcher (Contributor, Author) commented Nov 18, 2022

Given this example from above,

batcher1: pull 100,000
batcher2: pull 20,000
batcher1: finish 100,000
batcher2: pull 80,000
batcher1: pull 40,000
batcher2: finish 100,000

I think it is only ever processing one load of 100,000 at a time, because it can't pull enough from the queue to fill two loads.

Even with only two instances, the second instance would always just be sitting around holding the (120k - n) records left over after the first one took however many it needed to reach 100k, until eventually it wins and the roles are reversed.
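
The arithmetic behind this, as a small hedged check using the figures from this thread:

```python
# With a 120k in-flight cap and a 100k batch size, two full batches can never be
# in flight at once, so a second instance only ever holds the leftovers.
IN_FLIGHT_CAP = 120_000
BATCH_SIZE = 100_000

print(2 * BATCH_SIZE <= IN_FLIGHT_CAP)  # False: only one full load fits in flight
print(IN_FLIGHT_CAP - BATCH_SIZE)       # 20000: the most the "idle" instance can hold
```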

@paul-butcher (Contributor, Author)

Having run a reindex with the timeout set to 5 minutes and only one batcher, I saw no "freeze". However, I think the batching was a bit too fragmented (over to #2256).

@paul-butcher paul-butcher moved this from In Progress to Ready for review in Digital platform Nov 22, 2022
@paul-butcher paul-butcher moved this from Ready for review to Done in Digital platform Nov 22, 2022
@pollecuttn pollecuttn moved this from Done to Archive in Digital platform Nov 23, 2022