Reindexing - Works are processed multiple times #2256

paul-butcher opened this issue Nov 18, 2022 · 8 comments

@paul-butcher
Contributor

paul-butcher commented Nov 18, 2022

I recently ran a catalogue reindex (2022-11-17). This is the output of python get_reindex_status.py 2022-11-17 before redriving any DLQs.

source records      2,763,355
works-source        2,763,340  ▼ 15
works-identified    2,763,340
works-merged        2,764,855
works-denormalised  2,762,428  ▼ 2,427
works-indexed       2,755,508  ▼ 6,920

*** Image index stats ***
images-initial    144,259
images-augmented  144,258  ▼ 1
images-indexed    144,258

*** SQS stats ***
The following queues still have messages:
ingestor_works_input     1,668,305
matcher_input                    5
relation_embedder_input     24,545

*** DLQ stats ***
The following DLQs have failed messages:
batcher_input_dlq                 2,364
image_inferrer_input_dlq              2
ingestor_works_input_dlq      1,765,448
transformer_sierra_input_dlq          7
transformer_tei_input_dlq            13

Approximately 99% of records have been reindexed successfully (counts may not be exact)

Counterintuitively, although 99% of the nearly 3 million records (2.76 million works, 0.14 million images) have been "reindexed successfully", over 3 million messages are still to go: 1.67 million are still waiting to be indexed, and 1.76 million have failed to index.
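
For concreteness: the ingestor queue and its DLQ together already hold more messages than there are source records, which is only possible if the same works are being sent downstream more than once. A quick check with the figures above:

```python
# Messages waiting at (or failed by) the works ingestor already outnumber the
# source records -- evidence of works being processed multiple times.
queued = 1_668_305   # ingestor_works_input
failed = 1_765_448   # ingestor_works_input_dlq
source = 2_763_355   # source records

print(queued + failed)                       # 3433753 messages at the ingestor stage
print(round((queued + failed) / source, 2))  # ~1.24 ingest attempts per source record, so far
```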

@alexwlchan
Contributor

Yeah, I think this is the relation embedder – if you have e.g. 10 works in a tree, but they all get processed in different batches, then you end up sending 1 + 2 + … + 10 = 55 works to the final ingestor. 😬
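
A minimal sketch of that arithmetic, assuming each of the 10 works in a tree arrives in its own batch and each batch causes the whole known tree to be re-embedded and re-sent downstream:

```python
# Count the works that reach the ingestor when a tree of `tree_size` works is
# spread evenly across `batches` separate batcher outputs, and each output
# re-sends every work of the tree seen so far.
def works_sent_downstream(tree_size: int, batches: int) -> int:
    per_batch = tree_size // batches
    seen = 0
    total_sent = 0
    for _ in range(batches):
        seen += per_batch      # new works arriving in this batch
        total_sent += seen     # the whole known tree is re-embedded and re-sent
    return total_sent

print(works_sent_downstream(10, 1))   # 10 -- ideal case: whole tree in one batch
print(works_sent_downstream(10, 10))  # 55 -- worst case: one work per batch
```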

@paul-butcher
Contributor Author

This is (I believe) because of the relation embedder. A work will go all the way through the pipeline, then one of its relations will go through, causing that first work to be reprocessed (from the relation embedder down), then another relation, and so on.

This could be resolved with a more DAG-like approach during a reindex. All queues upstream of the relation embedder subsystem should be empty before it starts embedding relations.

Although this introduces a bottleneck at that point, I believe it would actually speed the pipeline up, because the repeat processing would be removed.
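
As a sketch of what that gating could look like (the queue names and polling loop here are illustrative, not the pipeline's actual orchestration), a reindex script could hold back the relation embedder stage until the upstream SQS queues report as empty:

```python
# Illustrative only: poll the approximate depth of the queues upstream of the
# relation embedder subsystem, and return once they have all drained.
import time

import boto3

sqs = boto3.client("sqs")

# Hypothetical names standing in for everything upstream of the batcher /
# relation embedder subsystem.
UPSTREAM_QUEUES = [
    "transformer_sierra_input",
    "transformer_tei_input",
    "matcher_input",
    "merger_input",
    "router_input",
    "batcher_input",
]


def approximate_depth(queue_name: str) -> int:
    url = sqs.get_queue_url(QueueName=queue_name)["QueueUrl"]
    attrs = sqs.get_queue_attributes(
        QueueUrl=url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )["Attributes"]
    return int(attrs["ApproximateNumberOfMessages"]) + int(
        attrs["ApproximateNumberOfMessagesNotVisible"]
    )


def wait_for_upstream_to_drain(poll_interval: int = 60) -> None:
    """Block until every upstream queue is (approximately) empty; only then
    should the relation embedder be allowed to start."""
    while True:
        depths = {q: approximate_depth(q) for q in UPSTREAM_QUEUES}
        if all(d == 0 for d in depths.values()):
            return
        print({q: d for q, d in depths.items() if d > 0})
        time.sleep(poll_interval)
```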

At present, the progress of n% of records having been processed is somewhat illusory. The right number of records have made it through the system, but some of them are incomplete.

@paul-butcher
Contributor Author

paul-butcher commented Nov 18, 2022

I suspect that if the batcher runs with a shorter timeout and fewer instances, it would mitigate both this problem and the one described in #2255.

The reason a work goes through the relation embedder multiple times is that it is present in multiple batches.

The reason it is in multiple batches output by the batcher is that it is relevant to multiple batches of messages pulled from the batcher input queue.

There are two reasons for it to be in multiple batches of messages (see the sketch after this list):

  1. it is in one batch of 100k, and its relations are in another batch of 100k
  2. the batcher is highly fragmented, and is only processing 10k at a time
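
A toy simulation of both cases, assuming a simplified batcher that groups the paths in each pull by the root of the path and emits one relation-embedder batch per root it sees:

```python
# Count how many separate relation-embedder batches each tree ends up in,
# given a stream of paths and a batcher pull size.
from collections import defaultdict


def batches_per_tree(paths, pull_size):
    counts = defaultdict(int)
    for i in range(0, len(paths), pull_size):
        pull = paths[i:i + pull_size]
        for root in {p.split("/")[0] for p in pull}:
            counts[root] += 1
    return dict(counts)


# Ten works in a single tree "A", arriving in order.
paths = [f"A/{n}" for n in range(10)]

print(batches_per_tree(paths, pull_size=10))  # {'A': 1} -- whole tree in one pull
print(batches_per_tree(paths, pull_size=6))   # {'A': 2} -- tree straddles two pulls (reason 1)
print(batches_per_tree(paths, pull_size=2))   # {'A': 5} -- heavy fragmentation (reason 2)
```

Any count above 1 means the works in that tree go through the relation embedder and ingestor again.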

@paul-butcher
Contributor Author

The shorter timeout (5 minutes) I added to fix #2255 leads to greater fragmentation, as the processes upstream of the batcher do not seem to produce anywhere near 100,000 records in 5 minutes.

I suspect that the only way to truly resolve this, without resorting to unacceptably long timeouts, is the DAG approach.

@paul-butcher
Contributor Author

paul-butcher commented Nov 18, 2022

I started a new reindex at about 13:15.

With the new settings, it has reached 33% in about 30 minutes.

Let's see how long the remaining two thirds takes. These times for different fractions will give a clue as to whether the multiple processing of works in the relation embedder is significant.

50% in 45 minutes.

75% in a little under 80 minutes.

83% in 90 minutes.

99% in 105 minutes, with about 5,000 left on relation_embedder_input and 400,000 on the ingestor.

@paul-butcher
Contributor Author

It has reached 100% with the queues looking like this:

ingestor_works_input     417,609
merger_input                   4
relation_embedder_input    4,434

@paul-butcher
Contributor Author

From about 2 hours in, the only things upstream of ingestor_works_input appear to be regular "normal running" records coming through the system, not part of the reindex. However, the works ingestor has started to fail.

At about 2hrs 15 (135 minutes):

*** SQS stats ***
The following queues still have messages:
ingestor_works_input      189,857
merger_input                    2
router_input                    4
transformer_sierra_input       23

*** DLQ stats ***
The following DLQs have failed messages:
image_inferrer_input_dlq            2
ingestor_works_input_dlq      132,181
transformer_sierra_input_dlq        7
transformer_tei_input_dlq          13

@paul-butcher
Contributor Author

paul-butcher commented Nov 21, 2022

I ran the pipeline with one batcher and a 5-minute interval; these are the sizes (numbers of paths) of the batches it processed:

7
619
3335
15162
30310
36116
41261
46790
49485
46820
31375
16917
16779
17384
17317
18571
23330
21437
15113

So, after a bit of a "warm-up", it looks like about 30k per five minutes. I think 5 minutes is a bit too fragmented, but it's evidence that the 45-minute timeout is much too long. Looking at the pattern above, I suspect that 25 minutes is a good interval (first run about 50k, then it should carry on hitting 120k until upstream runs out of steam).

This is consistent with roughly 0.5 million records going through the batcher: processing them in bundles of 120k means running it 5 times at best (probably 6: 4 × 120k, plus one either side), and the whole pipeline takes a bit over 2 hours.
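
A quick back-of-envelope version of that estimate, using the 0.5 million / 50k / 120k / 25-minute figures above:

```python
# Roughly 0.5M paths go through the batcher; assume ~50k accumulate in the
# first 25-minute window and ~120k in each window after that.
import math

total_paths = 500_000
first_window = 50_000
per_window = 120_000
window_minutes = 25

runs = 1 + math.ceil((total_paths - first_window) / per_window)
print(runs, "batcher runs ->", runs * window_minutes, "minutes of batching windows")
# -> 5 batcher runs -> 125 minutes of batching windows (6 runs, ~150 minutes,
#    if there's a straggler run at the end), i.e. a bit over 2 hours overall
```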

25 minutes should not be a terrible time to wait at the end.
