Reindexing - Works are processed multiple times #2256

paul-butcher opened this issue Nov 18, 2022 · 8 comments

@paul-butcher
Contributor

paul-butcher commented Nov 18, 2022

I recently ran a catalogue reindex (2022-11-17). This is the output of python get_reindex_status.py 2022-11-17 before redriving any DLQs.

source records      2,763,355
works-source        2,763,340  ▼ 15
works-identified    2,763,340
works-merged        2,764,855
works-denormalised  2,762,428  ▼ 2,427
works-indexed       2,755,508  ▼ 6,920

*** Image index stats ***
images-initial    144,259
images-augmented  144,258  ▼ 1
images-indexed    144,258

*** SQS stats ***
The following queues still have messages:
ingestor_works_input     1,668,305
matcher_input                    5
relation_embedder_input     24,545

*** DLQ stats ***
The following DLQs have failed messages:
batcher_input_dlq                 2,364
image_inferrer_input_dlq              2
ingestor_works_input_dlq      1,765,448
transformer_sierra_input_dlq          7
transformer_tei_input_dlq            13

Approximately 99% of records have been reindexed successfully (counts may not be exact)

Counterintuitively, although 99% of the nearly 3 million records (2.76 million works, 0.14 million images) have been "reindexed successfully", over 3 million messages are still to go: 1.67 million are still waiting to be indexed, and 1.76 million have failed to index.
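
For concreteness: the ingestor queue and its DLQ together already hold more messages than there are source records, which is only possible if the same works are being sent downstream more than once. A quick check with the figures above:

```python
# Messages waiting at (or failed by) the works ingestor already outnumber the
# source records -- evidence of works being processed multiple times.
queued = 1_668_305   # ingestor_works_input
failed = 1_765_448   # ingestor_works_input_dlq
source = 2_763_355   # source records

print(queued + failed)                       # 3433753 messages at the ingestor stage
print(round((queued + failed) / source, 2))  # ~1.24 ingest attempts per source record, so far
```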

@alexwlchan
Contributor

Yeah, I think this is the relation embedder – if you have e.g. 10 works in a tree, but they all get processed in different batches, then you end up sending 1 + 2 + … + 10 = 55 works to the final ingestor. 😬
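
A minimal sketch of that arithmetic, assuming each of the 10 works in a tree arrives in its own batch and each batch causes the whole known tree to be re-embedded and re-sent downstream:

```python
# Count the works that reach the ingestor when a tree of `tree_size` works is
# spread evenly across `batches` separate batcher outputs, and each output
# re-sends every work of the tree seen so far.
def works_sent_downstream(tree_size: int, batches: int) -> int:
    per_batch = tree_size // batches
    seen = 0
    total_sent = 0
    for _ in range(batches):
        seen += per_batch      # new works arriving in this batch
        total_sent += seen     # the whole known tree is re-embedded and re-sent
    return total_sent

print(works_sent_downstream(10, 1))   # 10 -- ideal case: whole tree in one batch
print(works_sent_downstream(10, 10))  # 55 -- worst case: one work per batch
```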

@paul-butcher
Contributor Author

This is (I believe) because of the relation embedder. A work will go all the way through the pipeline, then one of its relations will go through, causing that first work to be reprocessed (from the relation embedder down), then another relation, and so on.

This could be resolved with a more DAG-like approach during a reindex. All queues upstream of the relation embedder subsystem should be empty before it starts embedding relations.

Although this introduces a bottleneck at that point, I believe it would actually speed the pipeline up, because the repeat processing would be removed.
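
As a sketch of what that gating could look like (the queue names and polling loop here are illustrative, not the pipeline's actual orchestration), a reindex script could hold back the relation embedder stage until the upstream SQS queues report as empty:

```python
# Illustrative only: poll the approximate depth of the queues upstream of the
# relation embedder subsystem, and return once they have all drained.
import time

import boto3

sqs = boto3.client("sqs")

# Hypothetical names standing in for everything upstream of the batcher /
# relation embedder subsystem.
UPSTREAM_QUEUES = [
    "transformer_sierra_input",
    "transformer_tei_input",
    "matcher_input",
    "merger_input",
    "router_input",
    "batcher_input",
]


def approximate_depth(queue_name: str) -> int:
    url = sqs.get_queue_url(QueueName=queue_name)["QueueUrl"]
    attrs = sqs.get_queue_attributes(
        QueueUrl=url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )["Attributes"]
    return int(attrs["ApproximateNumberOfMessages"]) + int(
        attrs["ApproximateNumberOfMessagesNotVisible"]
    )


def wait_for_upstream_to_drain(poll_interval: int = 60) -> None:
    """Block until every upstream queue is (approximately) empty; only then
    should the relation embedder be allowed to start."""
    while True:
        depths = {q: approximate_depth(q) for q in UPSTREAM_QUEUES}
        if all(d == 0 for d in depths.values()):
            return
        print({q: d for q, d in depths.items() if d > 0})
        time.sleep(poll_interval)
```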

At present, the progress of n% of records having been processed is somewhat illusory. The right number of records have made it through the system, but some of them are incomplete.

@paul-butcher
Contributor Author

paul-butcher commented Nov 18, 2022

I suspect that if the batcher runs with a shorter timeout and fewer instances, it would mitigate both this problem and the one described in #2255.

The reason a work goes through the relation embedder multiple times is that it is present in multiple batches.

The reason it is in multiple batches output by the batcher is that it is relevant to multiple batches of messages pulled from the batcher input queue.

There are two reasons for it to be in multiple batches of messages (see the sketch after this list):

  1. it is in one batch of 100k, and its relations are in another batch of 100k
  2. the batcher is highly fragmented, and is only processing 10k at a time
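
A toy simulation of both cases, assuming a simplified batcher that groups the paths in each pull by the root of the path and emits one relation-embedder batch per root it sees:

```python
# Count how many separate relation-embedder batches each tree ends up in,
# given a stream of paths and a batcher pull size.
from collections import defaultdict


def batches_per_tree(paths, pull_size):
    counts = defaultdict(int)
    for i in range(0, len(paths), pull_size):
        pull = paths[i:i + pull_size]
        for root in {p.split("/")[0] for p in pull}:
            counts[root] += 1
    return dict(counts)


# Ten works in a single tree "A", arriving in order.
paths = [f"A/{n}" for n in range(10)]

print(batches_per_tree(paths, pull_size=10))  # {'A': 1} -- whole tree in one pull
print(batches_per_tree(paths, pull_size=6))   # {'A': 2} -- tree straddles two pulls (reason 1)
print(batches_per_tree(paths, pull_size=2))   # {'A': 5} -- heavy fragmentation (reason 2)
```

Any count above 1 means the works in that tree go through the relation embedder and ingestor again.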

@paul-butcher
Contributor Author

The shorter timeout (5 minutes) I added to fix #2255 leads to greater fragmentation, as the processes upstream of the batcher do not seem to produce anywhere near 100,000 records in 5 minutes.

I suspect that the only way to truly resolve this, without resorting to unacceptably long timeouts, is the DAG approach.

@paul-butcher
Contributor Author

paul-butcher commented Nov 18, 2022

I started a new reindex at about 13:15.

With the new settings, it has reached 33% in about 30 minutes.

Let's see how long the remaining two thirds takes. These times for different fractions will give a clue as to whether the multiple processing of works in the relation embedder is significant.

50% in 45 minutes.

75% in a little under 80 minutes.

83% in 90 minutes.

99% in 105 minutes, with about 5,000 left on relation_embedder_input and 400,000 on the ingestor.

@paul-butcher
Contributor Author

It has reached 100% with the queues looking like this:

ingestor_works_input     417,609
merger_input                   4
relation_embedder_input    4,434

@paul-butcher
Contributor Author

From about 2 hours in, the only things upstream of ingestor_works_input appear to be regular "normal running" records coming through the system, not part of the reindex. However, the works ingestor has started to fail.

At about 2hrs 15 (135 minutes):

*** SQS stats ***
The following queues still have messages:
ingestor_works_input      189,857
merger_input                    2
router_input                    4
transformer_sierra_input       23

*** DLQ stats ***
The following DLQs have failed messages:
image_inferrer_input_dlq            2
ingestor_works_input_dlq      132,181
transformer_sierra_input_dlq        7
transformer_tei_input_dlq          13

@paul-butcher
Contributor Author

paul-butcher commented Nov 21, 2022

I ran the pipeline with one batcher and a 5-minute interval; these are the sizes (numbers of paths) of the batches it processed:

7
619
3335
15162
30310
36116
41261
46790
49485
46820
31375
16917
16779
17384
17317
18571
23330
21437
15113

So, after a bit of a "warm-up", it looks like about 30k per five minutes. I think 5 minutes is a bit too fragmented, but it's evidence that the 45-minute timeout is much too long. Looking at the pattern above, I suspect that 25 minutes is a good interval (first run about 50k, then it should carry on hitting 120k until upstream runs out of steam).

This is consistent with roughly 0.5 million records going through the batcher: processing them in bundles of 120k means running it 5 times at best (probably 6: 4 × 120k, plus one either side), and the whole pipeline takes a bit over 2 hours.
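
A quick back-of-envelope version of that estimate, using the 0.5 million / 50k / 120k / 25-minute figures above:

```python
# Roughly 0.5M paths go through the batcher; assume ~50k accumulate in the
# first 25-minute window and ~120k in each window after that.
import math

total_paths = 500_000
first_window = 50_000
per_window = 120_000
window_minutes = 25

runs = 1 + math.ceil((total_paths - first_window) / per_window)
print(runs, "batcher runs ->", runs * window_minutes, "minutes of batching windows")
# -> 5 batcher runs -> 125 minutes of batching windows (6 runs, ~150 minutes,
#    if there's a straggler run at the end), i.e. a bit over 2 hours overall
```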

25 minutes should not be a terrible time to wait at the end.
