Reindexing - Works are processed multiple times #2256
Comments
Yeah, I think this is the relation embedder – if you have e.g. 10 works in a tree, but they all get processed in different batches, then you end up sending 1 + 2 + … + 10 = 55 works to the final ingestor. 😬
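The arithmetic behind that worst case, as a minimal back-of-envelope sketch (not pipeline code), assuming the 10 works of a tree arrive in 10 separate batches and each batch re-sends every work of the tree seen so far:

```python
# Back-of-envelope sketch: if a tree of n works is split across n batches,
# and each batch re-embeds and re-sends the whole tree seen so far,
# the final ingestor receives 1 + 2 + ... + n messages.
def messages_sent_downstream(tree_size: int) -> int:
    return sum(range(1, tree_size + 1))  # = n * (n + 1) / 2


print(messages_sent_downstream(10))  # 55
```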
This is (I believe) because of the relation embedder. A work will go all the way through the pipeline, then one of its relatives will go through, causing that first work to be reprocessed (from the relation embedder down), then another relative, and so on. This could be resolved with a more DAG-like approach during a reindex: all queues upstream of the relation embedder subsystem should be empty before it starts embedding relations. Although this introduces a bottleneck at that point, I believe it would actually speed the pipeline up, because the repeat processing would be removed. At present, the progress figure of n% of records having been processed is somewhat illusory: the right number of records have made it through the system, but some of them are incomplete.
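A hedged sketch of what that gate might look like, assuming the pipeline's inter-service queues are SQS (the queue URLs and the surrounding orchestration here are invented for illustration, not taken from the repo):

```python
# Hypothetical reindex gate: hold the relation embedder stage back until
# everything upstream of it has drained. Queue URLs are placeholders.
import time

import boto3

sqs = boto3.client("sqs")

UPSTREAM_QUEUE_URLS = [
    "https://sqs.eu-west-1.amazonaws.com/000000000000/upstream_queue_1",
    "https://sqs.eu-west-1.amazonaws.com/000000000000/upstream_queue_2",
]


def queue_is_empty(queue_url: str) -> bool:
    """True if the queue has no visible or in-flight messages."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )["Attributes"]
    return all(int(count) == 0 for count in attrs.values())


def wait_for_upstream_to_drain(poll_seconds: int = 60) -> None:
    while not all(queue_is_empty(url) for url in UPSTREAM_QUEUE_URLS):
        time.sleep(poll_seconds)
    # ...only now release the batcher / relation embedder for the reindex
```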
I suspect that if the batcher runs with a shorter timeout and fewer instances, it would mitigate both this problem and the one described in #2255. The reason a work goes through the relation embedder multiple times is that it is present in multiple batches; the reason it is in multiple batches output by the batcher is that it is relevant to multiple batches of messages pulled from the batcher input queue. There are two reasons for it to be in multiple batches of messages:
The shorter timeout (5 minutes) I added to fix #2255 leads to greater fragmentation, as the processes upstream of the batcher do not seem to produce anywhere near 100,000 records in 5 minutes. I suspect that the only way to truly resolve this, without resorting to unacceptably long timeouts, is the DAG approach.
I started a new reindex at about 13:15. With the new settings, it has reached 33% in about 30 minutes. Let's see how long the remaining two thirds takes; the times for the different fractions will give a clue as to whether the multiple processing of works in the relation embedder is significant.
- 50% in 45 minutes
- 75% in a little under 80 minutes
- 83% in 90 minutes
- 99% in 105 minutes, with about 5,000 left on relation_embedder_input and 400,000 on the ingestor queue
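For reference, a quick calculation of the incremental rate between those checkpoints (using only the approximate figures above):

```python
# Incremental throughput between the reported checkpoints,
# in percent of records per minute. Times are approximate.
checkpoints = [(0, 0), (30, 33), (45, 50), (80, 75), (90, 83), (105, 99)]  # (minutes, percent)

for (t0, p0), (t1, p1) in zip(checkpoints, checkpoints[1:]):
    print(f"{p0:>2}% -> {p1:>2}%: {(p1 - p0) / (t1 - t0):.2f} %/min")
```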
It has reached 100% with the queues looking like this:
From about 2 hours in, the only things upstream of ingestor_works_input appear to be regular "normal running" records coming through the system, not part of the reindex. However, the works ingestor started to fail at about 2 hrs 15 (135 minutes):
I ran the pipeline with one batcher with a 5-minute interval; these are the batch sizes of paths that it processed:
So, after a bit of a "warm up", it looks like about 30k per five minutes. I think 5 minutes is a bit too fragmented, but it's evidence that the 45-minute timeout is much too long. Looking at the pattern above, I suspect that 25 minutes is a good interval (first run about 50k, then it should carry on hitting 120k until upstream runs out of steam). This is consistent with there being roughly 0.5 million records that go through the batcher: processing them in bundles of 120k means running it 5 times at best (probably 6: 4 × 120k plus a smaller batch either side), and the whole pipeline takes a bit over 2 hours. A 25-minute wait at the end should not be a terrible price to pay.
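The batch-count arithmetic, spelled out using the rough figures quoted above:

```python
# Rough check of the 25-minute interval reasoning: ~0.5 million paths go
# through the batcher per reindex, and a full batch tops out around 120k.
import math

paths_per_reindex = 500_000
paths_per_batch = 120_000

best_case_runs = math.ceil(paths_per_reindex / paths_per_batch)  # 5
likely_runs = best_case_runs + 1  # a smaller batch at either end: ~6 runs

print(best_case_runs, likely_runs)  # 5 6
print(likely_runs * 25, "minutes of batcher runs")  # 150, i.e. a bit over 2 hours
```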
Having recently run a catalogue reindex (2022-11-17), this is the output of
python get_reindex_status.py 2022-11-17
before redriving any DLQs. Counterintuitively, although 99% of the nearly 3 million records (2.76 million works, 0.14 million images) have been "reindexed successfully", over 3 million are still to go (1.67 million left to index, and 1.76 million that have failed to index).
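Just adding up the figures in that output (all in millions of records, as reported above):

```python
# Totals from the get_reindex_status.py output, in millions of records.
works, images = 2.76, 0.14
still_to_index, failed_to_index = 1.67, 1.76

print(works + images)                    # ~2.9 million records in the catalogue
print(still_to_index + failed_to_index)  # ~3.43 million apparently still to go
```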