Deduplicate messages on the ingestor queue #2459

Open
paul-butcher opened this issue Oct 4, 2023 · 1 comment

Comments

@paul-butcher
Contributor

Because of the behaviour of the relation embedder (see also #2256), the same record can be ingested multiple times during a reindex, or even during normal running when there is a particularly dendritic archive record (see Slack).

We can partially guard against this by changing the ingestor queue into a FIFO queue with content-based deduplication.

This will not catch every case: if a record has been completely processed before the duplicate message appears, it will still be processed again.

Using a FIFO queue will guard against the situation where a message to process a record is placed on the queue multiple times in quick succession, e.g. if the relation embedder processes it in two adjacent batches.
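
To make the two cases above concrete, here is a toy model of content-based deduplication. It is only a sketch, not the pipeline's code and not a queue client: `ToyFifoQueue` and the record id are hypothetical, and it assumes the ingestor queue would be an Amazon SQS FIFO queue, which this issue does not state. On that assumption, the deduplication ID is a SHA-256 hash of the message body and duplicates are dropped for a five-minute interval, which is what the 300-second default stands in for.

```python
import hashlib


class ToyFifoQueue:
    """Toy model of a FIFO queue with content-based deduplication (hypothetical)."""

    def __init__(self, dedup_window_seconds=300.0):
        # Assumption: a five-minute window, as documented for SQS FIFO queues.
        self.dedup_window_seconds = dedup_window_seconds
        self._last_accepted = {}  # body hash -> time that body was last accepted
        self.messages = []        # accepted message bodies, in arrival order

    def send(self, body, now):
        """Enqueue body unless an identical body was accepted within the window."""
        dedup_id = hashlib.sha256(body.encode("utf-8")).hexdigest()
        accepted_at = self._last_accepted.get(dedup_id)
        if accepted_at is not None and now - accepted_at < self.dedup_window_seconds:
            return False  # duplicate inside the window: dropped
        self._last_accepted[dedup_id] = now
        self.messages.append(body)
        return True


queue = ToyFifoQueue()
first = queue.send("record-abc", now=0.0)     # first message for the record
repeat = queue.send("record-abc", now=30.0)   # e.g. from the adjacent batch
late = queue.send("record-abc", now=400.0)    # arrives after the window has passed
```

Here `repeat` is dropped because it follows `first` within the window, while `late` is accepted and the record would be processed a second time. Note that in this model it is the time between messages that matters, not whether the first message has finished processing.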

@paul-butcher
Contributor Author

It may also be wise to deduplicate elsewhere. However, it is in the nature of the relation embedder to flood the ingestor with duplicates, whereas duplication further upstream is more likely to be caused by multiple successive changes in the source data arriving faster than the pipeline can process them.

Such rapid changes are less common than the relation embedder sending duplicate messages.
