
Support Deferred Sending of Duplicate Results #33

Open · prof-m opened this issue Aug 5, 2023 · 0 comments

prof-m (Contributor) commented Aug 5, 2023

it's me, ya boy, back again with the potential dupes queue πŸ‘‰πŸΌ πŸ‘‰πŸΌ πŸ•ΆοΈ

Feature

The sending of found duplicates to the Hydrus Client is deferred until the end of the duplicate search process. If any duplicates cannot be sent to the Hydrus Client at that time (e.g. if the client is offline), those duplicates are saved locally and sent on a future run.

In other words, put your duplicates in a queue for safekeeping
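Roughly what I have in mind, sketched with a small SQLite table as the on-disk queue (the names `DupeQueue` and `send_fn` here are illustrative, not the project's actual API):

```python
import json
import sqlite3


class DupeQueue:
    """Persist potential-duplicate pairs locally until the client can accept them."""

    def __init__(self, path="dupe_queue.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pending (id INTEGER PRIMARY KEY, payload TEXT)"
        )
        self.conn.commit()

    def enqueue(self, pair):
        # Store the pair locally instead of sending it mid-search.
        self.conn.execute(
            "INSERT INTO pending (payload) VALUES (?)", (json.dumps(pair),)
        )
        self.conn.commit()

    def flush(self, send_fn):
        # Try to send every queued pair; anything that fails stays queued.
        for row_id, payload in self.conn.execute(
            "SELECT id, payload FROM pending"
        ).fetchall():
            try:
                send_fn(json.loads(payload))
            except ConnectionError:
                continue  # client offline: leave this pair for next run
            self.conn.execute("DELETE FROM pending WHERE id = ?", (row_id,))
        self.conn.commit()
```

At the end of the dupe search, `flush()` gets called with whatever function actually talks to the Client API; anything that fails to send just stays queued for the next run.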

Rationale

Easy stuff

  • clients (and network connections in general) are unreliable
  • the "phashing videos" step requires steady client access, but the "finding duplicates" step doesn't if it defers sending the dupes to the client api

The Large Library Problem

  • Let's call the time it takes to phash all the videos 'tP', and the time it takes to find all the duplicates 'tD'
  • On initial runs of the program, tP > tD in pretty much every case
  • However, once you've got all your videos in the database, subsequent runs of the program will start to have tD > tP - at least in the case where you have a large, pre-hashed library, and a relatively small number of new, un-hashed files.
  • That's because the number of new, unhashed files is small (relative to the total library size), while the number of comparisons needed to search for dupes is roughly (number of pre-hashed files * number of new files)

For example, let's take a database with 10,000 videos already phashed and dupe searched in it. The program is run, and finds 50 new videos in the client, all with similar durations to the videos already phashed. It phashes the 50 videos and adds them to the database. Then, it compares each of the 10,050 videos against the 50 new videos. Even if comparing a single video against the 50 new videos is a pretty quick process (and it is, thanks to parallelizing), comparing every single phashed video against the 50 new videos takes a lot longer than just phashing the 50 new videos.
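Rough numbers for that example (ignoring that a video isn't compared against itself):

```python
# Back-of-the-envelope math for the 10,000-video example above.
already_hashed = 10_000  # videos already phashed and dupe searched
new_videos = 50          # newly found videos this run

# Each of the 10,050 videos in the database gets compared against the 50 new ones.
comparisons = (already_hashed + new_videos) * new_videos
print(comparisons)  # 502500 comparisons, vs. only 50 phash jobs
```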

Considerations

  • As previously discussed, we want this option to be relatively low-impact to both the code and the average user (aka, don't do this by default)
  • Even when this option is used, we still want performance comparable to the non-deferred path.

PR: #34
