
Support Deferred Sending of Duplicate Results #33

Open · prof-m opened this issue Aug 5, 2023 · 0 comments

prof-m (Contributor) commented Aug 5, 2023

it's me, ya boy, back again with the potential dupes queue πŸ‘‰πŸΌ πŸ‘‰πŸΌ πŸ•ΆοΈ

Feature

The sending of found duplicates to the Hydrus Client is deferred until the end of the duplicate search process. If any duplicates cannot be sent to the Hydrus Client at that time (e.g. if the client is offline), those duplicates are saved locally and sent on a future run.

In other words, put your duplicates in a queue for safekeeping
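Roughly what I have in mind, sketched with a small SQLite table as the on-disk queue (the names `DupeQueue` and `send_fn` here are illustrative, not the project's actual API):

```python
import json
import sqlite3


class DupeQueue:
    """Persist potential-duplicate pairs locally until the client can accept them."""

    def __init__(self, path="dupe_queue.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pending (id INTEGER PRIMARY KEY, payload TEXT)"
        )
        self.conn.commit()

    def enqueue(self, pair):
        # Store the pair locally instead of sending it mid-search.
        self.conn.execute(
            "INSERT INTO pending (payload) VALUES (?)", (json.dumps(pair),)
        )
        self.conn.commit()

    def flush(self, send_fn):
        # Try to send every queued pair; anything that fails stays queued.
        for row_id, payload in self.conn.execute(
            "SELECT id, payload FROM pending"
        ).fetchall():
            try:
                send_fn(json.loads(payload))
            except ConnectionError:
                continue  # client offline: leave this pair for next run
            self.conn.execute("DELETE FROM pending WHERE id = ?", (row_id,))
        self.conn.commit()
```

At the end of the dupe search, `flush()` gets called with whatever function actually talks to the Client API; anything that fails to send just stays queued for the next run.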

Rationale

Easy stuff

  • clients (and network connections in general) are unreliable
  • the "phashing videos" step requires steady client access, but the "finding duplicates" step doesn't if it defers sending the dupes to the client api

The Large Library Problem

  • Let's call the time it takes to phash all the videos 'tP', and the time it takes to find all the duplicates 'tD'
  • On initial runs of the program, tP > tD in pretty much every case
  • However, once you've got all your videos in the database, subsequent runs of the program will start to have tD > tP - at least in the case where you have a large, pre-hashed library, and a relatively small number of new, un-hashed files.
  • That's because the number of new, unhashed files is small (relative to the total library size), while the number of comparisons needed to search for dupes is roughly (number of pre-hashed files * number of new files)

For example, let's take a database with 10,000 videos already phashed and dupe searched in it. The program is run, and finds 50 new videos in the client, all with similar durations to the videos already phashed. It phashes the 50 videos and adds them to the database. Then, it compares each of the 10,050 videos against the 50 new videos. Even if comparing a single video against the 50 new videos is a pretty quick process (and it is, thanks to parallelizing), comparing every single phashed video against the 50 new videos takes a lot longer than just phashing the 50 new videos.
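Rough numbers for that example (ignoring that a video isn't compared against itself):

```python
# Back-of-the-envelope math for the 10,000-video example above.
already_hashed = 10_000  # videos already phashed and dupe searched
new_videos = 50          # newly found videos this run

# Each of the 10,050 videos in the database gets compared against the 50 new ones.
comparisons = (already_hashed + new_videos) * new_videos
print(comparisons)  # 502500 comparisons, vs. only 50 phash jobs
```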

Considerations

  • As previously discussed, we want this option to be relatively low-impact to both the code and the average user (aka, don't do this by default)
  • Even when this option is used, we still want performance comparable to the non-deferred path.

PR: #34
