it's me, ya boy, back again with the potential dupes queue
Feature
Sending found duplicates to the Hydrus Client is deferred until the end of the duplicate search process. Any duplicates that cannot be sent to the Hydrus Client at that time (e.g. if the client is offline) are saved locally and sent on a future run.
In other words, put your duplicates in a queue for safekeeping
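A minimal sketch of the deferral idea, assuming a JSON file as the on-disk queue (the file name, function names, and format here are all hypothetical illustrations, not the actual implementation):

```python
import json
from pathlib import Path

# Hypothetical on-disk queue location
QUEUE_FILE = Path("potential_dupes_queue.json")

def flush_queue(pairs, send_to_client):
    """Try to send each queued duplicate pair to the Hydrus Client.

    Pairs that fail to send (e.g. the client is offline) are written
    back to disk so a future run can retry them.
    """
    failed = []
    for pair in pairs:
        try:
            send_to_client(pair)
        except ConnectionError:
            failed.append(pair)
    # Persist whatever couldn't be delivered for next time
    QUEUE_FILE.write_text(json.dumps(failed))
    return failed
```

The key design point is that a send failure never loses a found pair; it just moves it back into the queue file.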
Rationale
Easy stuff
clients (and network connections in general) are unreliable
the "phashing videos" step requires steady client access, but the "finding duplicates" step doesn't, provided it defers sending the dupes to the client API
The Large Library Problem
Let's call the time it takes to phash all the videos 'tP', and the time it takes to find all the duplicates 'tD'
On initial runs of the program, tP > tD in pretty much every case
However, once you've got all your videos in the database, subsequent runs of the program will start to have tD > tP - at least in the case where you have a large, pre-hashed library, and a relatively small number of new, un-hashed files.
That's because the number of new, unhashed files is small (relative to the total library size), while the number of comparisons needed for the dupe search is roughly (number of pre-hashed files * number of new files)
For example, let's take a database with 10,000 videos already phashed and dupe searched in it. The program is run, and finds 50 new videos in the client, all with similar durations to the videos already phashed. It phashes the 50 videos and adds them to the database. Then, it compares each of the 10,050 videos against the 50 new videos. Even if comparing a single video against the 50 new videos is a pretty quick process (and it is, thanks to parallelizing), comparing every single phashed video against the 50 new videos takes a lot longer than just phashing the 50 new videos.
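The example above can be reduced to a back-of-envelope cost model (the unit costs below are made up purely for illustration; only the scaling behavior matters):

```python
def work_estimate(prehashed, new, phash_cost=1.0, compare_cost=0.001):
    """Rough relative cost model with made-up unit costs.

    Phashing scales with the number of new files only; the dupe search
    scales with roughly (total files * new files) comparisons.
    """
    t_phash = new * phash_cost
    t_search = (prehashed + new) * new * compare_cost
    return t_phash, t_search

# 10,000 pre-hashed videos plus 50 new ones, as in the scenario above
t_p, t_d = work_estimate(prehashed=10_000, new=50)
```

With a fixed batch of new files, the search term grows linearly with library size while the phash term stays constant, which is why tD eventually overtakes tP on a large, mostly pre-hashed library.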
Considerations
As previously discussed, we want this option to be relatively low-impact to both the code and the average user (aka, don't do this by default)
Even when this option is used, we still want performance comparable to the default behavior.
PR: #34