Command syntax to list duplicates found in defined input folder and not duplicates within signature database/index #19

Open
slrslr opened this issue Jun 15, 2024 · 2 comments

slrslr commented Jun 15, 2024

I want to list duplicates across folders 1 and 2, both of which contain duplicates, and exclude duplicates that are not cross-folder (for example, duplicate pairs that exist only within the input folder passed on STDIN).

A) Indexing only folder 1 ---> fails to find folder 2 duplicate/s for some reason:

cd /dev/shm && mkdir 1 2 && 
wget -q "https://avatars.githubusercontent.com/u/6596726?s=80&v=4" -O 1/s.jpg && 
wget -q "https://avatars.githubusercontent.com/u/5080531?s=52&v=4" -O 1/j.jpg && 
cp 1/s.jpg 1/s_copy.jpg && cp 1/s.jpg 2/s_copy.jpg && 
cp 1/j.jpg 1/j_copy.jpg
ls 1 2 && 
findimagedupes -f index -n -- "1/" && 
ls -A1 "2/" | findimagedupes -f index -t 100% -- - && 
find 1 2 index -delete;cd - 1>/dev/null

B) Indexing folders 1 and 2 ---> finds the folder 2 duplicates, but also displays duplicates within a single folder (1), which I do not want. I want to ignore folder 1 because duplicates inside it are OK:

cd /dev/shm && mkdir 1 2 && 
wget -q "https://avatars.githubusercontent.com/u/6596726?s=80&v=4" -O 1/s.jpg && 
wget -q "https://avatars.githubusercontent.com/u/5080531?s=52&v=4" -O 1/j.jpg && 
cp 1/s.jpg 1/s_copy.jpg && cp 1/s.jpg 2/s_copy.jpg && 
cp 1/j.jpg 1/j_copy.jpg
ls 1 2 && 
findimagedupes -f index -n -- "1/" && 
findimagedupes -f index -n -- "2/" && 
ls -A1 "2/" | findimagedupes -f index -t 100% -- - && 
find 1 2 index -delete;cd - 1>/dev/null

C) Indexing only folder 2 ---> fails to find any duplicates:

cd /dev/shm && mkdir 1 2 && 
wget -q "https://avatars.githubusercontent.com/u/6596726?s=80&v=4" -O 1/s.jpg && 
wget -q "https://avatars.githubusercontent.com/u/5080531?s=52&v=4" -O 1/j.jpg && 
cp 1/s.jpg 1/s_copy.jpg && cp 1/s.jpg 2/s_copy.jpg && 
cp 1/j.jpg 1/j_copy.jpg
ls 1 2 && 
findimagedupes -f index -n -- "2/" && 
ls -A1 "2/" | findimagedupes -f index -t 100% -- - && 
find 1 2 index -delete;cd - 1>/dev/null

I find it very hard to work out the correct syntax for the result described above, and to understand how the program treats the input folder/files versus the signature index and its contents.

  1. How do I exclude duplicates within the signature index and display only those that span the index and the STDIN folder/files?
  2. Does including the input files/folder in the signature index affect the results (as in cases A, B and C above), and if so, how?
Owner

jhnc commented Jun 15, 2024

A)

ls -A1 "2/" prints s_copy.jpg, but for findimagedupes to be able to find the file, the directory name has to be included (i.e. 2/s_copy.jpg). So your second findimagedupes command should be:

ls -A1 "2/"* | findimagedupes -f index -t 100% -- -

or just:

findimagedupes -f index -t 100% -- 2/
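
The path difference can be checked without findimagedupes. The following is a minimal sketch that uses a throwaway folder (the temporary layout and variable names are made up for illustration) to show what each way of listing folder 2 actually prints. `find` is included as one more option that keeps the directory prefix; it is not mentioned in the thread.

```shell
# Build a throwaway folder that mimics "2/" from the report.
tmp=$(mktemp -d)
mkdir "$tmp/2"
touch "$tmp/2/s_copy.jpg"
cd "$tmp" || exit 1

bare_names=$(ls -A1 "2/")    # names relative to 2/, without the directory
with_paths=$(ls -A1 "2/"*)   # the shell glob expands to paths that include 2/
found=$(find 2 -type f)      # find also prints the directory prefix

printf 'ls -A1 "2/"    -> %s\n' "$bare_names"
printf 'ls -A1 "2/"*   -> %s\n' "$with_paths"
printf 'find 2 -type f -> %s\n' "$found"
```

Only the last two forms produce names that resolve from the current directory, which is what a file list read from STDIN needs. Note that the `2/*` glob skips dotfiles, while `find` includes them.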

B)

  • you have the same problem with the path: it must be ls -A1 "2/"*
  • you have not provided the -a argument

This invocation should give the desired output:

ls -A1 "2/"* | findimagedupes -f index -t 100% -a -- -

Note that the first two invocations could be combined as:

findimagedupes -f index -n -- "1/" "2/"
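
Putting the pieces together, a cross-folder check could be wrapped as below. This is only a sketch: the function name `cross_folder_dupes` is hypothetical, the flags are exactly the ones used in this thread (`-f`, `-n`, `-t`, `-a`), and `find` stands in for `ls` to produce a file list with the directory prefix intact. It indexes only the first folder and then asks for duplicates of the second folder's files, so it assumes `-a` behaves as described above.

```shell
# Hypothetical helper, not part of findimagedupes.
# $1 = fingerprint database, $2 = folder to index, $3 = folder to check
cross_folder_dupes() {
    # Fingerprint the first folder into the database (-n is used for
    # this step throughout the thread).
    findimagedupes -f "$1" -n -- "$2" || return 1

    # Feed the second folder's files on STDIN, with their directory prefix,
    # and use -a so only duplicates of these files are reported.
    find "$3" -maxdepth 1 -type f |
        findimagedupes -f "$1" -t 100% -a -- -
}
```

Usage would look like `cross_folder_dupes index 1/ 2/`. File names containing newlines would break the STDIN list, so this is only suitable for ordinary names.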

C)

As above, you have not specified a valid file list, so findimagedupes has no files to examine.

If you fix that problem (e.g. ls -A1 "2/"* | ...), you run into a second issue: there is only 1 file in the folder. findimagedupes does not consider a file to be a duplicate of itself. There is a note at the end of the manpage:

Repetitions are culled before comparisons take place, so a commandline like "findimagedupes a.jpg a.jpg" will not produce a match.

This applies even if the second filename is hidden inside a database, as it is in your command.


I hope that helps clarify what is happening.

Author

slrslr commented Jun 17, 2024

Thank you for fixing my commands. Since I suspect few people can remember the correct switches and the use of asterisks, or quickly digest the findimagedupes manual, here are some example commands whose syntax should be correct:

BUILDING index of images:

findimagedupes -R -q -f $HOME/findimagedupes.index --prune -n -- "/images-folder/"

(to not descend into sub-directories, remove "-R"; the same applies to the commands below)
(to index multiple folders, list them after the first one, separated by spaces)

IMAGE VS INDEX - Duplicates of a single image inside the already built index:

findimagedupes -q -f $HOME/findimagedupes.index -a -- "/path/to/image.jpg"

(to also show less similar images, replace "-a" with "-t 70% -a")

FOLDER VS INDEX - Duplicates between a given folder and an already built index (duplicates that exist only inside the index are not shown, thanks to the -a switch):

findimagedupes -R -q -f $HOME/findimagedupes.index -t 100% -a -- /images-folder-2/

FOLDER VS INDEX & WITHIN INDEX - ALL - Duplicates between a given folder and an already built index, including duplicates that exist only within the index:

findimagedupes -R -q -f $HOME/findimagedupes.index -t 100% -- /images-folder-2/

(decrease the -t value to also show less similar images)

WITHIN INDEX - Duplicates within an already built index:

findimagedupes -q -f $HOME/findimagedupes.index -t 100%


If you need to run a custom action on the duplicates found, here are some examples.
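
One possible custom action, sketched under assumptions: it assumes findimagedupes prints each set of duplicates on one line with the file names separated by spaces, and that no file name contains a space. The helper name `dupes_under` is hypothetical. It reads such lines on STDIN and prints only the members whose path starts with a given prefix, for example to collect the copies in folder 2 for review.

```shell
# Hypothetical helper, not part of findimagedupes.
# Reads duplicate sets (assumed: one set per line, space-separated paths)
# on STDIN and prints each path that starts with the prefix given as $1.
dupes_under() {
    awk -v prefix="$1" '{
        for (i = 1; i <= NF; i++)
            if (index($i, prefix) == 1)   # path begins with the prefix
                print $i
    }'
}
```

It could be placed at the end of a pipeline, e.g. `findimagedupes ... | dupes_under /images-folder-2/`, where the prefix has to match the form in which the paths are printed. Review the resulting list before deleting or moving anything.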
