Is it possible to print the AMOUNT of similarity along with similar files? #12

imthenachoman · 2023-06-27T13:56:06Z

I see the output puts similar files on one line.

Is it possible to also include the % similarity for each of them?

jhnc · 2023-06-27T18:10:45Z

The fingerprint comparing code does actually return that value:

Line 539 in a787e23

my $c = shift(@matches);

When I originally wrote the code, I probably intended to use the value. However, the grouping of filenames that happens immediately after that line means that there is not really any sensible way to present it.

Sometimes the grouping code does the wrong thing: similar(a,b) and similar(b,c) do not imply similar(a,c), although the code pretends they do. Improving the clustering of groupings is on my todo list (I suspect the maths needed may be more advanced than anything I did in school). I've also been thinking about adding an option to disable grouping and just present the pairs of filenames.

It doesn't help you now, but if I do add an option to disable grouping, then I shall certainly consider presenting the similarity score in that output. Thank you for the suggestion.

imthenachoman · 2023-06-28T02:44:12Z

similar(a,b) and similar(b,c) does not imply similar(a,c)

This raises the question: how does your program identify that a, b, and c are similar and then group them together?

I don't know Perl very well but, looking at the code, it looks like you're comparing two files at a time. So I assume the code is:

first comparing a to b
then comparing b to c
then comparing a to c

So what happens if the % matches for them are mixed? For example, what if the % match for 1 and 2 is 95% but the match for 1 and 3 is 85%?

jhnc · 2023-06-28T07:24:45Z

The grouping algorithm is broken.

Consider a simple case with cut-down fingerprints and a 2-bit cutoff:

f1 = 00000001
f2 = 00000011
f3 = 00000111
f4 = 00001111
f5 = 00011111
f6 = 00111111

We might start out with diffbits returning: [ (f1,f2,1), (f1,f3,2), (f2,f3,1), (f2,f4,2), (f3,f4,1), (f3,f5,2), (f4,f5,1), (f4,f6,2), (f5,f6,1) ]
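The pair list above can be reproduced with a short sketch. It is written in Python rather than the Perl that findimagedupes uses, and `hamming` and `similar_pairs` are hypothetical helper names, not functions from the project. It assumes that the reported number is the count of differing bits between two fingerprints and that a pair is kept when that count is at most the cutoff.

```python
from itertools import combinations

# Cut-down 8-bit fingerprints from the example above.
fingerprints = {
    "f1": 0b00000001,
    "f2": 0b00000011,
    "f3": 0b00000111,
    "f4": 0b00001111,
    "f5": 0b00011111,
    "f6": 0b00111111,
}


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def similar_pairs(fps: dict, cutoff: int) -> list:
    """Return (name1, name2, differing_bits) for every pair within the cutoff."""
    result = []
    for (n1, v1), (n2, v2) in combinations(fps.items(), 2):
        d = hamming(v1, v2)
        if d <= cutoff:
            result.append((n1, n2, d))
    return result


pairs = similar_pairs(fingerprints, cutoff=2)
for pair in pairs:
    print(pair)
```

Note that (f1,f6) never appears: those two fingerprints differ in 5 bits, well over the cutoff.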

The current grouping algorithm iterates over the fingerprint pairs, naively creating/coalescing groups when it finds a common element. Along the lines of:

( similar(f1,f2), similar(f1,f3) ) => similar(f1,f2,f3)

( similar(f1,f2,f3), similar(f2,f3) ) => similar(f1,f2,f3)

( similar(f1,f2,f3), similar(f3,f4) ) => similar(f1,f2,f3,f4)

and so on. Eventually it will incorrectly return a single coalesced group similar(f1,f2,f3,f4,f5,f6) even though clearly f1 is not similar to f6.
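The coalescing behaviour described above can be sketched as follows. This is an illustrative Python reconstruction of the idea, not the actual Perl grouping code; `coalesce` is a hypothetical name, and the pair list is the one from the example.

```python
def coalesce(pairs):
    """Naively merge pairs into groups whenever they share an element."""
    groups = []  # list of disjoint sets
    for a, b in pairs:
        merged = {a, b}
        untouched = []
        for group in groups:
            if group & merged:
                # The pair shares a member with this group: absorb it.
                merged |= group
            else:
                untouched.append(group)
        groups = untouched + [merged]
    return groups


pairs = [
    ("f1", "f2"), ("f1", "f3"), ("f2", "f3"),
    ("f2", "f4"), ("f3", "f4"), ("f3", "f5"),
    ("f4", "f5"), ("f4", "f6"), ("f5", "f6"),
]
groups = coalesce(pairs)
print([sorted(g) for g in groups])
```

Every pair overlaps with the group built so far, so the chain of small differences ends up in one group containing all six fingerprints.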

We should actually end up with overlapping groups. Perhaps something like: similar(f1,f2,f3), similar(f2,f3,f4), similar(f3,f4,f5), similar(f4,f5,f6)
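One way to obtain overlapping groups like these is to treat each similar pair as an edge in a graph and list its maximal cliques, that is, the largest sets in which every member is similar to every other member. The sketch below uses the basic Bron-Kerbosch algorithm in Python. This is a swapped-in technique offered as an assumption about what "correct" grouping could mean, not something the thread says findimagedupes does or will do, and `maximal_cliques` is a hypothetical name.

```python
def maximal_cliques(pairs):
    """List maximal cliques of the similarity graph (basic Bron-Kerbosch)."""
    adjacent = {}
    for a, b in pairs:
        adjacent.setdefault(a, set()).add(b)
        adjacent.setdefault(b, set()).add(a)

    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in sorted(candidates):
            expand(current | {v}, candidates & adjacent[v], excluded & adjacent[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacent), set())
    return sorted(cliques)


pairs = [
    ("f1", "f2"), ("f1", "f3"), ("f2", "f3"),
    ("f2", "f4"), ("f3", "f4"), ("f3", "f5"),
    ("f4", "f5"), ("f4", "f6"), ("f5", "f6"),
]
overlapping = maximal_cliques(pairs)
for group in overlapping:
    print(group)
```

For the example pairs this yields the four overlapping groups listed above. Enumerating maximal cliques can become expensive on large, dense similarity graphs, so this is a way to illustrate the target output rather than a recommendation for large collections.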

I believe results are generally reasonable if findimage is asked to group the results of searching for similarities to a small number of needles N in a large haystack H. The problems arise when N is large (e.g. setting N=H).

imthenachoman · 2023-06-28T18:38:50Z

I see.

Yeah, I think a no-grouping option would make more sense, where it just tells you which pairs of images are similar and their percent similarity.
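A pair-wise listing of this kind might look like the sketch below. No such option exists in the thread; the output layout, the function names, and the conversion from differing bits to a percentage (the share of matching bits) are all assumptions made for illustration, using the 8-bit cut-down fingerprints from the earlier example.

```python
def percent_similarity(differing_bits: int, total_bits: int) -> float:
    """Assumed definition: percentage of fingerprint bits that match."""
    return 100.0 * (total_bits - differing_bits) / total_bits


def format_pairs(pairs, total_bits):
    """One tab-separated line per pair: similarity, first name, second name."""
    return [
        f"{percent_similarity(d, total_bits):.1f}%\t{a}\t{b}"
        for a, b, d in pairs
    ]


# (name1, name2, differing_bits) triples taken from the example above.
example_pairs = [("f1", "f2", 1), ("f1", "f3", 2), ("f2", "f3", 1)]
lines = format_pairs(example_pairs, total_bits=8)
print("\n".join(lines))
```

Putting the score first and separating fields with tabs keeps each line easy to sort and parse even when filenames contain spaces.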

jhnc added the enhancement (New feature or request) label Jun 27, 2023

jhnc mentioned this issue Sep 12, 2023

how i can extract the results in a csv or text file, with every similar files separated with a new line? #14

Open

jhnc mentioned this issue Jun 18, 2024

How to process the output files which paths contains spaces? The output can not be customized by the findimagedupes ? #20

Closed
