Is it possible to print the AMOUNT of similarity along with similar files? #12

Open
imthenachoman opened this issue Jun 27, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@imthenachoman

I see the output puts similar files on one line.

Is it possible to also include the % similarity for each of them?

@jhnc jhnc added the enhancement New feature or request label Jun 27, 2023
@jhnc
Owner

jhnc commented Jun 27, 2023

The fingerprint comparing code does actually return that value:

my $c = shift(@matches);

When I originally wrote the code, I probably intended to use the value. However, the grouping of filenames that happens immediately after that line means that there is not really any sensible way to present it.

Sometimes the grouping code does the wrong thing: similar(a,b) and similar(b,c) does not imply similar(a,c), although the code pretends it does. Improving the clustering of groupings is on my todo list (I suspect the maths needed may be more advanced than anything I did in school). I've also been thinking about adding an option to disable grouping and just present the pairs of filenames.

It doesn't help you now but if I do add a disable grouping option, then I shall certainly consider presenting the similarity score in that output. Thank you for the suggestion.

@imthenachoman
Author

> similar(a,b) and similar(b,c) does not imply similar(a,c)

This raises the question: how does your program identify that a, b, and c are similar and then group them together?

I don't know Perl very well but, looking at the code, it looks like you're comparing two files at a time. So I assume the code is:

  1. first comparing a to b
  2. then comparing b to c
  3. then comparing a to c

So what happens if the % matches for them are mixed? For example, what if the match for 1 and 2 is 95% but the match for 1 and 3 is 85%?

@jhnc
Owner

jhnc commented Jun 28, 2023

The grouping algorithm is broken.

Consider a simple case with cut-down fingerprints and a 2-bit cutoff:

f1 = 00000001
f2 = 00000011
f3 = 00000111
f4 = 00001111
f5 = 00011111
f6 = 00111111

We might start out with diffbits returning: [ (f1,f2,1), (f1,f3,2), (f2,f3,1), (f2,f4,2), (f3,f4,1), (f3,f5,2), (f4,f5,1), (f4,f6,2), (f5,f6,1) ]
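As a rough illustration, that pair list can be reproduced with a few lines of Python. This is only a sketch: findimage itself is Perl, and the `diffbits` function below just borrows the name used above; it is not the project's actual code. It counts the differing bits for every pair and keeps the pairs within the 2-bit cutoff.

```python
from itertools import combinations

# Cut-down 8-bit fingerprints from the example above.
fingerprints = {
    "f1": 0b00000001,
    "f2": 0b00000011,
    "f3": 0b00000111,
    "f4": 0b00001111,
    "f5": 0b00011111,
    "f6": 0b00111111,
}
CUTOFF = 2  # maximum number of differing bits still treated as "similar"


def diffbits(a: int, b: int) -> int:
    """Hamming distance: the number of bit positions where a and b differ."""
    return bin(a ^ b).count("1")


# Every unordered pair whose distance is within the cutoff.
pairs = [
    (x, y, diffbits(fingerprints[x], fingerprints[y]))
    for x, y in combinations(fingerprints, 2)
    if diffbits(fingerprints[x], fingerprints[y]) <= CUTOFF
]
print(pairs)
```

Pairs such as (f1,f4) or (f1,f6) never appear in the list because they differ by more than 2 bits.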

The current grouping algorithm iterates over the fingerprint pairs, naively creating/coalescing groups when it finds a common element. Along the lines of:

( similar(f1,f2), similar(f1,f3) ) => similar(f1,f2,f3)

( similar(f1,f2,f3), similar(f2,f3) ) => similar(f1,f2,f3)

( similar(f1,f2,f3), similar(f3,f4) ) => similar(f1,f2,f3,f4)

and so on. Eventually it will incorrectly return a single coalesced group similar(f1,f2,f3,f4,f5,f6) even though clearly f1 is not similar to f6.
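The effect can be sketched in Python. This is a simplified stand-in for the coalescing behaviour described above, not the actual Perl code; the function name `naive_groups` is made up for the illustration.

```python
# The similar pairs from the example (bit counts dropped).
pairs = [
    ("f1", "f2"), ("f1", "f3"), ("f2", "f3"),
    ("f2", "f4"), ("f3", "f4"), ("f3", "f5"),
    ("f4", "f5"), ("f4", "f6"), ("f5", "f6"),
]


def naive_groups(pairs):
    """Merge pairs into groups whenever they share an element.

    This treats similarity as transitive, which is the flaw being described.
    """
    groups = []  # list of disjoint sets
    for a, b in pairs:
        # Existing groups that already contain either member of the pair.
        touching = [g for g in groups if a in g or b in g]
        merged = {a, b}.union(*touching)
        # Replace the touched groups with the single merged group.
        groups = [g for g in groups if not (a in g or b in g)]
        groups.append(merged)
    return groups


groups = naive_groups(pairs)
print([sorted(g) for g in groups])
```

With the nine pairs above, everything chains together into one group of six, even though f1 and f6 differ by 5 bits.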

We should actually end up with overlapping groups. Perhaps something like: similar(f1,f2,f3), similar(f2,f3,f4), similar(f3,f4,f5), similar(f4,f5,f6)
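One possible way to get overlapping groups like these (an assumption on my part, not something findimage does) is to treat the similar pairs as edges of a graph and list its maximal cliques, i.e. the largest sets in which every member is similar to every other member. The Python sketch below uses the basic Bron-Kerbosch algorithm; the name `maximal_cliques` is hypothetical.

```python
from collections import defaultdict

pairs = [
    ("f1", "f2"), ("f1", "f3"), ("f2", "f3"),
    ("f2", "f4"), ("f3", "f4"), ("f3", "f5"),
    ("f4", "f5"), ("f4", "f6"), ("f5", "f6"),
]


def maximal_cliques(pairs):
    """Return all maximal cliques of the similarity graph (Bron-Kerbosch)."""
    adj = defaultdict(set)
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)

    cliques = []

    def expand(current, candidates, excluded):
        # No candidates and nothing excluded: current cannot be extended.
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


for clique in maximal_cliques(pairs):
    print("similar(" + ",".join(clique) + ")")
```

For the six example fingerprints this yields the four overlapping groups listed above. Enumerating maximal cliques can take exponential time in the worst case, so it may not scale to a large haystack.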

I believe results are generally reasonable if findimage is asked to group results of searching for similarities to a small number of needles N in a large haystack H. The problems arise when N is large (e.g., setting N=H).

@imthenachoman
Author

I see.

Yeah, I think a no-grouping option would make more sense, where it just tells you pairs of images that are similar and their percent similarity.
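A rough Python sketch of what such output could look like is below. Everything here is hypothetical: the function name, the filenames, and in particular the percent formula, which simply assumes similarity is the share of fingerprint bits that match. findimage may define its score differently.

```python
from itertools import combinations


def similar_pairs(fingerprints, nbits, cutoff):
    """Yield (name_a, name_b, percent) for each pair within the bit cutoff.

    Assumes percent similarity = matching bits / total bits * 100.
    """
    for (name_a, fp_a), (name_b, fp_b) in combinations(fingerprints.items(), 2):
        differing = bin(fp_a ^ fp_b).count("1")
        if differing <= cutoff:
            yield name_a, name_b, 100.0 * (nbits - differing) / nbits


# Made-up filenames with 8-bit fingerprints, purely for illustration.
fps = {"a.jpg": 0b00000001, "b.jpg": 0b00000011, "c.jpg": 0b00001111}

for a, b, pct in similar_pairs(fps, nbits=8, cutoff=2):
    print(f"{pct:5.1f}%  {a}  {b}")
```

Each line then stands on its own, so a 95% pair and an 85% pair are reported separately instead of being folded into one group.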
