Is it possible to print the AMOUNT of similarity along with similar files? #12
The fingerprint comparing code does actually return that value: Line 539 in a787e23
When I originally wrote the code, I probably intended to use the value. However, the grouping of filenames that happens immediately after that line means that there is not really any sensible way to present it. Sometimes the grouping code does the wrong thing: similar(a,b) and similar(b,c) do not imply similar(a,c), although the code pretends they do. Improving the clustering of groupings is on my todo list (I suspect the maths needed may be more advanced than anything I did in school). I've also been thinking about adding an option to disable grouping and just present the pairs of filenames. It doesn't help you now, but if I do add a disable-grouping option, then I shall certainly consider presenting the similarity score in that output. Thank you for the suggestion.
This begs the question, how does your program identify that I don't know perl very well but, looking at the code, it looks like you're comparing two files at a time. So then I assume the code is:
So what happens if the % matches for them are mixed? For example, what if the % match for 1 and 2 is 95% but the match for 1 and 3 is 85%?
The grouping algorithm is broken. Consider a simple case with cut-down fingerprints and a 2-bit cutoff:
We might start out with diffbits returning:

[ (f1,f2,1), (f1,f3,2), (f2,f3,1), (f2,f4,2), (f3,f4,1), (f3,f5,2), (f4,f5,1), (f4,f6,2), (f5,f6,1) ]

The current grouping algorithm iterates over the fingerprint pairs, naively creating/coalescing groups when it finds a common element. Along the lines of:

( similar(f1,f2), similar(f1,f3) ) => similar(f1,f2,f3)
( similar(f1,f2,f3), similar(f2,f3) ) => similar(f1,f2,f3)
( similar(f1,f2,f3), similar(f3,f4) ) => similar(f1,f2,f3,f4)

and so on. Eventually it will incorrectly return a single coalesced group similar(f1,f2,f3,f4,f5,f6) even though clearly f1 is not similar to f6.

We should actually end up with overlapping groups. Perhaps something like:

similar(f1,f2,f3), similar(f2,f3,f4), similar(f3,f4,f5), similar(f4,f5,f6)

I believe results are generally reasonable if findimage is asked to group results of searching for similarities to a small number of needles N in a large haystack H. The problems arise when N is large (e.g. setting N=H).
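To make the example above concrete, here is a rough Python sketch (not the project's Perl code) that contrasts the two behaviours on the same nine pairs. The function names `naive_groups` and `overlapping_groups` are made up for illustration. The first coalesces on any shared member, as described above. The second treats the pairs as a graph and lists its maximal cliques using a basic Bron-Kerbosch recursion, which is only one possible way to obtain overlapping groups and is an assumption on my part, not something the project implements.

```python
from collections import defaultdict

# (file_a, file_b, differing_bits) triples copied from the example above.
PAIRS = [
    ("f1", "f2", 1), ("f1", "f3", 2), ("f2", "f3", 1),
    ("f2", "f4", 2), ("f3", "f4", 1), ("f3", "f5", 2),
    ("f4", "f5", 1), ("f4", "f6", 2), ("f5", "f6", 1),
]
CUTOFF = 2  # the 2-bit cutoff from the example


def naive_groups(pairs, cutoff):
    """Coalesce groups whenever a pair shares a member with an existing group."""
    groups = []
    for a, b, bits in pairs:
        if bits > cutoff:
            continue
        touching = [g for g in groups if a in g or b in g]
        merged = {a, b}.union(*touching)
        groups = [g for g in groups if g not in touching] + [merged]
    return [sorted(g) for g in groups]


def overlapping_groups(pairs, cutoff):
    """List maximal sets in which every member is similar to every other."""
    neighbours = defaultdict(set)
    for a, b, bits in pairs:
        if bits <= cutoff:
            neighbours[a].add(b)
            neighbours[b].add(a)

    cliques = []

    def expand(clique, candidates, excluded):
        # Nothing left to add and nothing already covered: clique is maximal.
        if not candidates and not excluded:
            cliques.append(sorted(clique))
            return
        for v in sorted(candidates):
            expand(clique | {v}, candidates & neighbours[v], excluded & neighbours[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(neighbours), set())
    return sorted(cliques)


if __name__ == "__main__":
    print(naive_groups(PAIRS, CUTOFF))
    print(overlapping_groups(PAIRS, CUTOFF))
```

On this input the naive version returns one group containing all six files, while the clique version returns the four overlapping triples suggested above. Enumerating maximal cliques can be expensive on large, dense inputs, so treat this as an illustration of the idea, not a drop-in fix.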
I see. Yeah, I think a no-grouping option would make more sense, where it just tells you the pairs of images that are similar and their percent similarity.
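A no-grouping output like that could look roughly like the Python sketch below. It is hypothetical and not part of the tool: `similarity_percent` and `format_pairs` are invented names, the percentage is simply the share of fingerprint bits that match, and the fingerprint length is passed in as a parameter because this thread does not state it (256 is used purely as an example value).

```python
def similarity_percent(differing_bits, fingerprint_bits):
    """Share of fingerprint bits that are identical, as a percentage."""
    return 100.0 * (fingerprint_bits - differing_bits) / fingerprint_bits


def format_pairs(pairs, fingerprint_bits):
    """One line per pair, most similar first: '<percent>\t<file_a>\t<file_b>'."""
    lines = []
    for a, b, bits in sorted(pairs, key=lambda p: p[2]):
        percent = similarity_percent(bits, fingerprint_bits)
        lines.append(f"{percent:.1f}%\t{a}\t{b}")
    return lines


if __name__ == "__main__":
    # Made-up file names and bit differences, for illustration only.
    example = [("a.jpg", "b.jpg", 64), ("a.jpg", "c.jpg", 0)]
    for line in format_pairs(example, fingerprint_bits=256):
        print(line)
```

Because each pair is reported on its own, mixed scores (say, one pair much closer than another) show up directly instead of being hidden inside a group.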
I see the output puts similar files on one line.
Is it possible to also include the % similarity for each of them?