Fixes #864 - Adds text similarity/distance methods and double metaphone text encoder. #865
@@ -127,6 +127,7 @@ dependencies {
    compile group: 'com.github.javafaker', name: 'javafaker', version: '0.10'

    compile group: 'org.apache.commons', name: 'commons-math3', version: '3.6.1'
    compile group: 'org.apache.commons', name: 'commons-text', version: '1.2'
how big is this dependency?
Just checked on my machine, commons-text-1.2.jar is 136,544 bytes (139 KB on disk)
@@ -64,4 +67,52 @@ public double euclideanDistance(@Name("vector1") List<Number> vector1, @Name("ve
public double euclideanSimilarity(@Name("vector1") List<Number> vector1, @Name("vector2") List<Number> vector2) {
    return 1.0d / (1 + euclideanDistance(vector1, vector2));
}

@UserFunction
@Description("apoc.algo.levenshteinDistance(lhs, rhs) return the Levenshtein distance of two texts")
we already had that as the "distance" function in apoc.text; your contributions should go there.
nothing should go into the apoc.algo package until further notice, to reduce confusion with the neo4j-graph-algorithms library
    return 0.0;
}

JaroWinklerDistance jwd = new JaroWinklerDistance();
how expensive is the construction? e.g. in a tight loop?
5bace00 to ffb5c65
I’ve updated the PR with the following changes:
Great, thanks so much. Reusing instances only matters if object creation is really expensive (think Jackson ObjectMapper). Can you make sure the static instances are thread-safe? Otherwise, if the instances are not expensive to create, please revert it.
f42e6bf to d69f15e
I think they are all thread safe as they are all
5acbedf to c1167c8
Can you please add it to the docs, at least to overview.adoc?
I'm pretty sure the added methods are thread safe, since the Distance objects are immutable once instantiated, and @nammmm said he will add the documentation.
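To illustrate the immutability argument, here is a minimal sketch of why an object with no mutable state can be held in a static field and called from several threads at once. It uses only the JDK. `HammingLike`, `SHARED` and `runConcurrently` are hypothetical names made up for this example; `HammingLike` is a stand-in, not a commons-text class, so the sketch shows the general pattern and does not by itself prove anything about the library's classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SharedDistanceSketch {

    /** Hypothetical stand-in for a distance object whose state never changes after construction. */
    static final class HammingLike {
        private final boolean ignoreCase; // assigned once in the constructor, never reassigned

        HammingLike(boolean ignoreCase) {
            this.ignoreCase = ignoreCase;
        }

        int apply(String a, String b) {
            if (a.length() != b.length()) {
                throw new IllegalArgumentException("inputs must have the same length");
            }
            // All working state lives in local variables, so concurrent calls cannot interfere.
            int diff = 0;
            for (int i = 0; i < a.length(); i++) {
                char x = a.charAt(i);
                char y = b.charAt(i);
                if (ignoreCase) {
                    x = Character.toLowerCase(x);
                    y = Character.toLowerCase(y);
                }
                if (x != y) {
                    diff++;
                }
            }
            return diff;
        }
    }

    // One instance shared by every caller instead of constructing a new one per call.
    static final HammingLike SHARED = new HammingLike(false);

    /** Calls the shared instance from a small thread pool and collects the results. */
    static List<Integer> runConcurrently(int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            Callable<Integer> task = () -> SHARED.apply("karolin", "kathrin");
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runConcurrently(8));
    }
}
```

If the real objects turned out to keep mutable internal state, the same static-field pattern would need synchronization or per-thread instances instead.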
c1167c8 to d0a4dba
I have added the functions to overview.adoc. Let me know if there is anything else I need to do.
EDIT: All ready to go!
…ouble metaphone text encoder Added Apache commons-text dependency and used the LevenshteinDistance, HammingDistance and JaroWinklerDistance objects from the dependency to create the similarity/distance methods in Strings. Replaced deprecated StringUtils.getLevenshteinDistance with LevenshteinDistance. Added Double Metaphone encoding method in Phonetic.
d0a4dba to d4eae9f
Adds text similarity/distance methods and double metaphone text encoding.
Added Levenshtein Distance and Similarity code and test.
Added Hamming Distance code and test.
Added Jaro-Winkler Distance code and test.
Added Double Metaphone text encoding and test.
Added Apache commons-text dependency.
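For readers unfamiliar with the metrics listed above, the following is a minimal JDK-only sketch of two of them, Levenshtein distance and Hamming distance, plus one common way to turn an edit distance into a similarity score. It is not the code from this PR and does not use commons-text. The class name `TextDistanceSketch` and the normalization `1 - distance / maxLength` are assumptions chosen for illustration, and the PR's similarity function may be defined differently.

```java
class TextDistanceSketch {

    /** Levenshtein distance computed with the classic two-row dynamic programme. */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions turn the first i characters of a into the empty string
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Swap rows so that prev always holds the most recently completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed normalization: 1.0 for identical strings, 0.0 when every position must change. */
    public static double levenshteinSimilarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) levenshtein(a, b) / longest;
    }

    /** Hamming distance: number of differing positions; defined only for equal-length inputs. */
    public static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("inputs must have the same length");
        }
        int diff = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                diff++;
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        System.out.println(hamming("karolin", "kathrin"));    // prints 3
    }
}
```

Jaro-Winkler and Double Metaphone are left out of the sketch because they are considerably longer; the PR takes those from library code rather than reimplementing them.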