Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #864 - Adds text similarity/distance methods and double metaphone text encoder. #865

Merged

Conversation

alexiudice
Copy link
Contributor

Adds text similarity/distance methods and double metaphone text encoding.

Added Levenshtein Distance and Similarity code and test.

Added Hamming Distance code and test.

Added Jaro-Winkler Distance code and test.

Added Double Metaphone text encoding and test.

Added Apache commons-text dependency.

@alexiudice alexiudice changed the title fixes #864: Adds text similarity/distance methods and double metaphone text encoder Fixes #864 - Adds text similarity/distance methods and double metaphone text encoder. Jul 18, 2018
@@ -127,6 +127,7 @@ dependencies {
compile group: 'com.github.javafaker', name: 'javafaker', version: '0.10'

compile group: 'org.apache.commons', name: 'commons-math3', version: '3.6.1'
compile group: 'org.apache.commons', name: 'commons-text', version: '1.2'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how big is this dependency?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checked on my machine, commons-text-1.2.jar is 136,544 bytes (139 KB on disk)

@@ -64,4 +67,52 @@ public double euclideanDistance(@Name("vector1") List<Number> vector1, @Name("ve
public double euclideanSimilarity(@Name("vector1") List<Number> vector1, @Name("vector2") List<Number> vector2) {
return 1.0d / (1 + euclideanDistance(vector1, vector2));
}

@UserFunction
@Description("apoc.algo.levenshteinDistance(lhs, rhs) return the Levenshtein distance of two texts")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already had that as "distance" function in apoc.text
your contributions should go there.

@@ -64,4 +67,52 @@ public double euclideanDistance(@Name("vector1") List<Number> vector1, @Name("ve
public double euclideanSimilarity(@Name("vector1") List<Number> vector1, @Name("vector2") List<Number> vector2) {
return 1.0d / (1 + euclideanDistance(vector1, vector2));
}

@UserFunction
@Description("apoc.algo.levenshteinDistance(lhs, rhs) return the Levenshtein distance of two texts")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing should go into the apoc.algo package until further notice to reduce confusion with the neo4j-graph-algorithms library

return 0.0;
}

JaroWinklerDistance jwd = new JaroWinklerDistance();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how expensive is the construction? e.g. in a tight loop?

@alexiudice alexiudice force-pushed the story-864-3.4-distanceAndMetaphoneMethods branch from 5bace00 to ffb5c65 Compare July 19, 2018 13:14
@alexiudice
Copy link
Contributor Author

I’ve updated the PR with the following changes:

  1. I've moved all of the added functions in apoc.algo to apoc.text and the tests from SimilarityTest.java to StringsTest.java.

  2. I've changed apoc.text.distance function a bit. The apoc.text.levenshteinDistance function has a more descriptive name so I kept it in. It now it just calls apoc.text.distance. I edited the apoc.text.distance so it does not use the deprecated method StringUtils.getLevenshteinDistance

  3. All three distance functions were instantiating a new object every call which can be expensive in a loop. To remedy this I've replaced these with calls to three static final objects added in the Strings.java class so they are instantiated only one time.

@jexp
Copy link
Member

jexp commented Jul 19, 2018

Great, thanks so much.

For the reuse, it is only an issue if the object creation is really expensive (think jackson ObjectMapper) can you make sure the static instances are thread-safe? Otherwise please revert it back if it's not expensive to create the instances.

@nammmm nammmm force-pushed the story-864-3.4-distanceAndMetaphoneMethods branch 3 times, most recently from f42e6bf to d69f15e Compare July 19, 2018 15:32
@nammmm
Copy link

nammmm commented Jul 19, 2018

I think they are all thread safe as they are all static final fields. In addition, I just finished cleaning up the code (code style, description and parameter names).

@nammmm nammmm force-pushed the story-864-3.4-distanceAndMetaphoneMethods branch 2 times, most recently from 5acbedf to c1167c8 Compare July 19, 2018 15:58
@jexp
Copy link
Member

jexp commented Jul 19, 2018

Can you please add it to the docs, at least to overview.adoc ?
Thanks so much.

@alexiudice
Copy link
Contributor Author

alexiudice commented Jul 19, 2018

I'm pretty sure the added methods are thread safe since the Distance objects are immutable once instantiated and the apply methods parameters are final. We could add a synchronized block around the apply method calls if we really want to make sure they are thread safe.

@nammmm said he will add the documentation.

@nammmm nammmm force-pushed the story-864-3.4-distanceAndMetaphoneMethods branch from c1167c8 to d0a4dba Compare July 19, 2018 17:44
@nammmm
Copy link

nammmm commented Jul 19, 2018

I have added the functions to overview.adoc. Let me know if there is anything else I need to do.

@alexiudice
Copy link
Contributor Author

alexiudice commented Jul 19, 2018

Please do not merge yet! Testing some last things.

EDIT: All ready to go!

…ouble metaphone text encoder

Added Apache commons-text dependency and used the LevenshteinDistance, HammingDistance and JaroWinklerDistance objects from the dependency to create the similarity/distance methods in Strings. Replaced deprecated StringUtils.getLevenshteinDistance with LevenshteinDistance.

Added Double Metaphone encoding method in Phonetic.
@nammmm nammmm force-pushed the story-864-3.4-distanceAndMetaphoneMethods branch from d0a4dba to d4eae9f Compare July 19, 2018 18:25
@jexp jexp merged commit bbc36ef into neo4j-contrib:3.4 Jul 20, 2018
jexp pushed a commit that referenced this pull request Jul 23, 2018
…ne text encoder (#865)

Added Apache commons-text dependency and used the LevenshteinDistance, HammingDistance and JaroWinklerDistance objects from the dependency to create the similarity/distance methods in Strings. 
Replaced deprecated StringUtils.getLevenshteinDistance with LevenshteinDistance.

Added Double Metaphone encoding method in Phonetic.
alexiudice added a commit to graphgrid/neo4j-apoc-procedures that referenced this pull request Jul 25, 2018
…ouble metaphone text encoder (neo4j-contrib#865)

Added Apache commons-text dependency and used the LevenshteinDistance, HammingDistance and JaroWinklerDistance objects from the dependency to create the similarity/distance methods in Strings. 
Replaced deprecated StringUtils.getLevenshteinDistance with LevenshteinDistance.

Added Double Metaphone encoding method in Phonetic.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants