A java based library to match and group "similar" elements in a collection of document.
In system with a collection of contacts. If we wanted to match and categorize contacts with similar names, location where they live or with any other attribute. This matching algorithm helps to achieve it using Fuzzy match. In fact, you may even try to find out if you have already added duplicate contacts, or you can use this to prevent your system from adding one.
This library can act on any domain object like contact and find similarity for various use cases. It dives deep into each character and finds out the probability that 2 or more of these objects are similar.
The input provided goes through 4 stages to come up with a match result. These 4 stages are functions defined which can be easily configured by passing a lambda expression.
-
Pre-Processing : Expects a
Function<String, String>
which a simple transformation from a String to another.- Trim: Removes leading and trailing spaces (applied by default)
- Lower Case: Converts all characters to lower case (applied by default)
- Remove Special Chars : Removes all characters except alphabets and numeric chars and space. (default for TEXT type)
- Numeric: Strips all non-numeric characters. Useful for numeric values like phone or ssn (default for PHONE type)
- Email: Strips away domain from an email. This prevents common domains like gmail.com, yahoo.com to be considered in match (default for EMAIL type)
-
Tokenization : Expects a
Function<Element, Stream<Token>>
which will break down each Element into Token object used for comparison.- Word : Breaks down an element into words (anything delimited by space " ").
e.g.
"123 some street" -> ["123" "some" "street"]
- N-Gram : Breaks down an element into 3 letter grams.
e.g.
"jparker" -> ["jpa","par","ark","rke","ker"]
- Word : Breaks down an element into words (anything delimited by space " ").
e.g.
-
Similarity Match : Expects a
BiFunction<Token, Token, Double>
, which gives a match probability (0.0 to 1.0) between 2 tokens- Soundex: Compares 2 token strings soundex values. Uses apache commons codec library to get soundex values. The result of this match is a binary either 0.0 or 1.0
- Equality: Does a string equals of 2 token strings. Again the result is either 0.0 or 1.0
- Levenshtein: Gets the Levenshtein distance score using apache commons similarity library
- Jaccard: Gets the Jaccard score using apache commons similarity library
-
Scoring : Expects a
BiFunction<Match, List<Score>, Double>
, this defines functions on how to accumulate scores from Tokens into Elements and from Elements into Documents.- Simple Average: Adds up total scores of each child matches / total children. This is the default scoring for Elements
- Weighted Average: This is useful for Document Scoring, where users can input weights on elements. Example a phone number or email could be considered an important element to identify match between 2 User objects, and we can add weights to such elements.
- Exponential Average: Again useful for Document Scoring, where if more than 1 element match, we can increase the scoring exponentially
- Exponential Weighted Average: Uses both an exponents and weights for scoring. This is the default for Document Scoring
Since this library can be used to match elements against a large set of records, knowing how it performs is essential.
This library makes use of Java 8 Stream to run all operations in parallel and makes optimum use of a multi-core cpu. Beyond finding duplicates in a given set of records, this library avoids matching each element with every other element, and reduces the complexity which otherwise would be O(N^2)
To reduce the complexity, the similarity match algorithm's chosen are assumed to have an equivalence property. Where if a name like "Stephen" matches with "Steven" with a score of 1.0, the reverse match is assumed, and the library does not explicitly runs those matches
The library further reduces the complexity by not performing matches against the elements which have a very low
probability to match by creating "Search Groups". Take an example of list of names to match ["steve","parker","stephen"]
These names are broken down into tri-grams like this
steve -> [ste,tev,eve]
parker -> [par,ark,rke,ker]
stephen -> [ste,tep,eph,phe,hen]
Here only the 1st and 3rd names have tri-grams "ste" in common (and a search group is created for them.)
The match algorithm assumes a very low probability that "parker" will match with the other 2, and hence no match is attempted with it.
The following chart shows the performance characteristics of this library as the number of elements increase. As you can see the library maintains a near-linear performance and can match thousands of elements within seconds on a multi-core processor.
You need Java SDK v1.8 or higher. Before you begin, you should check your current Java installation by using the following command:
java -version
fuzzy-match is compatible with Apache Maven 4.0 or above. If you do not already have Maven installed, you can follow the instructions at maven.apache.org.
On many operating systems, Maven can be installed with a package manager.
If you use OSX Homebrew, try brew install maven.
Ubuntu users can run sudo apt-get install maven.
Windows users with Chocolatey can run choco install maven from an elevated (administrator) prompt.
After cloning the project locally. Run this command to compile, test and install the project
mvn clean install
The library is pusblished to maven central
<dependency>
<groupId>com.intuit.fuzzymatcher</groupId>
<artifactId>fuzzy-matcher</artifactId>
<version>0.4.2</version>
</dependency>
This library take a collection of Document object with various Element as input.
For example, if you have a User object in your system, you can easily convert it to a Document with this simple builder pattern provided
new Document.Builder(User.getId())
.addElement(new Element.Builder().setType(TEXT).setValue(User.getName()).createElement())
.addElement(new Element.Builder().setType(ADDRESS).setValue(User.getAddress()).createElement())
.addElement(new Element.Builder().setType(PHONE).setValue(User.getPhone()).createElement())
.addElement(new Element.Builder().setType(EMAIL).setValue(User.getEmail()).createElement())
.createDocument();
The Element Types like "TEXT", "ADDRESS", "PHONE", "EMAIL" are just simple ways to define custom stages (or functions) to be applied at different stages of match.
Below is the list of Element Types available in the library with default PreProcessing Function, Tokenizer Function and Scoring Function.
Element Type | PreProcessing Function | Tokenizer Function | Scoring Function |
---|---|---|---|
NAME | namePreprocessing() | wordTokenizer() | soundex() |
TEXT | removeSpecialChars() | wordTokenizer() | soundex() |
ADDRESS | addressPreprocessing() | wordTokenizer() | soundex() |
removeDomain() | nGramTokenizer() | equality() | |
PHONE | numericValue() | valueTokenizer() | phoneNumber() |
NUMBER | numberPreprocessing() | valueTokenizer() | numberDifferenceRate() |
DATE | none() | valueTokenizer() | dateDifferenceWithinYear() |
Note: Since each element is unique in the way it should match, if you have a need to match a different element type than what is supported, please open a new GitHub Issue and the community will provide support and enhancement to this library
The entry point for running this program is through MatchService class. Create a new instance of Match service.
MatchService matchService = new MatchService();
It support 3 ways to match the documents
- Match a list of Documents : This is useful if you have an existing list of document, and want to find out which of them might have potential duplicates. A typical de-dup use case
matchService.applyMatch(List<Document> documents)
- Match a list of Document with Existing List : This is useful for matching a new list of document with an existing list in your system. For example if your are performing a bulk import and want to find out if any of them match with existing data
matchService.applyMatch(List<Document> documents, List<Document> matchWith)
- Match a Document with Existing List : This is useful when a new document is being created and you want to ensure that a similar document does not already exist in your system
matchService.applyMatch(Document document, List<Document> matchWith)
The response of the library is essentially a Match<Document>
object. It has 3 attributes
- Data : This is the source Document on which the match is applied
- MatchedWith : This is the target Document that the data matched with
- Result : This is the probability score between 0.0 - 1.0 indicating how similar the 2 documents are
The response is grouped by the Data attribute, so from any of the MatchService methods the response is map
Map<Document, List<Match<Document>>>
This library allows configuration at various level. These are all the configurable elements allowed
All the 4 Stages defined above are configurable java functions passed into the library. Although there is a rich set of predefined functions, they can easily be overridden or applied in addition of the exiting functions.
For example if an ElementType is defined as a TEXT the removeSpecialChars
which is a simple transform
function is applied. This is a Function<String, String>
that converts a String to another String
If you want change this and say in "Name" field remove any suffix like "jr.", "sr.", etc
This can be over-ridden when creating an Element using the builder pattern
new Element.Builder().setType(TEXT).setValue(user.getName())
.setPreProcessingFunction(str -> str.replace("jr.", ""))
.createElement())
Using the function composition, you could also add more functionality to the existing function
new Element.Builder().setType(TEXT).setValue(user.getName())
.setPreProcessingFunction(removeSpecialChars().compose(str -> str.replace("jr.", ""))
.createElement())
This allows you to change the entire behavior of how matches are applied at all the 4 stages
- Key : Required field indicating unique primary key of the document
- Elements : Set of elements for each document
- Threshold : A double value between 0.0 - 1.0 above which the document be considered as match.
- ScoringFunction : The ScoringFunction used to aggregate individual Element scores to a Document score
- Value : String representation of the value to match
- Type : These are predefined elements, which applies relevant functions for "PreProcessing", "Tokenization" and "SimilarityMatch"
- Variance: (Optional) To differentiate same element types in a document. eg. a document containing 2 NAME element one for "user" and one for "spouse"
- Threshold : A double value between 0.0 - 1.0 above which the element be considered as match.
- Weight : A value applied to an element to increase or decrease the document score. The default is 1.0, any value above that will increase the document score if that element is matched.
- PreProcessingFunction : The PreProcessingFunction function to be used
- TokenizerFunction : The TokenizerFunction to be used
- SimilarityMatchFunction : The SimilarityMatchFunction to be used
- ScoringFunction : The ScoringFunction used to aggregate individual Token scores to an Element score
To run a simple match. Consider this example of names (in file src/test/resources/demo.csv)
Stephen Wilkson
John Pierre
Wilson, Stephen
Pierre john
Stephen Kilsman wilkson
Run this maven test for the above set of example
mvn -Dtest=com.intuit.fuzzymatcher.component.MatchServiceTest#itShouldApplyMatchForDemo test
You should see a response like this printed on your console
Data: {[{'Stephen Wilkson'}]} Matched With: {[{'Stephen Kilsman wilkson'}]} Score: 0.6666666666666666
Data: {[{'John Pierre'}]} Matched With: {[{'Pierre john'}]} Score: 1.0
Data: {[{'Pierre john'}]} Matched With: {[{'John Pierre'}]} Score: 1.0
Data: {[{'Stephen Kilsman wilkson'}]} Matched With: {[{'Stephen Wilkson'}]} Score: 0.6666666666666666
Change the input and see the results change .... Happy Matching!!!