Skip to content

A java library to determine probability of objects being similar.

License

Notifications You must be signed in to change notification settings

vishnitin/fuzzy-matcher

 
 

Repository files navigation

Fuzzy-Matcher

Introduction

A java based library to match and group "similar" elements in a collection of document.

In system with a collection of contacts. If we wanted to match and categorize contacts with similar names, location where they live or with any other attribute. This matching algorithm helps to achieve it using Fuzzy match. In fact, you may even try to find out if you have already added duplicate contacts, or you can use this to prevent your system from adding one.

This library can act on any domain object like contact and find similarity for various use cases. It dives deep into each character and finds out the probability that 2 or more of these objects are similar.

How does this work

Fuzzy Match

Four Stages of Fuzzy Match

The input provided goes through 4 stages to come up with a match result. These 4 stages are functions defined which can be easily configured by passing a lambda expression.

  • Pre-Processing : Expects a Function<String, String> which a simple transformation from a String to another.

    • Trim: Removes leading and trailing spaces (applied by default)
    • Lower Case: Converts all characters to lower case (applied by default)
    • Remove Special Chars : Removes all characters except alphabets and numeric chars and space. (default for TEXT type)
    • Numeric: Strips all non-numeric characters. Useful for numeric values like phone or ssn (default for PHONE type)
    • Email: Strips away domain from an email. This prevents common domains like gmail.com, yahoo.com to be considered in match (default for EMAIL type)
  • Tokenization : Expects a Function<Element, Stream<Token>> which will break down each Element into Token object used for comparison.

    • Word : Breaks down an element into words (anything delimited by space " "). e.g. "123 some street" -> ["123" "some" "street"]
    • N-Gram : Breaks down an element into 3 letter grams. e.g. "jparker" -> ["jpa","par","ark","rke","ker"]
  • Similarity Match : Expects a BiFunction<Token, Token, Double>, which gives a match probability (0.0 to 1.0) between 2 tokens

    • Soundex: Compares 2 token strings soundex values. Uses apache commons codec library to get soundex values. The result of this match is a binary either 0.0 or 1.0
    • Equality: Does a string equals of 2 token strings. Again the result is either 0.0 or 1.0
    • Levenshtein: Gets the Levenshtein distance score using apache commons similarity library
    • Jaccard: Gets the Jaccard score using apache commons similarity library
  • Scoring : Expects a BiFunction<Match, List<Score>, Double>, this defines functions on how to accumulate scores from Tokens into Elements and from Elements into Documents.

    • Simple Average: Adds up total scores of each child matches / total children. This is the default scoring for Elements
    • Weighted Average: This is useful for Document Scoring, where users can input weights on elements. Example a phone number or email could be considered an important element to identify match between 2 User objects, and we can add weights to such elements.
    • Exponential Average: Again useful for Document Scoring, where if more than 1 element match, we can increase the scoring exponentially
    • Exponential Weighted Average: Uses both an exponents and weights for scoring. This is the default for Document Scoring

Performance

Since this library can be used to match elements against a large set of records, knowing how it performs is essential.

This library makes use of Java 8 Stream to run all operations in parallel and makes optimum use of a multi-core cpu. Beyond finding duplicates in a given set of records, this library avoids matching each element with every other element, and reduces the complexity which otherwise would be O(N^2)

Reducing Complexity to O(N Log N)

To reduce the complexity, the similarity match algorithm's chosen are assumed to have an equivalence property. Where if a name like "Stephen" matches with "Steven" with a score of 1.0, the reverse match is assumed, and the library does not explicitly runs those matches

Search Groups

The library further reduces the complexity by not performing matches against the elements which have a very low probability to match by creating "Search Groups". Take an example of list of names to match ["steve","parker","stephen"] These names are broken down into tri-grams like this

steve -> [ste,tev,eve]
parker -> [par,ark,rke,ker]
stephen -> [ste,tep,eph,phe,hen] 

Here only the 1st and 3rd names have tri-grams "ste" in common (and a search group is created for them.)
The match algorithm assumes a very low probability that "parker" will match with the other 2, and hence no match is attempted with it.

The following chart shows the performance characteristics of this library as the number of elements increase. As you can see the library maintains a near-linear performance and can match thousands of elements within seconds on a multi-core processor.

Perf

Building the Library

Prerequisite

You need Java SDK v1.8 or higher. Before you begin, you should check your current Java installation by using the following command: java -version

fuzzy-match is compatible with Apache Maven 4.0 or above. If you do not already have Maven installed, you can follow the instructions at maven.apache.org.

On many operating systems, Maven can be installed with a package manager.
If you use OSX Homebrew, try brew install maven.
Ubuntu users can run sudo apt-get install maven.
Windows users with Chocolatey can run choco install maven from an elevated (administrator) prompt.

Compiling and installing locally

After cloning the project locally. Run this command to compile, test and install the project

mvn clean install

Using the Library

Maven Import

The library is pusblished to maven central

<dependency>
    <groupId>com.intuit.fuzzymatcher</groupId>
    <artifactId>fuzzy-matcher</artifactId>
    <version>0.4.2</version>
</dependency>

Input

This library take a collection of Document object with various Element as input.

For example, if you have a User object in your system, you can easily convert it to a Document with this simple builder pattern provided

new Document.Builder(User.getId())
    .addElement(new Element.Builder().setType(TEXT).setValue(User.getName()).createElement())
    .addElement(new Element.Builder().setType(ADDRESS).setValue(User.getAddress()).createElement())
    .addElement(new Element.Builder().setType(PHONE).setValue(User.getPhone()).createElement())
    .addElement(new Element.Builder().setType(EMAIL).setValue(User.getEmail()).createElement())
    .createDocument();

The Element Types like "TEXT", "ADDRESS", "PHONE", "EMAIL" are just simple ways to define custom stages (or functions) to be applied at different stages of match.

Below is the list of Element Types available in the library with default PreProcessing Function, Tokenizer Function and Scoring Function.

Element Type PreProcessing Function Tokenizer Function Scoring Function
NAME namePreprocessing() wordTokenizer() soundex()
TEXT removeSpecialChars() wordTokenizer() soundex()
ADDRESS addressPreprocessing() wordTokenizer() soundex()
EMAIL removeDomain() nGramTokenizer() equality()
PHONE numericValue() valueTokenizer() phoneNumber()
NUMBER numberPreprocessing() valueTokenizer() numberDifferenceRate()
DATE none() valueTokenizer() dateDifferenceWithinYear()

Note: Since each element is unique in the way it should match, if you have a need to match a different element type than what is supported, please open a new GitHub Issue and the community will provide support and enhancement to this library

Applying the Match

The entry point for running this program is through MatchService class. Create a new instance of Match service.

MatchService matchService = new MatchService();

It support 3 ways to match the documents

  • Match a list of Documents : This is useful if you have an existing list of document, and want to find out which of them might have potential duplicates. A typical de-dup use case
matchService.applyMatch(List<Document> documents)
  • Match a list of Document with Existing List : This is useful for matching a new list of document with an existing list in your system. For example if your are performing a bulk import and want to find out if any of them match with existing data
matchService.applyMatch(List<Document> documents, List<Document> matchWith)
  • Match a Document with Existing List : This is useful when a new document is being created and you want to ensure that a similar document does not already exist in your system
matchService.applyMatch(Document document, List<Document> matchWith)

Output

The response of the library is essentially a Match<Document> object. It has 3 attributes

  • Data : This is the source Document on which the match is applied
  • MatchedWith : This is the target Document that the data matched with
  • Result : This is the probability score between 0.0 - 1.0 indicating how similar the 2 documents are

The response is grouped by the Data attribute, so from any of the MatchService methods the response is map

Map<Document, List<Match<Document>>>

End User Configuration

This library allows configuration at various level. These are all the configurable elements allowed

Configurable Functions

All the 4 Stages defined above are configurable java functions passed into the library. Although there is a rich set of predefined functions, they can easily be overridden or applied in addition of the exiting functions.

For example if an ElementType is defined as a TEXT the removeSpecialChars which is a simple transform function is applied. This is a Function<String, String> that converts a String to another String

If you want change this and say in "Name" field remove any suffix like "jr.", "sr.", etc

This can be over-ridden when creating an Element using the builder pattern

new Element.Builder().setType(TEXT).setValue(user.getName())
                            .setPreProcessingFunction(str -> str.replace("jr.", ""))
                            .createElement())

Using the function composition, you could also add more functionality to the existing function

new Element.Builder().setType(TEXT).setValue(user.getName())
                            .setPreProcessingFunction(removeSpecialChars().compose(str -> str.replace("jr.", ""))
                            .createElement())

This allows you to change the entire behavior of how matches are applied at all the 4 stages

Document Configuration

  • Key : Required field indicating unique primary key of the document
  • Elements : Set of elements for each document
  • Threshold : A double value between 0.0 - 1.0 above which the document be considered as match.
  • ScoringFunction : The ScoringFunction used to aggregate individual Element scores to a Document score

Element Configuration

  • Value : String representation of the value to match
  • Type : These are predefined elements, which applies relevant functions for "PreProcessing", "Tokenization" and "SimilarityMatch"
  • Variance: (Optional) To differentiate same element types in a document. eg. a document containing 2 NAME element one for "user" and one for "spouse"
  • Threshold : A double value between 0.0 - 1.0 above which the element be considered as match.
  • Weight : A value applied to an element to increase or decrease the document score. The default is 1.0, any value above that will increase the document score if that element is matched.
  • PreProcessingFunction : The PreProcessingFunction function to be used
  • TokenizerFunction : The TokenizerFunction to be used
  • SimilarityMatchFunction : The SimilarityMatchFunction to be used
  • ScoringFunction : The ScoringFunction used to aggregate individual Token scores to an Element score

Demo

To run a simple match. Consider this example of names (in file src/test/resources/demo.csv)

Stephen Wilkson
John Pierre
Wilson, Stephen
Pierre john
Stephen Kilsman wilkson

Run this maven test for the above set of example

mvn -Dtest=com.intuit.fuzzymatcher.component.MatchServiceTest#itShouldApplyMatchForDemo test

You should see a response like this printed on your console

Data: {[{'Stephen Wilkson'}]} Matched With: {[{'Stephen Kilsman wilkson'}]} Score: 0.6666666666666666
Data: {[{'John Pierre'}]} Matched With: {[{'Pierre john'}]} Score: 1.0
Data: {[{'Pierre john'}]} Matched With: {[{'John Pierre'}]} Score: 1.0
Data: {[{'Stephen Kilsman wilkson'}]} Matched With: {[{'Stephen Wilkson'}]} Score: 0.6666666666666666

Change the input and see the results change .... Happy Matching!!!

About

A java library to determine probability of objects being similar.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Java 100.0%