Similarity-join

This package contains modules for doing (in-memory) similarity joins (i.e., approximate string matching between lists).

A similarity join finds similar records across lists. Given two lists of strings (called records) R and S, a similarity join finds, for each record in R, all records in S that are similar to it. Similarity is defined by a similarity or distance measure, and the right choice of measure depends on the use case.
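To make the definition concrete, here is a minimal brute-force sketch. It is not this package's API: the function names `naive_similarity_join` and `jaccard`, the word-level Jaccard measure, and the threshold value are all assumptions chosen for illustration.

```python
from typing import Callable, Dict, List


def naive_similarity_join(
    R: List[str],
    S: List[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> Dict[str, List[str]]:
    """For each record in R, collect every record in S whose similarity
    score is at least `threshold`. Brute force: |R| * |S| comparisons."""
    return {r: [s for s in S if similarity(r, s) >= threshold] for r in R}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased, whitespace-separated words
    (a hypothetical measure, picked only to keep the example short)."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 1.0


matches = naive_similarity_join(
    ["new york city", "los angeles"],
    ["New York", "city of new york", "San Francisco"],
    jaccard,
    threshold=0.5,
)
print(matches)
```

Comparing every pair is only practical for small lists, which is why dedicated join algorithms such as the ones in this package exist.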

The package currently contains two modules.

cosinejoin

This module does approximate string matching using cosine similarity as the similarity measure. The module comes with an option to approximate the results, which greatly reduces its time and memory footprint.
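As an illustration of the measure itself, the sketch below computes cosine similarity between two strings represented as character trigram counts. The tokenization (trigrams), the weighting (raw counts), and the names `ngrams` and `cosine_similarity` are assumptions for this example and may differ from what cosinejoin does internally.

```python
import math
from collections import Counter


def ngrams(text: str, n: int = 3) -> Counter:
    """Count the overlapping character n-grams of the lowercased text.
    Strings shorter than n yield a single (shorter) gram."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(max(len(text) - n + 1, 1)))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


score = cosine_similarity(ngrams("similarity join"), ngrams("similarty join"))
print(round(score, 3))
```

Identical strings score 1, strings that share no trigram score 0, and a small typo only removes the few trigrams that overlap it, so the score stays high.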

The module creates an intermediate representation of the dataset that can become quite large. If your data is too big for this module, a cosine join can also be implemented in SQL [2].

triejoin

This module allows for joins between two sets of strings subject to a similarity constraint (edit distance). The implemented algorithms are inspired by [1].
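The sketch below shows what an edit distance constraint means for a join. It is a brute-force baseline that compares every pair, not the trie-based approach of [1] and not this module's API. The names `edit_distance` and `edit_distance_join` and the threshold parameter `tau` are hypothetical.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def edit_distance_join(R: List[str], S: List[str], tau: int) -> List[Tuple[str, str]]:
    """Return all pairs (r, s) with edit distance at most tau.
    The length check is a cheap filter: strings whose lengths differ
    by more than tau cannot be within tau edits of each other."""
    return [
        (r, s)
        for r in R
        for s in S
        if abs(len(r) - len(s)) <= tau and edit_distance(r, s) <= tau
    ]


pairs = edit_distance_join(["vldb", "icde"], ["pvldb", "sigmod", "icdm"], tau=1)
print(pairs)
```

This baseline performs up to |R| * |S| distance computations, which is the cost that more specialized join algorithms aim to avoid.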

Please refer to my site for the documentation.

Bibliography

[1] - Trie-Join: Efficient Trie-based String Similarity Joins with Edit Distance Constraints. Jiannan Wang, Jianhua Feng, Guoliang Li.

[2] - Text joins in an RDBMS for web data integration. Gravano, L., Ipeirotis, P. G., Koudas, N., & Srivastava, D.
