fritshermans/pyminhash

PyMinHash

MinHashing is an efficient way of finding similar records in a dataset based on Jaccard similarity. PyMinHash implements minhashing for pandas DataFrames. See the instructions below or look at the example notebook to get started.
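For intuition, the sketch below computes the exact Jaccard similarity of two token sets and a minhash estimate of it. It is a plain-Python illustration and not PyMinHash's implementation: whitespace tokenization, the salted SHA-256 hash, and the helper names tokens, jaccard, minhash_signature and estimate_similarity are all assumptions made for this example.

```python
import hashlib


def tokens(text):
    """Split a string into a set of lowercase word tokens (assumed tokenization)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Exact Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def _hash(token, seed):
    """Deterministic 64-bit hash of a token, salted with a seed."""
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(token_set, n_hashes=200):
    """Keep the minimum hash value per seeded hash function (non-empty sets only)."""
    return [min(_hash(t, seed) for t in token_set) for seed in range(n_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions on which two records agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = tokens("john smith amsterdam")
b = tokens("john smith utrecht")
exact = jaccard(a, b)  # 2 shared tokens out of 4 distinct tokens
approx = estimate_similarity(minhash_signature(a), minhash_signature(b))
print(exact, approx)
```

The probability that two sets share the same minimum hash equals their Jaccard similarity, so the estimate gets closer to the exact value as the number of hash functions grows.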

Developed by Frits Hermans

Documentation

Documentation can be found here

Installation

Normal installation

Using PyPI

pip install pyminhash

Using conda

conda install -c conda-forge pyminhash

Install to contribute

Clone this GitHub repo and install in editable mode:

python -m pip install -e ".[dev]"
python setup.py develop

Usage

To apply record matching to the 'name' column of your pandas DataFrame df:

from pyminhash import MinHash

myHasher = MinHash(n_hash_tables=10)
myHasher.fit_predict(df, 'name')

This will return the row pairs from df that have non-zero Jaccard similarity.
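To make "row pairs with non-zero Jaccard similarity" concrete, here is a brute-force sketch that compares every pair of rows in a small list of names. It does not use PyMinHash and says nothing about the exact output format of fit_predict. The sample names, the word-level tokenization, and the helpers word_set and nonzero_pairs are hypothetical.

```python
from itertools import combinations

# Hypothetical sample records standing in for a 'name' column.
names = ["john smith", "jon smith", "mary jones", "smith john", "alice brown"]


def word_set(text):
    """Lowercase word tokens of a record (assumed tokenization)."""
    return set(text.lower().split())


def nonzero_pairs(records):
    """Return (row_i, row_j, similarity) for every pair sharing at least one token."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        set_a, set_b = word_set(a), word_set(b)
        union = set_a | set_b
        similarity = len(set_a & set_b) / len(union) if union else 0.0
        if similarity > 0:
            pairs.append((i, j, similarity))
    return pairs


pairs = nonzero_pairs(names)
for i, j, similarity in pairs:
    print(names[i], "|", names[j], "|", round(similarity, 2))
```

This all-pairs comparison grows quadratically with the number of rows. That cost is what minhashing is meant to avoid on larger datasets.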