A Python script to detect duplicate documents in Elasticsearch.
The script creates a file containing a delete action for each duplicate document that is detected, which looks like this (a sketch of how such a file can be produced follows the examples below):
- A single index is provided:
{"delete": {"_type": "doc", "_id": "1", "_index": "test_index"}}
{"delete": {"_type": "doc", "_id": "2", "_index": "test_index"}}
{"delete": {"_type": "doc", "_id": "3", "_index": "test_index"}}
- An index pattern or alias is provided (e.g. test_index_*):
{"delete": {"_type": "doc", "_id": "1", "_index": "test_index_a"}}
{"delete": {"_type": "doc", "_id": "2", "_index": "test_index_b"}}
{"delete": {"_type": "doc", "_id": "3", "_index": "test_index_c"}}
Once all duplicates are found, you can use the bulk API to perform all of the delete operations in a single API call, either by passing the file to curl as input or by attaching the file contents to the request body. The curl command for the first option would be the following:
curl -s -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/_bulk' --data-binary "@bulk_deletions_file.txt"
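For the second option, the delete actions can be sent from Python by attaching the file contents to the body of a single _bulk request. A minimal sketch using the requests library (the host and file name are assumptions):

# Sketch: POST the newline-delimited delete actions as the body of one _bulk request.
import requests

with open("bulk_deletions_file.txt", "rb") as f:
    body = f.read()

response = requests.post(
    "http://localhost:9200/_bulk",
    headers={"Content-Type": "application/x-ndjson"},
    data=body,
)
response.raise_for_status()
print(response.json().get("errors"))  # False when every delete succeeded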
$ python deduplicate-elasticsearch/deduplicate-elaticsearch.py -h
usage: deduplicate-elaticsearch.py [-h] [-es ES_HOST] -i INDEX -k KEYS
                                   [KEYS ...]

optional arguments:
  -h, --help            show this help message and exit
  -es ES_HOST, --es_host ES_HOST
                        Elasticsearch host
  -i INDEX, --index INDEX
                        <Required> Index name or alias to search on
  -k KEYS [KEYS ...], --keys KEYS [KEYS ...]
                        <Required> List of fields that will determine
                        duplicate docs
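The help output above corresponds to an argparse configuration along these lines (a sketch; the default value and exact wording are assumptions rather than the script's code):

# Sketch of an argument parser that produces the help text shown above.
import argparse

parser = argparse.ArgumentParser(prog="deduplicate-elaticsearch.py")
parser.add_argument("-es", "--es_host", default="localhost:9200",
                    help="Elasticsearch host")
parser.add_argument("-i", "--index", required=True,
                    help="<Required> Index name or alias to search on")
parser.add_argument("-k", "--keys", nargs="+", required=True,
                    help="<Required> List of fields that will determine duplicate docs")
args = parser.parse_args()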
For a full description of how this script works, including an analysis of the memory requirements, see: https://alexmarquardt.com/2018/07/23/deduplicating-documents-in-elasticsearch/