# dedupe The `dedupe` processor is used to dedupe an array of [DataEntities](https://terascope.github.io/teraslice/docs/packages/utils/api/classes/dataentity) or an array of [DataWindows](../entity/data-window.md) by a given field. If no field is configured then it will attempt to dedupe based off the `_key` metadata property. This processor can also track dates of duplicate records so that the resulting unique record has either the `oldest` or `newest` date for the date field based on the `adjust_time` parameter. ## Usage ### Dedupe records based on a field Example of a job using the `dedupe` processor ```json { "name" : "testing", "workers" : 1, "slicers" : 1, "lifecycle" : "once", "assets" : [ "standard" ], "operations" : [ { "_op": "test-reader" }, { "_op": "dedupe", "field": "name" } ] } ``` Output from example job ```javascript const data = [ { id: 1, name: 'roy' }, { id: 2, name: 'roy' }, { id: 2, name: 'bob' }, { id: 2, name: 'roy' }, { id: 3, name: 'bob' }, { id: 3, name: 'mel' } ] const results = await processor.run(data); results === [ { id: 1, name: 'roy' }, { id: 2, name: 'bob' }, { id: 3, name: 'mel' } ]; ``` ### Dedupe records based on the _key metadata Example of a job using the `_key` in the metadata ```json { "name" : "testing", "workers" : 1, "slicers" : 1, "lifecycle" : "once", "assets" : [ "standard" ], "operations" : [ { "_op": "test-reader" }, { "_op": "dedupe" } ] } ``` Output from example job ```javascript const data = [ DataEntity.make({ id: 1, name: 'roy' }, { _key: 1 }), DataEntity.make({ id: 2, name: 'roy' }, { _key: 2 }), DataEntity.make({ id: 2, name: 'bob' }, { _key: 2 }), DataEntity.make({ id: 2, name: 'roy' }, { _key: 2 }), DataEntity.make({ id: 3, name: 'bob' }, { _key: 3 }), DataEntity.make({ id: 3, name: 'mel' }, { _key: 3 }), ]; const results = await processor.run(data); results === [ { id: 1, name: 'roy' }, { id: 2, name: 'roy' }, { id: 3, name: 'bob' } ]; ``` ### Dedupe records and track time Example of a job using the `dedupe` processor and tracking the `oldest` date of the `first_seen` field as well as the `newest` date of the `last_seen` field. ```json { "name" : "testing", "workers" : 1, "slicers" : 1, "lifecycle" : "once", "assets" : [ "standard" ], "operations" : [ { "_op": "test-reader" }, { "_op": "dedupe", "field": "name", "adjust_time": [ { "field": "first_seen", "preference": "oldest" }, { "field": "last_seen", "preference": "newest" } ] } ] } ``` Output of example job ```javascript const data = [ { id: 1, name: 'roy', first_seen: '2019-05-07T20:01:00.000Z', last_seen: '2019-05-07T20:01:00.000Z' }, { id: 1, name: 'roy', first_seen: '2019-05-07T20:02:00.000Z', last_seen: '2019-05-07T20:02:00.000Z' }, { id: 1, name: 'roy', first_seen: '2019-05-07T20:04:00.000Z', last_seen: '2019-05-07T20:04:00.000Z' }, { id: 2, name: 'bob', first_seen: '2019-05-07T20:02:00.000Z', last_seen: '2019-05-07T20:02:00.000Z' }, { id: 1, name: 'roy', first_seen: '2019-05-07T20:10:00.000Z', last_seen: '2019-05-07T20:10:00.000Z' }, { id: 2, name: 'bob', first_seen: '2019-05-07T20:04:00.000Z', last_seen: '2019-05-07T20:04:00.000Z' }, { id: 3, name: 'mel', first_seen: '2019-05-07T20:04:00.000Z', last_seen: '2019-05-07T20:04:00.000Z' }, { id: 1, name: 'roy', first_seen: '2019-05-07T19:02:00.000Z', last_seen: '2019-05-07T19:02:00.000Z' }, { id: 1, name: 'roy', first_seen: '2019-05-07T20:08:00.000Z', last_seen: '2019-05-07T20:08:00.000Z' }, { id: 2, name: 'bob', first_seen: '2019-05-07T20:08:00.000Z', last_seen: '2019-05-07T20:08:00.000Z' }, { id: 3, name: 'mel', first_seen: '2019-05-07T20:01:00.000Z', last_seen: '2019-05-07T20:01:00.000Z' } ]; const results = await processor.run(data); results === [ { id: 1, name: 'roy', first_seen: '2019-05-07T19:02:00.000Z', last_seen: '2019-05-07T20:10:00.000Z' }, { id: 2, name: 'bob', first_seen: '2019-05-07T20:02:00.000Z', last_seen: '2019-05-07T20:08:00.000Z' }, { id: 3, name: 'mel', first_seen: '2019-05-07T20:01:00.000Z', last_seen: '2019-05-07T20:04:00.000Z' } ] ``` ## Parameters | Configuration | Description | Type | Notes | | --------- | -------- | ------ | ------ | | _op | Name of operation, it must reflect the exact name of the file | String | required | | field | field to dedupe records on | String | optional, defaults to `_key` metadata value | | adjust_time | Requires an array of objects with `field` and `preference` properties. Preference should be set to `oldest` or `newest`. | Array of Objects | optional, defaults to [] |