Graph duplicate node cleansing tool #17885

elasticmachine · 2017-02-10T17:32:37Z

Original comment by @markharwood:

In datasets like the Panama Papers, noisy duplicate data is a recurring problem and a major pain.
Consider the near-duplicate names in this real example:
!LINK REDACTED

To assist end users, a simple Levenshtein edit distance over the labels typically used in a graph can suggest candidates for grouping. This process would run at the click of a new "link similar" button. The suggestions can be added as dotted links between related vertices, which also has the effect of pulling those vertices closer to each other in the diagram. The end user could act on the suggested links by using the existing tools to select and group vertices, or hit the undo button to remove the suggestions.
I had this implementation working to good effect in a demo using SwissLeaks data (a precursor to the Panama Papers).
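
For illustration only, the "link similar" suggestion step could be sketched roughly as below. This is not the demo implementation mentioned above: the function names, the whitespace/case normalisation, and the `max_ratio` threshold are all assumptions, and the labels in the usage note are made up.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def suggest_soft_links(labels, max_ratio=0.2):
    """Return (label_a, label_b, distance) for pairs that look like near-duplicates.

    max_ratio is a hypothetical cut-off: edit distance divided by the length
    of the longer normalised label.
    """
    # Assumed normalisation: lower-case and collapse runs of whitespace.
    normalised = {label: " ".join(label.lower().split()) for label in labels}
    suggestions = []
    for a, b in combinations(labels, 2):
        na, nb = normalised[a], normalised[b]
        longest = max(len(na), len(nb))
        if longest == 0:
            continue
        distance = levenshtein(na, nb)
        if distance / longest <= max_ratio:
            suggestions.append((a, b, distance))
    return suggestions
```

Calling `suggest_soft_links(["Acme Holdings Ltd", "ACME Holdings Ltd.", "Borealis Trading SA"])` would pair the two "Acme" labels and leave the third alone. Each returned pair would become one dotted link in the workspace. The pairwise comparison is quadratic in the number of labels, which seems acceptable for the handful of vertices typically shown in a workspace.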

elasticmachine · 2017-02-10T17:32:40Z

Original comment by @markharwood:

A similar requirement is to use the text labels of selected nodes as a tokenized query to match similar nodes not currently in the workspace. Using index patterns that span more than one index, I have used this feature to connect people/companies/addresses in the Panama Papers to similar entities in an OFAC sanctions list. This provides a tool for linking entities from different datasets. Ideally, any grouping actions the user takes to merge entities visually could optionally be preserved as an "alias" definition that the UI could use as a reference, to benefit other users or repeat visits to the same datasets.
!LINK REDACTED
By using named "more like this"-type queries for the labels of selected nodes, we can find the most similar document (using a negative boost for existing node terms to avoid matching what we already have). The best-matching doc provides similar new node terms to add to the workspace, and the named queries show which nodes caused the match, so we can add lines connecting the new node to them. Parsing the explain output also tells us how strongly each query clause matched, so this can be shown as similarity strength in the line thickness. A dotted line is used to emphasise the difference between a hard link (panamapapers entity 1214773 is connected to panamapapers entity 10076089) and a soft link (the label of panamapapers entity 1214773 is strikingly similar to ofac entity 10725).
Of course, this technique is also useful for spotting similar nodes within one dataset, e.g. where the Panama Papers folks missed a link (there are many of these!), but clearly a big benefit of soft linking is spanning datasets that were produced independently and share no common "hard" ids, like the OFAC and Panama Papers data.
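
As a rough sketch of the request shape described above, the snippet below builds one named clause per selected node and demotes documents that carry terms already in the workspace. It swaps in plain `match` clauses wrapped in a `boosting` query rather than the "more like this" queries and negative boost used in the original prototype. The field names (`label`, `label.keyword`), the `negative_boost` value, and both function names are assumptions; only the `_name`, `boosting` and `matched_queries` parts are standard Elasticsearch query DSL and response fields.

```python
def build_soft_link_query(selected, existing_terms, label_field="label", negative_boost=0.2):
    """Build a search body that looks for one document similar to the selected nodes.

    selected maps a workspace node id to its text label. Each label becomes a
    named match clause so the response can report which node caused the match.
    """
    should = [
        {"match": {label_field: {"query": label, "_name": node_id}}}
        for node_id, label in selected.items()
    ]
    return {
        "size": 1,  # only the best-matching document is wanted
        "query": {
            "boosting": {
                "positive": {"bool": {"should": should, "minimum_should_match": 1}},
                # Assumed keyword sub-field holding the exact node terms.
                "negative": {"terms": {label_field + ".keyword": sorted(existing_terms)}},
                "negative_boost": negative_boost,
            }
        },
    }


def soft_links_from_hit(hit):
    """Turn one search hit into (source_node_id, new_doc_id) pairs for dotted lines."""
    return [(name, hit["_id"]) for name in hit.get("matched_queries", [])]
```

The body returned by `build_soft_link_query` would be sent to the search endpoint of an index pattern spanning both datasets. Deriving line thickness from the explain output is left out here, since it depends on the explain structure returned for the chosen clause types.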

elasticmachine · 2021-09-03T14:39:59Z

Pinging @elastic/kibana-data-discovery (Team:DataDiscovery)

elasticmachine · 2022-11-04T08:33:46Z

Pinging @elastic/kibana-visualizations @elastic/kibana-visualizations-external (Team:Visualizations)

timductive · 2024-03-28T19:58:24Z

Closing this because it's not planned to be resolved in the foreseeable future. It will be tracked in our Icebox and will be re-opened if our priorities change. Feel free to re-open if you think it should be melted sooner.

elasticmachine added the Feature:Graph label on Apr 24, 2018

timroes added the Feature:Visualizations label on Aug 8, 2018

timroes added the Team:Visualizations label and removed the Feature:Visualizations label on Sep 16, 2018

timroes added the Team:DataDiscovery label and removed the Team:Visualizations label on Sep 3, 2021

stratoula added the Team:Visualizations label and removed the Team:DataDiscovery label on Nov 4, 2022

timductive mentioned this issue Mar 28, 2024

[Icebox] Graph Improvements #179668

Open

timductive closed this as not planned on Mar 28, 2024
