Graph duplicate node cleansing tool #17885

Closed
Tracked by #179668
elasticmachine opened this issue Feb 10, 2017 · 4 comments
Labels
Feature:Graph Graph application feature Team:Visualizations Visualization editors, elastic-charts and infrastructure

Comments

@elasticmachine
Contributor

Original comment by @markharwood:

In datasets like the Panama Papers, noisy duplicate data is a recurring problem and a major pain.
Consider the near-duplicate names in this real example:
!LINK REDACTED

To assist end users, a simple Levenshtein edit distance computed over the labels typically used in a graph can suggest candidates for grouping. This process would run at the click of a new "link similar" button. The suggestions can be added as dotted links between related vertices, which also has the effect of pulling those vertices closer together in the diagram. The end user could act on the suggested links by using the existing tools to select and group vertices, or hit the undo button to remove the suggestions.
I had this implementation working to good effect in a demo using SwissLeaks data (a precursor to the Panama Papers).
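
A minimal sketch of the label-matching step, for illustration only. This is not the demo implementation mentioned above: the function names, the `max_ratio` threshold, and the sample labels are all hypothetical. It compares every pair of vertex labels and proposes a soft link when the edit distance is small relative to the longer label.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def suggest_soft_links(labels, max_ratio=0.2):
    """Return (label_a, label_b, distance) for pairs that look like near-duplicates.

    max_ratio is an assumed tuning knob: the largest allowed distance as a
    fraction of the longer label's length.
    """
    suggestions = []
    for a, b in combinations(labels, 2):
        norm_a, norm_b = a.casefold().strip(), b.casefold().strip()
        longest = max(len(norm_a), len(norm_b))
        if longest == 0:
            continue  # nothing to compare
        distance = levenshtein(norm_a, norm_b)
        if distance / longest <= max_ratio:
            suggestions.append((a, b, distance))
    return suggestions


# Hypothetical labels, not taken from any real dataset.
labels = ["Acme Holdings Ltd", "ACME Holdings Ltd.", "Zenith Trading SA"]
for a, b, distance in suggest_soft_links(labels):
    print(f"suggest dotted link: {a!r} <-> {b!r} (distance {distance})")
```

The all-pairs comparison is quadratic in the number of labels, which is probably acceptable for the vertices in a single workspace but would need some form of blocking for larger sets.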

@elasticmachine
Contributor Author

Original comment by @markharwood:

A similar requirement is to use the text labels of selected nodes as a tokenized query to match similar nodes not currently in the workspace. Using index patterns that span more than one index, I have used this feature to connect people/companies/addresses in the Panama Papers to similar entities in an OFAC sanctions list. This provides a tool for linking entities from different datasets. Ideally, any grouping actions the user takes to merge entities visually could optionally be preserved as an "alias" definition that the UI could use as a reference, to benefit other users or repeat visits to the same datasets.
!LINK REDACTED
By using named "more like this" type queries for the labels of selected nodes, we can find the most similar document (using a negative boost for existing node terms to avoid matching what we already have). The best-matching doc provides us with similar new node terms to add to the workspace, and the named queries tell us which nodes caused the match, so we can add lines connecting the new node. Parsing the explain output also helps us understand the strength of the match from each of the query clauses, so these can be used to show similarity strength as line thickness. A dotted line is used to emphasise the difference between a hard link (panamapapers entity 1214773 is connected to panamapapers entity 10076089) and a soft link (the label of panamapapers entity 1214773 is strikingly similar to ofac entity 10725).
Of course, this technique is also useful for spotting similar nodes within one dataset, e.g. where the Panama Papers folks missed a link (there are many of these!). Clearly, though, a big benefit of this soft-linking technique is spanning datasets that were produced independently and share no common "hard" IDs, like the OFAC and Panama Papers data.
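
As a rough sketch of the query shape described above, the request body could be assembled like this. It is an assumption about how the pieces fit together, not the code behind the screenshots: the `name` field, the helper names, the `negative_boost` value, and the sample label are hypothetical. It only builds a plain dict in the Elasticsearch query DSL (`more_like_this`, `boosting`, `terms`, `_name`) and reads `matched_queries` from a hit, so no client library is needed.

```python
def build_soft_link_query(selected_labels, known_terms, field="name", negative_boost=0.2):
    """Build a search body with one named more_like_this clause per selected node.

    selected_labels: {node_id: label_text} for the nodes selected in the workspace.
    known_terms: terms already in the workspace, demoted so we do not just
                 re-find what we already have. Assumes `field` holds these
                 values as indexed terms (e.g. a keyword field).
    """
    positive = {
        "bool": {
            "should": [
                {
                    "more_like_this": {
                        "fields": [field],
                        "like": label,
                        "min_term_freq": 1,  # labels are short, so do not require repeats
                        "min_doc_freq": 1,
                        "_name": node_id,    # lets the response say which node matched
                    }
                }
                for node_id, label in selected_labels.items()
            ]
        }
    }
    return {
        "size": 1,  # only the single most similar document is wanted
        "query": {
            "boosting": {
                "positive": positive,
                "negative": {"terms": {field: list(known_terms)}},
                "negative_boost": negative_boost,
            }
        },
    }


def soft_links_from_hit(hit):
    """Pair each workspace node whose named clause matched with the hit's document id."""
    return [(node_id, hit["_id"]) for node_id in hit.get("matched_queries", [])]


# Hypothetical label for the node id used in the example above.
body = build_soft_link_query({"1214773": "Example Trading Co"}, ["example trading co"])
print(body["query"]["boosting"]["positive"]["bool"]["should"][0]["more_like_this"]["_name"])
```

Each pair returned by `soft_links_from_hit` would become one dotted line in the workspace. Deriving line thickness from the explain output is not covered by this sketch.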

@elasticmachine elasticmachine added the Feature:Graph Graph application feature label Apr 24, 2018
@timroes timroes added the Feature:Visualizations Generic visualization features (in case no more specific feature label is available) label Aug 8, 2018
@timroes timroes added Team:Visualizations Visualization editors, elastic-charts and infrastructure and removed Feature:Visualizations Generic visualization features (in case no more specific feature label is available) labels Sep 16, 2018
@timroes timroes added Team:DataDiscovery Discover, search (e.g. data plugin and KQL), data views, saved searches. For ES|QL, use Team:ES|QL. and removed Team:Visualizations Visualization editors, elastic-charts and infrastructure labels Sep 3, 2021
@elasticmachine
Contributor Author

Pinging @elastic/kibana-data-discovery (Team:DataDiscovery)

@stratoula stratoula added Team:Visualizations Visualization editors, elastic-charts and infrastructure and removed Team:DataDiscovery Discover, search (e.g. data plugin and KQL), data views, saved searches. For ES|QL, use Team:ES|QL. labels Nov 4, 2022
@elasticmachine
Contributor Author

Pinging @elastic/kibana-visualizations @elastic/kibana-visualizations-external (Team:Visualizations)

@timductive
Member

Closing this because it's not planned to be resolved in the foreseeable future. It will be tracked in our Icebox and will be re-opened if our priorities change. Feel free to re-open if you think it should be melted sooner.

@timductive timductive closed this as not planned Mar 28, 2024