Conflation Aggregator #11460
Hi @nknize This sounds exciting, but I don't see how it can work in a distributed environment in a scalable way? Surely all nodes need to share all location data to get accurate results?
@clintongormley performance is certainly the primary concern. In a way this is a multi (potentially exponential) query wrapped in aggregator clothing, so it would need to be limited to small use cases. That may make the feature DOA? I'm not sure. That's why I'm super interested in soliciting feedback from everyone on ideas for providing easy-to-use tools for accomplishing these types of common use-cases.
Presumably in this "find an X near a Y" we are starting with the assumption that X and Y are both different elasticsearch documents?
@markharwood That's basically the idea. The problem with using a grid is that there are no overlaps. For example, there may be two Home Depots whose radius buffers overlap, and one Lowes falls within the criteria distance of both (i.e., falls in that Venn overlap). So the mask would probably be something along the lines of a boolean 'OR' of geo_distance filters created from step 1. Do we run the risk of cluster thrashing in this case? @colings86 pointed to SignificantLongTerms as a possible reference implementation for something like this.
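To make the "boolean 'OR' of geo_distance filters" idea concrete, here is a minimal sketch that assembles such a mask as an Elasticsearch bool query body. The field name `location` and the `10mi` distance are illustrative assumptions, not from the issue:

```python
def or_of_geo_distance_filters(primary_points, distance="10mi", field="location"):
    """Build an Elasticsearch query body that matches documents within
    `distance` of ANY of the primary points: a boolean OR of
    geo_distance filters, one per primary hit from step 1."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"geo_distance": {"distance": distance, field: pt}}
                    for pt in primary_points
                ],
                "minimum_should_match": 1,
            }
        }
    }

# e.g. two Home Depot locations returned by the primary aggregation
home_depots = [{"lat": 40.7, "lon": -74.0}, {"lat": 40.8, "lon": -73.9}]
mask = or_of_geo_distance_filters(home_depots)
```

One concern this makes visible: the mask grows linearly with the number of primary hits, which is exactly where the cluster-thrashing worry comes from.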
Sounds like 2 options for representing the Home Depot coverage from step 1. The multi-polygon in b) would not necessarily be a single-resolution grid but a set of geohashes (or similar) with variable resolution. Not sure about the SignificantLongTerms connection?
The reference to SignificantLongTerms was the ability to go back and query the index again in the combine phase on each shard. But I hadn't thought about the fact that, unlike significant terms (which relies on the assumption that the background stats on the shard are representative of the complete index), in this case you may not have all the information on the shard to be able to find all the hits.
If this could be represented as some constant-size data structure, regardless of how many documents/geo-points are involved, then it would work and be scalable. If it requires recording every geo-point, then cluster thrashing would indeed ensue...
It started to feel like an image compression problem to me - accuracy and space required are based on:
This is sort of like a geo-spatial MoreLikeThis query...
You may already be planning this, but I'd like to note that it would be useful if this ConflationAggregator were generalized so that it could support not only geospatial but also temporal correlation. It would also be useful for it to support nesting.
I'm assuming that we want to find documents from a set A which are "nearby" a document from set B, where both sets can be defined by arbitrary filters. Spatiotemporal correlations could be implemented by dividing documents into buckets which have a well-defined order and regular intervals (such as geohash and date_histogram). Implementation is simpler if the query is executed on a pre-determined axis-aligned bounding box (2D or 3D) in this space.

First we'd query the list of non-empty buckets from each shard (plus whether each contains a document from set A, set B, or both), which results in two bitmaps per shard. The union of these bitmaps is formed; we'd then iterate over each bucket which has a document from set A and check nearby buckets to see if they have a document from set B. If such buckets are found, we launch a second query to retrieve those documents from set A. In fact, each shard would get its own set of filters, since we already know which shards had documents in which buckets.

I hope I managed to explain the main idea... I'll be doing some simulations to estimate the expected amount of network traffic between nodes under different assumptions and compression methods. The good news is that the response size from the first query is bounded by the number of buckets within the bounding box, not by the total number of documents in each shard. Bitmaps could be replaced by actual document counts for each bucket, but that would inflate the data size considerably.

I must admit I have zero experience in ES development, just in applied mathematics and programming. It would be so cool to be able to run queries like "mobile phones which were near a crime scene within a few minutes in city A in the last X days". Naturally, if we ran this for the whole country and a long time span, the number of buckets would grow significantly, and almost every bucket would have at least one mobile phone at any given time. That Home Depot example is a lot gentler on computer resources, and its sets A and B are relatively sparse.
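The per-shard bucket phase described above can be sketched in a few lines of plain Python. Simple lat/lon grid quantization stands in for geohash/date_histogram bucketing, and the cell size and neighbourhood radius are arbitrary assumptions for illustration:

```python
import math

def bucket_key(lat, lon, cell=0.02):
    # Quantize a point into a grid bucket; a stand-in for geohash cells.
    return (math.floor(lat / cell), math.floor(lon / cell))

def candidate_buckets(set_a_points, set_b_points, cell=0.02, radius_cells=1):
    """Sketch of the per-shard phase: record which buckets hold A and B
    documents (the two 'bitmaps'), then keep each A-bucket that has a
    B-bucket within `radius_cells` (Chebyshev distance). Only bucket
    keys would travel between nodes, not the documents themselves."""
    a_buckets = {bucket_key(*p, cell) for p in set_a_points}
    b_buckets = {bucket_key(*p, cell) for p in set_b_points}
    hits = set()
    for (x, y) in a_buckets:
        for dx in range(-radius_cells, radius_cells + 1):
            for dy in range(-radius_cells, radius_cells + 1):
                if (x + dx, y + dy) in b_buckets:
                    hits.add((x, y))
    return hits
```

The returned keys would then seed the second, per-shard query that retrieves the actual set-A documents, as described in the comment.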
If the bounding box consists of 2048 x 2048 buckets (a 2D case), it would take 23*n bits for n non-empty buckets with naive encoding (1 bit to indicate whether the bucket is for set A or B, plus 2 * 11 bits to encode its index i = x + 2048 * y), about 400 * n^0.7 bits with PNG compression, and 114 * n^0.77 bits with exponential-Golomb coded indexes (delta to the previous index + 1 bit for set identification).

I used http://team358.org/files/frc_records/2010_Nighttime_PopDist_1.gif as a basis for random point sampling; an example image is attached here. The red channel forms set A (212.5k buckets) and the green channel forms set B (212.0k buckets); the PNG is 548 KB and the fill rate is 0.5 * (212500 + 212000) / 2048^2 = about 5% for both channels. This corresponds to a grid size of approximately 1.31 miles or 2.1 kilometers across the whole USA, which is about 2680 miles wide, assuming that the grid fits it perfectly.

For example, with the exponential-Golomb code and a 10% fill rate per channel, this 2048 x 2048 map from 10 nodes would create about 114 * (2048^2 * 0.1 * 2)^0.77 * (10 - 1) / 8e6 = 4.7 megabytes of data to be sent to the coordinating node. This would be sufficient to determine which buckets contain documents from set A that have a document from set B nearby. A 20% fill rate would cause 8 MB of traffic, and a 35% fill rate 12.3 MB. The resulting secondary query might grow large if set B is "dense" (meaning that most buckets have at least one document from set B). Results should hold for the 3D case as well but are affected by the document distribution within the grid.
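The traffic figures above can be reproduced with a few lines of arithmetic. The 114 * n^0.77 formula is the empirical exponential-Golomb fit quoted in the comment, not an exact coding bound:

```python
def golomb_traffic_mb(grid=2048, fill=0.10, shards=10):
    """Estimated data sent to the coordinating node, in megabytes.
    n counts non-empty buckets across both channels (sets A and B);
    bits ~ 114 * n^0.77 is the empirical exponential-Golomb estimate;
    (shards - 1) shard responses travel to the coordinating node."""
    n = grid ** 2 * fill * 2
    bits = 114 * n ** 0.77
    return bits * (shards - 1) / 8e6  # bits -> megabytes
```

Plugging in the comment's fill rates gives back its 4.7 MB, 8 MB, and 12.3 MB numbers, which is reassuring for the claim that first-phase traffic is bounded by bucket count rather than document count.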
Another use case to consider - using a geo demographic filter which is derived from another set.
For the use case I outlined, I thought I'd run through this on some real data as a practical exercise. A javascript client made the bridge manually from a house sales index (0.5m houses) to a crimes index (5m crimes) to do some analysis - screenshots here: https://twitter.com/elasticmark/status/608639164364550144. The steps were:
Hi, any plan to implement this feature in the near future?
Team discussed; this is currently blocked on needing multi-pass aggregation support.
This is not something we plan to implement in aggregations in the near future, and it has been superseded by our focus on ESQL. Closing as not planned. If you feel strongly about this one, please let us know.
Overview
GeoDistance search achieves the most basic spatial search use-case: "Find all points of interest within R units of known location X". This is great when the starting location is already known (e.g., cell phone GPS, travelling location of interest) but doesn't work for conflation use-cases such as: "Find all CVS Pharmacies that are within 1 mile of a Walgreens, or all Home Depots that are within 2 blocks of a Lowes". (After all, competition is essential to a free-enterprise economy.) The current search toolbox (query, filter, aggregations, reducer, percolator) doesn't lend itself well to these types of queries without writing a bit of complex scripting.
This feature will add what we'll temporarily call a ConflationAggregator. The purpose of this aggregator is to achieve the above use case. A primary aggregator defines the primary list of buckets (e.g., Home Depots). A list of secondary <filter, aggregator, query> tuples defines the operation or post-filter to perform using the documents in the primary's result set.
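As a rough illustration of the intended semantics (not the actual implementation), the primary/secondary interaction can be sketched as a two-pass check over in-memory points. The haversine helper and the 10-mile threshold below are assumptions for the example:

```python
import math

def haversine_miles(p, q):
    """Great-circle distance between two (lat, lon) pairs, in miles."""
    r = 3958.8  # mean Earth radius in miles
    la1, lo1, la2, lo2 = map(math.radians, (*p, *q))
    a = (math.sin((la2 - la1) / 2) ** 2
         + math.cos(la1) * math.cos(la2) * math.sin((lo2 - lo1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def conflate(primary, secondary, max_miles):
    """Pass 1 collects the primary buckets; pass 2 keeps each primary
    point with at least one secondary point within max_miles (i.e. a
    per-primary geo_distance check, which the aggregator would run
    as a secondary filter)."""
    return [p for p in primary
            if any(haversine_miles(p, s) <= max_miles for s in secondary)]

# Hypothetical coordinates: two Home Depots, one Lowes near the first.
home_depots = [(40.7128, -74.0060), (34.0522, -118.2437)]
lowes = [(40.7306, -73.9352)]
nearby = conflate(home_depots, lowes, 10)  # only the first Home Depot qualifies
```

In the real aggregator the secondary pass would of course be a distributed query rather than a scan, which is exactly the scalability question debated in the comments below.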
Conflation Aggregation Structure
Below is an initial cut at the colocation grammar. There can be one to many secondary filters, to achieve multi-conflation queries such as: "Find all Home Depots within 10 miles of a Lowes or Ace Hardware". The design is intended to be flexible enough to avoid limiting this aggregation to geo-only queries.
Example
Query
This is an initial rough idea of how to use the conflation aggregator to achieve a complex geodistance query like: "Find all Home Depots that are within 10 miles of a Lowes".
Result
This issue is open for discussion around use-cases (non-geo), design, naming, etc.