
Conflation Aggregator #11460

Closed
nknize opened this issue Jun 2, 2015 · 18 comments
Labels
:Analytics/Geo Indexing, search aggregations of geo points and shapes >feature high hanging fruit stalled Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@nknize
Contributor

nknize commented Jun 2, 2015

Overview

GeoDistance search covers the most basic spatial use case: "Find all points of interest within R units of a known location X". This works well when the starting location is already known (e.g., a cell phone GPS fix or a travelling location of interest), but it doesn't cover conflation use cases such as: "Find all CVS pharmacies that are within 1 mile of a Walgreens" or "Find all Home Depots that are within 2 blocks of a Lowes". (After all, competition is essential to a free-enterprise economy.) The current search toolbox (query, filter, aggregations, reducers, percolator) doesn't lend itself well to these types of queries without a fair amount of complex scripting.

This feature will add what we'll temporarily call a ConflationAggregator. The purpose of this aggregator is to achieve the above use case. A primary aggregator defines the primary list of buckets (e.g., Home Depots). A list of secondary <filter, aggregator, query> tuples defines the operation or post-filter to perform on the documents in the result set of the primary.

Conflation Aggregation Structure

Below is an initial cut at the colocation grammar. There can be one-to-many secondary filters to achieve multi-conflation queries, such as: "Find all Home Depots within 10 miles of a Lowes or Ace Hardware". The design is intended to be flexible enough to avoid limiting this aggregation to geo-only queries.

"aggs" : {
  "<aggregation_name>" : {
    "conflation" : {
      "primary" : {
        "<filter> | <parent aggregation>"
      },
      "secondary" : [
        "<secondary_name>" : {
          "<filter> | <child aggregation>"
        }
      ]
    }
  }
}

Example

Query

This is an initial rough idea of how the conflation aggregator could be used to achieve a complex geo_distance query like: "Find all Home Depots that are within 10 miles of a Lowes".

{
  "query": {"match_all": {}}, 
  "aggs": {
    "HomeDepots" : {
      "conflation" : {
        "primary" : {
          "filter" : {
            "bool" : {
              "must" : [
                {"term" : {"name" : "Home Depot"}},
                {"term" : {"businessType": "Home Improvement"}}
              ]
            }
          }
        },
        "secondary": [
          { "Lowes": {
              "filter": {
                "and" : {
                  "filters" : [
                    {"term" : {"name" : "Lowes"}},
                    {"geo_distance": {
                      "field": "location",
                      "origin_field": "HomeDepots.location",
                      "ranges": [
                        {"from" : 0, "to" : 10}
                      ]
                    }}
                  ]
                }
              }
          }}
        ]
      }
    }
  }
}

Result

"aggregations": {
  "HomeDepots": {
     "buckets": [
        {
           "key" : "Home Depot",
           "id" : 1034,
           "doc_count": 2
        },
        {
           "key" : "Home Depot",
           "id" : 3432,
           "doc_count": 2
        },
        {
           "key" : "Home Depot",
           "id" : 5644,
           "doc_count": 1
        },
        {
           "key" : "Home Depot",
           "id" : 8999,
           "doc_count": 1
        },
        {
           "key" : "Home Depot",
           "id" : 10232,
           "doc_count": 2
        }
     ]
  }
}

This issue is open for discussion around use-cases (non-geo), design, naming, etc.

@nknize nknize added >feature discuss :Analytics/Geo Indexing, search aggregations of geo points and shapes :Analytics/Aggregations Aggregations labels Jun 2, 2015
@clintongormley
Contributor

Hi @nknize

This sounds exciting, but I don't see how it can work in a distributed environment in a scalable way. Surely all nodes would need to share all location data to get accurate results?

@nknize
Contributor Author

nknize commented Jun 2, 2015

@clintongormley performance is certainly the primary concern. In a way this is a multi (potentially exponential) query wrapped in aggregator clothing so it would need to be limited to small use cases. That may make the feature DOA? I'm not sure. That's why I'm super interested in soliciting feedback from everyone on ideas for providing easy-to-use tools for accomplishing these types of common use-cases.

@markharwood
Contributor

Presumably in this "find an X near a Y" case we are starting from the assumption that X and Y are both distinct Elasticsearch documents?
As Clinton points out, the initial problem is that the various X and Y docs may be scattered across different shards.
It feels like a two-step op:

  1. query the X docs and use a geohashgrid agg or similar to build a geo-mask
  2. query the Y docs and use the mask from 1 as a filter
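The two steps above could be sketched as plain query-dict builders. This is only a sketch: the helper names are hypothetical, and the use of a geohash_grid agg for step 1 and geohash_cell-style clauses for step 2 is an assumption, not a settled API.

```python
# Hypothetical sketch of the two-step op above, as plain query-dict builders.

def step1_request(x_filter, precision=5):
    """Step 1: aggregate docs matching x_filter into geohash grid cells."""
    return {
        "size": 0,
        "query": x_filter,
        "aggs": {
            "x_mask": {"geohash_grid": {"field": "location",
                                        "precision": precision}}
        },
    }

def step2_request(y_filter, mask_cells):
    """Step 2: query the Y docs, restricted to the non-empty step-1 cells."""
    return {
        "query": {
            "bool": {
                "must": [y_filter],
                # One clause per cell in the step-1 mask.
                "should": [{"geohash_cell": {"location": c}}
                           for c in mask_cells],
                "minimum_should_match": 1,
            }
        }
    }
```

The mask size (and hence the step-2 request size) is bounded by the number of non-empty grid cells, which is the property the later comments try to exploit.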

@clintongormley
Contributor

@nknize yeah - we avoid adding features that don't scale horizontally, e.g. see the discussion on #3278 about the potential meltdown of cross-shard joins

@nknize
Contributor Author

nknize commented Jun 2, 2015

@markharwood That's basically the idea. The problem with using a grid is that it doesn't account for overlaps: there may be two Home Depots whose radius buffers overlap, with one Lowes falling within the criteria distance of both (i.e., in that Venn overlap). So the mask would probably be something along the lines of a boolean 'OR' of geo_distance filters created from step 1. Do we run the risk of cluster thrashing in this case?
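Building that boolean 'OR' of geo_distance filters from the step-1 primary locations could look like the following sketch (hypothetical helper, plain query dicts; the exact clause syntax is an assumption):

```python
# Sketch: derive the "OR of geo_distance filters" mask from the primary hits.
# Each primary (Home Depot) location becomes one geo_distance clause; a
# secondary (Lowes) doc matches if it falls inside ANY buffer, which handles
# the overlapping-buffer case naturally.

def distance_mask(primary_locations, radius="10mi", field="location"):
    return {
        "bool": {
            "should": [
                {"geo_distance": {"distance": radius, field: loc}}
                for loc in primary_locations
            ],
            "minimum_should_match": 1,
        }
    }
```

Note the clause count grows linearly with the number of primary hits, which is exactly the scalability worry raised above.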

@colings86 pointed to SignificantLongTerms as a possible reference implementation for something like this.

@markharwood
Contributor

Sounds like 2 options for representation of the Home Depot coverage from step 1:
a) A full set of points from which you can later compute radius buffers (good for accuracy, bad if millions of points)
b) A form of multi-polygon that describes the union of the radius buffers (good if a lot of overlaps, bad for overall accuracy).

The multi-polygon in b) would not necessarily be a single-resolution grid but a set of geohashes or similar with variable resolution.

Not sure about the SignificantLongTerms connection?

@colings86
Contributor

The reference to SignificantLongTerms was for its ability to go back and query the index again in the combine phase on each shard. But I hadn't considered that, unlike significant terms (which relies on the assumption that the background stats on each shard are representative of the complete index), in this case you may not have all the information on the shard to be able to find all the hits.

@clintongormley
Contributor

So the mask would probably be something along the lines of a boolean 'OR' of geo_distance filters created from step 1. Do we run the risk of cluster thrashing in this case?

If this could be represented as some constant-size data structure, regardless of how many documents/geo-points are involved, then it would work and be scalable. If it requires recording every geo-point, then cluster thrashing would indeed ensue...

@markharwood
Contributor

It started to feel like an image compression problem to me - accuracy and space required are based on:

  1. The resolution (number of pixels) you want to spend describing a landscape and
  2. The aggressiveness of the compression algorithms that bundle nearby similar items into blocks of the same colour
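As a toy illustration of that trade-off, a run-length encoding of a flattened occupancy grid shows how blocky masks compress well while noisy ones don't (illustrative only, not a proposed wire format):

```python
# Illustrative run-length encoding of a flattened 0/1 occupancy grid, in the
# spirit of the image-compression analogy above: long runs of identical cells
# (blocky coverage) compress well; alternating cells (noisy coverage) do not.

def rle_encode(cells):
    """cells: iterable of 0/1 ints; returns a list of (value, run_length)."""
    runs = []
    for c in cells:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([c, 1])       # start a new run
    return [tuple(r) for r in runs]
```

A mask of 1000 covered cells in one block encodes to a single pair, while a checkerboard of the same size encodes to 1000 pairs, which is the "resolution vs. compression" tension in a nutshell.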

@mikemccand
Contributor

This is sort of like a geo-spatial MoreLikeThis query...

@xandyj

xandyj commented Jun 8, 2015

You may already be planning this, but I'd like to note that it would be useful if this ConflationAggregator would be generalized so that it could support not only geospatial but also temporal correlation. Also, it would be useful for it to support nesting.

@nikonyrh

nikonyrh commented Jun 8, 2015

I'm assuming that we want to find documents from a set A which are "nearby" a document from set B, these can be defined by any filters.

Spatiotemporal correlations could be implemented by dividing documents into buckets which have a "well defined order" and regular intervals (such as geohash and date_histogram). The implementation is simpler if the query is executed over a pre-determined axis-aligned bounding box (2D or 3D) in this space.

First we'd query the list of non-empty buckets from each shard (along with whether each bucket contains a document from set A, set B, or both); this results in two bitmaps per shard. The union of these bitmaps is formed, and we'd iterate over each bucket that has a document from set A and check nearby buckets for a document from set B.

If such buckets are found, then we need to launch a second query to retrieve those documents from set A. Actually, each shard would get its own set of filters, since we already know which shards had documents in which buckets. I hope I managed to explain the main idea...
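The bucket-level pass described above can be sketched as a small coordinator-side function (hypothetical; buckets are modeled as (x, y) integer pairs and shard reports as sets):

```python
# Sketch of the bucket-level pass: each shard reports which (x, y) buckets
# contain a set-A doc and which contain a set-B doc; the coordinator unions
# the reports and keeps only the A buckets with a B bucket nearby.

def conflate_buckets(shard_reports, reach=1):
    """shard_reports: list of (a_cells, b_cells), each a set of (x, y)."""
    a_union, b_union = set(), set()
    for a_cells, b_cells in shard_reports:
        a_union |= a_cells
        b_union |= b_cells
    hits = set()
    for (x, y) in a_union:
        # Check the (2*reach+1)^2 neighborhood for any B bucket.
        if any((x + dx, y + dy) in b_union
               for dx in range(-reach, reach + 1)
               for dy in range(-reach, reach + 1)):
            hits.add((x, y))
    return hits
```

The surviving A buckets then drive the per-shard second query, since the coordinator knows which shards contributed which buckets.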

I'll be doing some simulations to estimate the expected amount of network traffic between nodes under different assumptions and compression methods. The good news is that the response size of the first query is bounded by the number of buckets within the bounding box, not by the total number of documents in each shard. Bitmaps could be replaced by actual document counts for each bucket, but that would inflate the data size considerably. I must admit I have zero experience in ES development, just in applied mathematics and programming.

It would be so cool to be able to run queries like "mobile phones which were near a crime scene within a few minutes, in city A, in the last X days". Naturally, if we ran this for a whole country over a long time span, the number of buckets would grow significantly, and almost every bucket would contain at least one mobile phone at any given time. The Home Depot example is a lot gentler on computing resources, since sets A and B are relatively sparse.

@nikonyrh

nikonyrh commented Jun 8, 2015

If the bounding box consists of 2048 x 2048 buckets (a 2D case) it would take 23*n bits for n non-empty buckets with naive encoding (1 bit to indicate if bucket is for set A or B, 2 * 11 bits to encode its index i = x + 2048 * y), about 400 * n^0.7 bits with PNG compression and 114 * n^0.77 bits with exponential-Golomb coded indexes (delta to previous index + 1 bit for set identification).
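A minimal sketch of the delta + order-0 exponential-Golomb index coding described here (illustrative, producing bit strings; a real implementation would pack bits into bytes):

```python
# Order-0 exponential-Golomb coding of delta-encoded bucket indexes: small
# gaps between consecutive non-empty buckets (dense regions) cost few bits,
# large gaps cost more -- hence the sublinear growth estimated above.

def expgolomb_bits(n):
    """Bit string for a non-negative integer n (order-0 exp-Golomb)."""
    b = bin(n + 1)[2:]                # binary representation of n + 1
    return "0" * (len(b) - 1) + b     # leading zeros signal the bit length

def encode_indexes(indexes):
    """Delta-encode sorted bucket indexes, exp-Golomb-code each gap."""
    out, prev = [], -1
    for i in indexes:
        out.append(expgolomb_bits(i - prev - 1))  # gap to previous index
        prev = i
    return "".join(out)
```

With this code a gap of 0 costs 1 bit and a gap of 3 costs 5 bits, so adjacent non-empty buckets are nearly free while isolated ones pay roughly 2*log2(gap) bits.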

I used http://team358.org/files/frc_records/2010_Nighttime_PopDist_1.gif as a basis for random point sampling; an example image is attached here. The red channel forms set A (212.5k buckets) and the green channel forms set B (212.0k buckets); the PNG is 548 KB and the fill rate is 0.5 * (212500 + 212000) / 2048^2 ≈ 5% for both channels. This corresponds to a grid size of approximately 1.31 miles (2.1 km) across the whole USA, which is about 2680 miles wide, assuming the grid fits it perfectly.

[attached image: setmap]

For example, with the exponential-Golomb code and a 10% fill rate per channel, this 2048 x 2048 map from 10 nodes would create about 114 * (2048^2 * 0.1 * 2)^0.77 * (10 - 1) / 8e6 ≈ 4.7 megabytes of data to be sent to the coordinating node. This would be sufficient to determine which buckets contain documents from set A that have a document from set B nearby. A 20% fill rate would cause 8 MB of traffic, and a 35% fill rate 12.3 MB.

The resulting secondary query might grow large if set B is "dense" (meaning that most buckets have at least one document from set B). Results should hold for 3D case as well but are affected by document distribution within the grid.

@markharwood
Contributor

Another use case to consider: a geo-demographic filter derived from another data set,
e.g. the set of geohashes from this open UK house-sales data, which lets us get the median house price in a region as an indication of affluence.
Deriving these geohashes is not simply an "all docs that match query" filter (like your Home Depot example) but an "all buckets that meet a threshold" post-filter for aggs, e.g. all regions where the 50th-percentile house price is >500k.

[attached image: query template]

@markharwood
Contributor

For the use case I outlined, I thought I'd run through this on some real data as a practical exercise. A JavaScript client manually bridged from the house-sales index (0.5m houses) to the crimes index (5m crimes) to do some analysis - screenshots here: https://twitter.com/elasticmark/status/608639164364550144

The steps were:

  1. Queried house price index, summarising by geohash and median price
  2. Selected the geohash cells with high median house values (> £750k). "Bucket reducers" may help here.
  3. Parsed the geohashes into lat/lon bounding boxes for use in many geo_bounding_box filters wrapped in a bool should array. This was messy; it clearly needs some compression to merge adjacent geohashes into more succinct filter expressions.
  4. Ran the geo-filtered query created in 3) on the crimes index with an agg on the crime_type field to see which crimes correlate with high-value areas.
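Step 3 above (parsing geohashes into lat/lon bounding boxes) follows the standard geohash bit interleaving; a minimal sketch with a hypothetical helper name:

```python
# Decode a geohash string into the lat/lon bounding box needed for a
# geo_bounding_box filter. Geohash interleaves bits longitude-first and
# halves the matching range on each bit (standard base-32 alphabet).

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_bbox(gh):
    lat, lon = [-90.0, 90.0], [-180.0, 180.0]
    even = True                       # even bit positions refine longitude
    for ch in gh:
        val = BASE32.index(ch)
        for bit in range(4, -1, -1):  # 5 bits per base-32 character
            rng = lon if even else lat
            mid = (rng[0] + rng[1]) / 2
            if (val >> bit) & 1:
                rng[0] = mid          # bit set: keep the upper half
            else:
                rng[1] = mid          # bit clear: keep the lower half
            even = not even
    return {"top_left": {"lat": lat[1], "lon": lon[0]},
            "bottom_right": {"lat": lat[0], "lon": lon[1]}}
```

Each decoded box can then be dropped into a geo_bounding_box clause; merging adjacent boxes before that step is the compression this comment asks for.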

@piotrantosik

Hi,

Any plans to implement this feature in the near future?

@wchaparro
Member

Team discussed; currently blocked on needing multi-pass aggregation support.

@wchaparro
Member

This is not something we plan to implement in the near future in aggregations, and it has been superseded by our focus on ESQL. Closing as not planned. If you feel strongly about this one, please let us know.

@wchaparro closed this as not planned on Jul 2, 2024