
Conflation Aggregator #11460

Closed
nknize opened this issue Jun 2, 2015 · 18 comments
Labels
:Analytics/Geo Indexing, search aggregations of geo points and shapes >feature high hanging fruit stalled Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@nknize
Contributor

nknize commented Jun 2, 2015

Overview

GeoDistance search covers the most basic spatial use case: "Find all points of interest within R units of a known location X". This works well when the starting location is already known (e.g., a cell phone GPS fix or a travelling location of interest), but it doesn't cover conflation use cases such as: "Find all CVS pharmacies that are within 1 mile of a Walgreens" or "Find all Home Depots that are within 2 blocks of a Lowes". (After all, competition is essential to a free-enterprise economy.) The current search toolbox (query, filter, aggregations, reducers, percolator) doesn't lend itself well to these types of queries without a fair amount of complex scripting.

This feature will add what we'll temporarily call a ConflationAggregator. The purpose of this aggregator is to achieve the above use case. A primary aggregator defines the primary list of buckets (e.g., Home Depots). A list of secondary <filter, aggregator, query> tuples defines the operation or post-filter to perform on the documents in the result set of the primary.

Conflation Aggregation Structure

Below is an initial cut at the colocation grammar. There can be one-to-many secondary filters to achieve multi-conflation queries, such as: "Find all Home Depots within 10 miles of a Lowes or Ace Hardware". The design is intended to be flexible enough to avoid limiting this aggregation to geo-only queries.

"aggs" : {
  "<aggregation_name>" : {
    "conflation" : {
      "primary" : {
        "<filter> | <parent aggregation>"
      },
      "secondary" : [
        "<secondary_name>" : {
          "<filter> | <child aggregation>"
        }
      ]
    }
  }
}

Example

Query

This is an initial rough idea of how the conflation aggregator could be used to achieve a complex geo_distance query like: "Find all Home Depots that are within 10 miles of a Lowes".

{
  "query": {"match_all": {}}, 
  "aggs": {
    "HomeDepots" : {
      "conflation" : {
        "primary" : {
          "filter" : {
            "bool" : {
              "must" : [
                {"term" : {"name" : "Home Depot"}},
                {"term" : {"businessType": "Home Improvement"}}
              ]
            }
          }
        },
        "secondary": [
          { "Lowes": {
              "filter": {
                "and" : {
                  "filters" : [
                    {"term" : {"name" : "Lowes"}},
                    {"geo_distance": {
                      "field": "location",
                      "origin_field": "HomeDepots.location",
                      "ranges": [
                        {"from" : 0, "to" : 10}
                      ]
                    }}
                  ]
                }
              }
          }}
        ]
      }
    }
  }
}

Result

"aggregations": {
  "HomeDepots": {
     "buckets": [
        {
           "key" : "Home Depot",
           "id" : 1034,
           "doc_count": 2
        },
        {
           "key" : "Home Depot",
           "id" : 3432,
           "doc_count": 2
        },
        {
           "key" : "Home Depot",
           "id" : 5644,
           "doc_count": 1
        },
        {
           "key" : "Home Depot",
           "id" : 8999,
           "doc_count": 1
        },
        {
           "key" : "Home Depot",
           "id" : 10232,
           "doc_count": 2
        }
     ]
  }
}

This issue is open for discussion around use-cases (non-geo), design, naming, etc.

@nknize nknize added >feature discuss :Analytics/Geo Indexing, search aggregations of geo points and shapes :Analytics/Aggregations Aggregations labels Jun 2, 2015
@clintongormley
Contributor

Hi @nknize

This sounds exciting, but I don't see how it can work in a distributed environment in a scalable way. Surely all nodes would need to share all location data to get accurate results?

@nknize
Contributor Author

nknize commented Jun 2, 2015

@clintongormley performance is certainly the primary concern. In a way this is a multi (potentially exponential) query wrapped in aggregator clothing so it would need to be limited to small use cases. That may make the feature DOA? I'm not sure. That's why I'm super interested in soliciting feedback from everyone on ideas for providing easy-to-use tools for accomplishing these types of common use-cases.

@markharwood
Contributor

Presumably in this "find an X near a Y" case we are starting from the assumption that X and Y are both distinct Elasticsearch documents?
As Clinton points out, the initial problem is that the various X and Y docs may be scattered across different shards.
It feels like a two-step op:

  1. query the X docs and use a geohashgrid agg or similar to build a geo-mask
  2. query the Y docs and use the mask from 1 as a filter
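The two steps above could be sketched as plain query-dict builders. This is only a sketch: the helper names are hypothetical, and the use of a geohash_grid agg for step 1 and geohash_cell-style clauses for step 2 is an assumption, not a settled API.

```python
# Hypothetical sketch of the two-step op above, as plain query-dict builders.

def step1_request(x_filter, precision=5):
    """Step 1: aggregate docs matching x_filter into geohash grid cells."""
    return {
        "size": 0,
        "query": x_filter,
        "aggs": {
            "x_mask": {"geohash_grid": {"field": "location",
                                        "precision": precision}}
        },
    }

def step2_request(y_filter, mask_cells):
    """Step 2: query the Y docs, restricted to the non-empty step-1 cells."""
    return {
        "query": {
            "bool": {
                "must": [y_filter],
                # One clause per cell in the step-1 mask.
                "should": [{"geohash_cell": {"location": c}}
                           for c in mask_cells],
                "minimum_should_match": 1,
            }
        }
    }
```

The mask size (and hence the step-2 request size) is bounded by the number of non-empty grid cells, which is the property the later comments try to exploit.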

@clintongormley
Contributor

@nknize yeah - we avoid adding features that don't scale horizontally, e.g. see the discussion on #3278 about the potential meltdown of cross-shard joins

@nknize
Contributor Author

nknize commented Jun 2, 2015

@markharwood That's basically the idea. The problem with using a grid is that it doesn't account for overlaps: there may be two Home Depots whose radius buffers overlap, with one Lowes falling within the criteria distance of both (i.e., in that Venn overlap). So the mask would probably be something along the lines of a boolean 'OR' of geo_distance filters created from step 1. Do we run the risk of cluster thrashing in this case?
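Building that boolean 'OR' of geo_distance filters from the step-1 primary locations could look like the following sketch (hypothetical helper, plain query dicts; the exact clause syntax is an assumption):

```python
# Sketch: derive the "OR of geo_distance filters" mask from the primary hits.
# Each primary (Home Depot) location becomes one geo_distance clause; a
# secondary (Lowes) doc matches if it falls inside ANY buffer, which handles
# the overlapping-buffer case naturally.

def distance_mask(primary_locations, radius="10mi", field="location"):
    return {
        "bool": {
            "should": [
                {"geo_distance": {"distance": radius, field: loc}}
                for loc in primary_locations
            ],
            "minimum_should_match": 1,
        }
    }
```

Note the clause count grows linearly with the number of primary hits, which is exactly the scalability worry raised above.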

@colings86 pointed to SignificantLongTerms as a possible reference implementation for something like this.

@markharwood
Contributor

Sounds like 2 options for representation of the Home Depot coverage from step 1:
a) A full set of points from which you can later compute radius buffers (good for accuracy, bad if millions of points)
b) A form of multi-polygon that describes the union of the radius buffers (good if a lot of overlaps, bad for overall accuracy).

The multi-polygon in b) would not necessarily be a single-resolution grid but a set of geohashes or similar with variable resolution.

Not sure about the SignificantLongTerms connection?

@colings86
Contributor

The reference to SignificantLongTerms was for its ability to go back and query the index again in the combine phase on each shard. But I hadn't considered that, unlike significant terms (which relies on the assumption that the background stats on each shard are representative of the complete index), in this case you may not have all the information on the shard to be able to find all the hits.

@clintongormley
Contributor

So the mask would probably be something along the lines of a boolean 'OR' of geo_distance filters created from step 1. Do we run the risk of cluster thrashing in this case?

If this could be represented as some constant-size data structure, regardless of how many documents/geo-points are involved, then it would work and be scalable. If it requires recording every geo-point, then cluster thrashing would indeed ensue...

@markharwood
Contributor

It started to feel like an image compression problem to me - accuracy and space required are based on:

  1. The resolution (number of pixels) you want to spend describing a landscape and
  2. The aggressiveness of the compression algorithms that bundle nearby similar items into blocks of the same colour
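As a toy illustration of that trade-off, a run-length encoding of a flattened occupancy grid shows how blocky masks compress well while noisy ones don't (illustrative only, not a proposed wire format):

```python
# Illustrative run-length encoding of a flattened 0/1 occupancy grid, in the
# spirit of the image-compression analogy above: long runs of identical cells
# (blocky coverage) compress well; alternating cells (noisy coverage) do not.

def rle_encode(cells):
    """cells: iterable of 0/1 ints; returns a list of (value, run_length)."""
    runs = []
    for c in cells:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([c, 1])       # start a new run
    return [tuple(r) for r in runs]
```

A mask of 1000 covered cells in one block encodes to a single pair, while a checkerboard of the same size encodes to 1000 pairs, which is the "resolution vs. compression" tension in a nutshell.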

@mikemccand
Contributor

This is sort of like a geo-spatial MoreLikeThis query...

@xandyj

xandyj commented Jun 8, 2015

You may already be planning this, but I'd like to note that it would be useful if this ConflationAggregator would be generalized so that it could support not only geospatial but also temporal correlation. Also, it would be useful for it to support nesting.

@nikonyrh

nikonyrh commented Jun 8, 2015

I'm assuming that we want to find documents from a set A which are "nearby" a document from set B, these can be defined by any filters.

Spatiotemporal correlations could be implemented by dividing documents into buckets which have a "well defined order" and regular intervals (such as geohash and date_histogram). The implementation is simpler if the query is executed over a pre-determined axis-aligned bounding box (2D or 3D) in this space.

First we'd query the list of non-empty buckets from each shard (along with whether each bucket contains a document from set A, set B, or both); this results in two bitmaps per shard. The union of these bitmaps is formed, and we'd iterate over each bucket that has a document from set A and check nearby buckets for a document from set B.

If such buckets are found, then we need to launch a second query to retrieve those documents from set A. Actually, each shard would get its own set of filters, since we already know which shards had documents in which buckets. I hope I managed to explain the main idea...
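The bucket-level pass described above can be sketched as a small coordinator-side function (hypothetical; buckets are modeled as (x, y) integer pairs and shard reports as sets):

```python
# Sketch of the bucket-level pass: each shard reports which (x, y) buckets
# contain a set-A doc and which contain a set-B doc; the coordinator unions
# the reports and keeps only the A buckets with a B bucket nearby.

def conflate_buckets(shard_reports, reach=1):
    """shard_reports: list of (a_cells, b_cells), each a set of (x, y)."""
    a_union, b_union = set(), set()
    for a_cells, b_cells in shard_reports:
        a_union |= a_cells
        b_union |= b_cells
    hits = set()
    for (x, y) in a_union:
        # Check the (2*reach+1)^2 neighborhood for any B bucket.
        if any((x + dx, y + dy) in b_union
               for dx in range(-reach, reach + 1)
               for dy in range(-reach, reach + 1)):
            hits.add((x, y))
    return hits
```

The surviving A buckets then drive the per-shard second query, since the coordinator knows which shards contributed which buckets.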

I'll be doing some simulations to estimate the expected amount of network traffic between nodes under different assumptions and compression methods. The good news is that the response size of the first query is bounded by the number of buckets within the bounding box, not by the total number of documents in each shard. Bitmaps could be replaced by actual document counts for each bucket, but that would inflate the data size considerably. I must admit I have zero experience in ES development, just in applied mathematics and programming.

It would be so cool to be able to run queries like "mobile phones which were near a crime scene within a few minutes, in city A, in the last X days". Naturally, if we ran this for a whole country over a long time span, the number of buckets would grow significantly, and almost every bucket would contain at least one mobile phone at any given time. The Home Depot example is a lot gentler on computing resources, since sets A and B are relatively sparse.

@nikonyrh

nikonyrh commented Jun 8, 2015

If the bounding box consists of 2048 x 2048 buckets (a 2D case) it would take 23*n bits for n non-empty buckets with naive encoding (1 bit to indicate if bucket is for set A or B, 2 * 11 bits to encode its index i = x + 2048 * y), about 400 * n^0.7 bits with PNG compression and 114 * n^0.77 bits with exponential-Golomb coded indexes (delta to previous index + 1 bit for set identification).
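A minimal sketch of the delta + order-0 exponential-Golomb index coding described here (illustrative, producing bit strings; a real implementation would pack bits into bytes):

```python
# Order-0 exponential-Golomb coding of delta-encoded bucket indexes: small
# gaps between consecutive non-empty buckets (dense regions) cost few bits,
# large gaps cost more -- hence the sublinear growth estimated above.

def expgolomb_bits(n):
    """Bit string for a non-negative integer n (order-0 exp-Golomb)."""
    b = bin(n + 1)[2:]                # binary representation of n + 1
    return "0" * (len(b) - 1) + b     # leading zeros signal the bit length

def encode_indexes(indexes):
    """Delta-encode sorted bucket indexes, exp-Golomb-code each gap."""
    out, prev = [], -1
    for i in indexes:
        out.append(expgolomb_bits(i - prev - 1))  # gap to previous index
        prev = i
    return "".join(out)
```

With this code a gap of 0 costs 1 bit and a gap of 3 costs 5 bits, so adjacent non-empty buckets are nearly free while isolated ones pay roughly 2*log2(gap) bits.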

I used http://team358.org/files/frc_records/2010_Nighttime_PopDist_1.gif as a basis for random point sampling; an example image is attached here. The red channel forms set A (212.5k buckets) and the green channel forms set B (212.0k buckets); the PNG is 548 KB and the fill rate is 0.5 * (212500 + 212000) / 2048^2 ≈ 5% for both channels. This corresponds to a grid size of approximately 1.31 miles (2.1 km) across the whole USA, which is about 2680 miles wide, assuming the grid fits it perfectly.

[attached image: setmap]

For example, with the exponential-Golomb code and a 10% fill rate per channel, this 2048 x 2048 map from 10 nodes would create about 114 * (2048^2 * 0.1 * 2)^0.77 * (10 - 1) / 8e6 ≈ 4.7 megabytes of data to be sent to the coordinating node. This would be sufficient to determine which buckets contain documents from set A that have a document from set B nearby. A 20% fill rate would cause 8 MB of traffic, and a 35% fill rate 12.3 MB.

The resulting secondary query might grow large if set B is "dense" (meaning that most buckets have at least one document from set B). Results should hold for 3D case as well but are affected by document distribution within the grid.

@markharwood
Contributor

Another use case to consider: a geo-demographic filter derived from another data set,
e.g. the set of geohashes from this open UK house-sales data, which lets us get the median house price in a region as an indication of affluence.
Deriving these geohashes is not simply an "all docs that match query" filter (like your Home Depot example) but an "all buckets that meet a threshold" post-filter for aggs, e.g. all regions where the 50th-percentile house price is >500k.

[attached image: query template]

@markharwood
Contributor

For the use case I outlined, I thought I'd run through this on some real data as a practical exercise. A JavaScript client manually bridged from the house-sales index (0.5m houses) to the crimes index (5m crimes) to do some analysis - screenshots here: https://twitter.com/elasticmark/status/608639164364550144

The steps were:

  1. Queried house price index, summarising by geohash and median price
  2. Selected the geohash cells with high median house values (> £750k). "Bucket reducers" may help here.
  3. Parsed the geohashes into lat/lon bounding boxes for use in many geo_bounding_box filters wrapped in a bool should array. This was messy; it clearly needs some compression to merge adjacent geohashes into more succinct filter expressions.
  4. Ran the geo-filtered query created in 3) on the crimes index with an agg on the crime_type field to see which crimes correlate with high-value areas.
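Step 3 above (parsing geohashes into lat/lon bounding boxes) follows the standard geohash bit interleaving; a minimal sketch with a hypothetical helper name:

```python
# Decode a geohash string into the lat/lon bounding box needed for a
# geo_bounding_box filter. Geohash interleaves bits longitude-first and
# halves the matching range on each bit (standard base-32 alphabet).

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_bbox(gh):
    lat, lon = [-90.0, 90.0], [-180.0, 180.0]
    even = True                       # even bit positions refine longitude
    for ch in gh:
        val = BASE32.index(ch)
        for bit in range(4, -1, -1):  # 5 bits per base-32 character
            rng = lon if even else lat
            mid = (rng[0] + rng[1]) / 2
            if (val >> bit) & 1:
                rng[0] = mid          # bit set: keep the upper half
            else:
                rng[1] = mid          # bit clear: keep the lower half
            even = not even
    return {"top_left": {"lat": lat[1], "lon": lon[0]},
            "bottom_right": {"lat": lat[0], "lon": lon[1]}}
```

Each decoded box can then be dropped into a geo_bounding_box clause; merging adjacent boxes before that step is the compression this comment asks for.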

@piotrantosik

Hi,

Any plans to implement this feature in the near future?

@wchaparro
Member

Team discussed; currently blocked on needing multi-pass aggregation support.

@wchaparro
Member

This is not something we plan to implement in the near future in aggregations, and it has been superseded by our focus on ESQL. Closing as not planned. If you feel strongly about this one, please let us know.

@wchaparro closed this as not planned on Jul 2, 2024