Adding a geospatial heatmap aggregation #20666

sstults · 2016-09-27T11:09:53Z

Closes #20665.

A few of implementation notes:

First, the geom parameter contains a field name and so does the agg specification itself. That's a little repetitive but I couldn't see a clean way around that (feedback welcome of course).

Second, I would have liked to have implemented this as a plugin, but several of the core-tests classes aren't available yet to the test framework. I'll open a separate ticket for discussion about that.

Third, I would have also like to have passed docValues throughout the agg tree, but docValues for shapes seem to be very experimental at this stage.

Lastly, the core Lucene heatmap collector takes sort of a top-down approach to building the counts array. It starts from a top-level IndexReaderContext, but really what we'd want there is a LeafReaderContext. I've tried to work around that, but quite possibly a refactor of the Lucene class would make the heatmap agg more efficient.

sstults · 2016-09-27T11:17:28Z

My company had previously signed the company agreement but I guess that didn't jive with the CLA check. I just signed the other two individual agreements.

karmi · 2016-09-27T11:17:31Z

Hi @sstults, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in yout Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile?

sstults · 2016-09-27T11:20:31Z

@karmi, I just added that email address. Thanks!

jpountz · 2016-09-28T13:52:08Z

This looks pretty similar to the geo-hash grid aggregation. Can you comment on what it brings?

sstults · 2016-09-28T18:49:35Z

The key difference between the geohash grid agg and the heatmap is that the geohash grid can only deal with indexed points (geo_point). This agg was built specifically to handle geo_shape fields so that they can be summarized on a map.

The output format is also different, and may be a little easier for clients to integrate with. Rather than buckets labelled with geohash prefixes and their corresponding counts, the output of the hashmap is an array of arrays of ints (which are the # of hits). That eliminates the need to map geohash prefixes to spatial areas.

jpountz · 2016-09-29T14:58:46Z

Thanks, I get it better now. I need to think more whether this should be the same aggregation, but at the very least it should be a similar response format. So I guess we'd either need to change the response format of the geo-hash grid agg to look more like this one of the format of this one to look more like the geo-hash grid agg. I can see benefits to both, ie. the geo-hash grid works better if there is some sparsity in the data while this one is more compact if most cells have data.

@epixa Any opinions about what response format would be best from a Kibana perspective? We could either have something like the geo-hash grid agg, ie.

        {
            "buckets": [
                {
                    "key": "svz",
                    "doc_count": 10964
                },
                {
                    "key": "sv8",
                    "doc_count": 3198
                },
                [...]
            ]
        }

or

        {
            "grid_level":4,
            "rows":7,
            "columns":6,
            "min_x":45.0,
            "min_y":11.25,
            "max_x":180.0,
            "max_y":90.0,
            "counts": [
                [0,0,2,1,0,0],
                [0,0,1,1,0,0],
                [0,1,1,1,0,0],
                [0,0,1,1,0,0],
                [0,0,1,1,0,0],
                [],
                []
            ]
        }

@nknize I think you've been working on different ways to encode geo shapes using points APIs, do you know how challenging it would be to have a similar aggregation for the new API? We might want to be careful about merging this kind of aggregation if the new format would make such aggregations hard.

epixa · 2016-10-05T12:20:38Z

@jpountz Sorry I've been pulled in a couple of directions here and haven't had a chance to give this some proper thought.

@thomasneirynck Do you have any opinions on the returned data shape? #20666 (comment)

…earch into feature/heatmap-disterr

sstults · 2016-10-05T12:48:05Z

The last commit adds the ability to calculate the grid level implicitly by specifying dist_err_pct and optionally dist_err. The first is a ratio of the cell area not covered by the underlying shape, whereas the second parameter measures the absolute error area.

thomasneirynck · 2016-10-05T14:56:12Z

In general, I like the proposal of the grid-like format. It is denser, and especially for contiguous data it allows for more efficient transport of the result, and it could allow for easier/faster implementations on the clients.

That said, right now, I'd stick with the key-value pair format:

sparsity seems quite common. the bucket-array handles this better. In the gridded approach, we'd have to start adding no-data values, or clutter the grid-array with nulls, ...
the implementation in Kibana doesn't take advantage of raster data-structures (yet) for rendering grids. Kibana creates a vector-shape for each grid-cell.
the cost of unhashing a geohash-string to its lon-lat bounding box is not prohibitive
it is consistent with the existing output format for point-based aggregation.

I am interested though. For a future enhancement, it would be nice to be able to specify the output format for such an aggregation. ES could optionally output a numerical array, and clients (such as Kibana) could leverage this for more efficient rendering (e.g. by loading the result straight into an image or texture).

sstults · 2016-10-05T22:10:46Z

I really like the idea of having multiple output formats. In fact the heatmaps could very easily be sent back as base-64 encoded PNGs.

Geohashes have a human-readable character representation, whereas quadtrees only have a binary representation. I don't think the actual quadtree prefix is accessible from existing queries, so instead I could send back labels like:

{
    ...
    "counts": [
            {
                "key": "8,23",
                "doc_count": 10964
            },
            {
                "key": "8,24",
                "doc_count": 3198
            }
    ]
}

The keys would be x,y coordinates on the heatmap.

I couldn't find an algorithm that maps quadtree cells to geohashes, but if someone were to point me in the right direction then we could get output to exactly match geohash grid as well.

jpountz · 2016-10-05T22:24:50Z

I'll wait for @nknize to comment since he might have ideas about the response format. On my end I don't like having multiple response formats since this would be a nightmare to maintain.

clintongormley · 2016-10-07T18:40:20Z

The geohash grid format allows us to have sub-aggs within each cell, eg the geo centroid agg, or terms, histos, whatever. The grid doesn't allow for this flexibility

nknize · 2016-10-07T22:15:32Z

Yeah, I like the motivation for having a grid aggregation compatible with geo_shape but I have some concerns with this particular approach, and I think before we can provide a grid aggregator that works with geo_shape types there are a number of smaller issues that need to be done first. Here are my concerns/recommendations:

geo_shape still relies on JTS 1.13 which is LGPL. For this reason geo_shape is an optional field type that is only available if the lucene spatial module and jts libraries are available. At minimum this code should check for library availability to avoid runtime failures.
More importantly, this feature (as its currently coded) should be a separate plugin since its designed for one goal in mind - geo_shape heatmaps. That means we should do it right the first time and fix the test dependencies before committing. Before making it a plugin, though, I think we need to take a step back and consider the next two points.
To @clintongormley point this is implemented as an InternalMetricsAggregation but it should really be a bucket aggregation. The only difference between this and geohash_grid are the grid boundaries. I think the correct approach is to instead implement a general geo_grid bucket aggregation that provides a "grid_type" parameter for selecting the grid boundary used by the aggregation. This approach enables us to provide different grid boundaries (e.g., geohash, quadtree) and allows us to implement new ones (e.g., hexagon, triangular) as needed. This idea came up when ESRI requested a similar feature using hex grids. I previously opened issue Add Triangle and Hex Grid Encoding for GeoGridAggregator #16895 to capture this refactor/generic-grid idea.
To @jpountz point, HeatmapFacetCounter relies on the PrefixTree encoding format. For geo_shape types this is in the process of being replaced by a new BKD/Points encoding that is more efficient and compact. Since its a breaking change that encoding won't likely be available until 6.0. In the meantime the general geo_grid aggregation can use the RPT based HeatmapFacetCounter for aggregating geo_shape types only when "grid_type" is set to "quadtree". Then once the BKD based shape encoding is available geo_shape types can be aggregated using any grid implementation. It will also no longer require JTS, so the library check will go away, and it can reside in core instead of a separate plugin.
To minimize impact to Kibana, I don't think the json output format should change. It should remain like a bucket aggregator that uses the key-value format. We've had the "compact output" discussion before on the MultiField Stats plugin issue (don't have that issue number handy at the moment).

sstults · 2017-01-03T21:15:25Z

@nknize, I agree. Closing the PR.

Adding a geospatial heatmap aggregation

126e1d4

sstults mentioned this pull request Sep 27, 2016

Heatmap support on ElasticSearch cga-harvard/Hypermap-Registry#204

Open

sstults added 3 commits October 4, 2016 12:42

added dist_err and dist_err_pct for grid_level calculation

9726967

fix typo in randomized test

58eb7b6

Added dist_err and dist_err_pct for grid_level calculation

5e978fa

Merge branch 'feature/heatmap-disterr' of github.com:sstults/elastics…

f9f28b5

…earch into feature/heatmap-disterr

clintongormley added >feature :Analytics/Geo Indexing, search aggregations of geo points and shapes :Analytics/Aggregations Aggregations labels Oct 7, 2016

clintongormley assigned nknize Oct 7, 2016

Add test to make sure multiple indices are properly accounted for

3acbe3a

sstults closed this Jan 3, 2017

ChrisHegarty unassigned nknize Oct 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding a geospatial heatmap aggregation #20666

Adding a geospatial heatmap aggregation #20666

sstults commented Sep 27, 2016

sstults commented Sep 27, 2016

karmi commented Sep 27, 2016

sstults commented Sep 27, 2016

jpountz commented Sep 28, 2016

sstults commented Sep 28, 2016

jpountz commented Sep 29, 2016

epixa commented Oct 5, 2016

sstults commented Oct 5, 2016

thomasneirynck commented Oct 5, 2016

sstults commented Oct 5, 2016

jpountz commented Oct 5, 2016

clintongormley commented Oct 7, 2016

nknize commented Oct 7, 2016 •

edited

Loading

sstults commented Jan 3, 2017

Adding a geospatial heatmap aggregation #20666

Adding a geospatial heatmap aggregation #20666

Conversation

sstults commented Sep 27, 2016

sstults commented Sep 27, 2016

karmi commented Sep 27, 2016

sstults commented Sep 27, 2016

jpountz commented Sep 28, 2016

sstults commented Sep 28, 2016

jpountz commented Sep 29, 2016

epixa commented Oct 5, 2016

sstults commented Oct 5, 2016

thomasneirynck commented Oct 5, 2016

sstults commented Oct 5, 2016

jpountz commented Oct 5, 2016

clintongormley commented Oct 7, 2016

nknize commented Oct 7, 2016 • edited Loading

sstults commented Jan 3, 2017

nknize commented Oct 7, 2016 •

edited

Loading