Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding a geospatial heatmap aggregation #20666

Closed
wants to merge 6 commits into from
Closed

Conversation

sstults
Copy link

@sstults sstults commented Sep 27, 2016

Closes #20665.

A few of implementation notes:

First, the geom parameter contains a field name and so does the agg specification itself. That's a little repetitive but I couldn't see a clean way around that (feedback welcome of course).

Second, I would have liked to have implemented this as a plugin, but several of the core-tests classes aren't available yet to the test framework. I'll open a separate ticket for discussion about that.

Third, I would have also like to have passed docValues throughout the agg tree, but docValues for shapes seem to be very experimental at this stage.

Lastly, the core Lucene heatmap collector takes sort of a top-down approach to building the counts array. It starts from a top-level IndexReaderContext, but really what we'd want there is a LeafReaderContext. I've tried to work around that, but quite possibly a refactor of the Lucene class would make the heatmap agg more efficient.

@sstults
Copy link
Author

sstults commented Sep 27, 2016

My company had previously signed the company agreement but I guess that didn't jive with the CLA check. I just signed the other two individual agreements.

@karmi
Copy link
Contributor

karmi commented Sep 27, 2016

Hi @sstults, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in yout Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile?

@sstults
Copy link
Author

sstults commented Sep 27, 2016

@karmi, I just added that email address. Thanks!

@jpountz
Copy link
Contributor

jpountz commented Sep 28, 2016

This looks pretty similar to the geo-hash grid aggregation. Can you comment on what it brings?

@sstults
Copy link
Author

sstults commented Sep 28, 2016

The key difference between the geohash grid agg and the heatmap is that the geohash grid can only deal with indexed points (geo_point). This agg was built specifically to handle geo_shape fields so that they can be summarized on a map.

The output format is also different, and may be a little easier for clients to integrate with. Rather than buckets labelled with geohash prefixes and their corresponding counts, the output of the hashmap is an array of arrays of ints (which are the # of hits). That eliminates the need to map geohash prefixes to spatial areas.

@jpountz
Copy link
Contributor

jpountz commented Sep 29, 2016

Thanks, I get it better now. I need to think more whether this should be the same aggregation, but at the very least it should be a similar response format. So I guess we'd either need to change the response format of the geo-hash grid agg to look more like this one of the format of this one to look more like the geo-hash grid agg. I can see benefits to both, ie. the geo-hash grid works better if there is some sparsity in the data while this one is more compact if most cells have data.

@epixa Any opinions about what response format would be best from a Kibana perspective? We could either have something like the geo-hash grid agg, ie.

        {
            "buckets": [
                {
                    "key": "svz",
                    "doc_count": 10964
                },
                {
                    "key": "sv8",
                    "doc_count": 3198
                },
                [...]
            ]
        }

or

        {
            "grid_level":4,
            "rows":7,
            "columns":6,
            "min_x":45.0,
            "min_y":11.25,
            "max_x":180.0,
            "max_y":90.0,
            "counts": [
                [0,0,2,1,0,0],
                [0,0,1,1,0,0],
                [0,1,1,1,0,0],
                [0,0,1,1,0,0],
                [0,0,1,1,0,0],
                [],
                []
            ]
        }

@nknize I think you've been working on different ways to encode geo shapes using points APIs, do you know how challenging it would be to have a similar aggregation for the new API? We might want to be careful about merging this kind of aggregation if the new format would make such aggregations hard.

@epixa
Copy link
Contributor

epixa commented Oct 5, 2016

@jpountz Sorry I've been pulled in a couple of directions here and haven't had a chance to give this some proper thought.

@thomasneirynck Do you have any opinions on the returned data shape? #20666 (comment)

@sstults
Copy link
Author

sstults commented Oct 5, 2016

The last commit adds the ability to calculate the grid level implicitly by specifying dist_err_pct and optionally dist_err. The first is a ratio of the cell area not covered by the underlying shape, whereas the second parameter measures the absolute error area.

@thomasneirynck
Copy link
Contributor

In general, I like the proposal of the grid-like format. It is denser, and especially for contiguous data it allows for more efficient transport of the result, and it could allow for easier/faster implementations on the clients.

That said, right now, I'd stick with the key-value pair format:

  • sparsity seems quite common. the bucket-array handles this better. In the gridded approach, we'd have to start adding no-data values, or clutter the grid-array with nulls, ...
  • the implementation in Kibana doesn't take advantage of raster data-structures (yet) for rendering grids. Kibana creates a vector-shape for each grid-cell.
  • the cost of unhashing a geohash-string to its lon-lat bounding box is not prohibitive
  • it is consistent with the existing output format for point-based aggregation.

I am interested though. For a future enhancement, it would be nice to be able to specify the output format for such an aggregation. ES could optionally output a numerical array, and clients (such as Kibana) could leverage this for more efficient rendering (e.g. by loading the result straight into an image or texture).

@sstults
Copy link
Author

sstults commented Oct 5, 2016

I really like the idea of having multiple output formats. In fact the heatmaps could very easily be sent back as base-64 encoded PNGs.

Geohashes have a human-readable character representation, whereas quadtrees only have a binary representation. I don't think the actual quadtree prefix is accessible from existing queries, so instead I could send back labels like:

{
    ...
    "counts": [
            {
                "key": "8,23",
                "doc_count": 10964
            },
            {
                "key": "8,24",
                "doc_count": 3198
            }
    ]
}

The keys would be x,y coordinates on the heatmap.

I couldn't find an algorithm that maps quadtree cells to geohashes, but if someone were to point me in the right direction then we could get output to exactly match geohash grid as well.

@jpountz
Copy link
Contributor

jpountz commented Oct 5, 2016

I'll wait for @nknize to comment since he might have ideas about the response format. On my end I don't like having multiple response formats since this would be a nightmare to maintain.

@clintongormley
Copy link
Contributor

The geohash grid format allows us to have sub-aggs within each cell, eg the geo centroid agg, or terms, histos, whatever. The grid doesn't allow for this flexibility

@clintongormley clintongormley added >feature :Analytics/Geo Indexing, search aggregations of geo points and shapes :Analytics/Aggregations Aggregations labels Oct 7, 2016
@nknize
Copy link
Contributor

nknize commented Oct 7, 2016

Yeah, I like the motivation for having a grid aggregation compatible with geo_shape but I have some concerns with this particular approach, and I think before we can provide a grid aggregator that works with geo_shape types there are a number of smaller issues that need to be done first. Here are my concerns/recommendations:

  1. geo_shape still relies on JTS 1.13 which is LGPL. For this reason geo_shape is an optional field type that is only available if the lucene spatial module and jts libraries are available. At minimum this code should check for library availability to avoid runtime failures.
  2. More importantly, this feature (as its currently coded) should be a separate plugin since its designed for one goal in mind - geo_shape heatmaps. That means we should do it right the first time and fix the test dependencies before committing. Before making it a plugin, though, I think we need to take a step back and consider the next two points.
  3. To @clintongormley point this is implemented as an InternalMetricsAggregation but it should really be a bucket aggregation. The only difference between this and geohash_grid are the grid boundaries. I think the correct approach is to instead implement a general geo_grid bucket aggregation that provides a "grid_type" parameter for selecting the grid boundary used by the aggregation. This approach enables us to provide different grid boundaries (e.g., geohash, quadtree) and allows us to implement new ones (e.g., hexagon, triangular) as needed. This idea came up when ESRI requested a similar feature using hex grids. I previously opened issue Add Triangle and Hex Grid Encoding for GeoGridAggregator #16895 to capture this refactor/generic-grid idea.
  4. To @jpountz point, HeatmapFacetCounter relies on the PrefixTree encoding format. For geo_shape types this is in the process of being replaced by a new BKD/Points encoding that is more efficient and compact. Since its a breaking change that encoding won't likely be available until 6.0. In the meantime the general geo_grid aggregation can use the RPT based HeatmapFacetCounter for aggregating geo_shape types only when "grid_type" is set to "quadtree". Then once the BKD based shape encoding is available geo_shape types can be aggregated using any grid implementation. It will also no longer require JTS, so the library check will go away, and it can reside in core instead of a separate plugin.
  5. To minimize impact to Kibana, I don't think the json output format should change. It should remain like a bucket aggregator that uses the key-value format. We've had the "compact output" discussion before on the MultiField Stats plugin issue (don't have that issue number handy at the moment).

@sstults
Copy link
Author

sstults commented Jan 3, 2017

@nknize, I agree. Closing the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations :Analytics/Geo Indexing, search aggregations of geo points and shapes >feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants