Feature: GROUP BY <field> #7200

ProTip · 2016-08-24T01:17:38Z

I would like to propose the ability to GROUP BY fields.

Storing high-cardinality data in tags currently blows up the database for a lot of people. Even with the new index proposal, it could cause significant paging and performance degradation. Further, the domain may be unbounded and more efficiently stored and processed as field data.

I propose that InfluxDB support grouping data on field values during query processing. Even very high-cardinality, unbounded domains would be bounded by the query's time window or time buckets, so stream-processing techniques should handle this nicely.

Use Case

Analyzing access logs that include the request path. The request path may include IDs and be technically unbounded. A person may wish to look at an hour's worth of requests grouped by time(1m), request_path to see the top requested paths per minute. Total request path cardinality would not exceed one hour's worth of points for a SELECT, or one minute's worth of points for a SELECT INTO. Backfilling this data with the path as a tag currently crashes InfluxDB.
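To make the bounded-by-time-bucket argument concrete, here is a minimal sketch in Python. It is not InfluxDB code: the function name `top_paths_per_minute`, the `(unix_seconds, request_path)` point shape, and the assumption that points arrive in ascending time order are all hypothetical. Under those assumptions only one minute's worth of distinct paths is held in memory at a time.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

BUCKET_SECONDS = 60  # mirrors GROUP BY time(1m)


def top_paths_per_minute(
    points: Iterable[Tuple[int, str]], top_n: int = 3
) -> Iterator[Tuple[int, List[Tuple[str, int]]]]:
    """Yield (bucket_start, [(path, count), ...]) for each one-minute bucket.

    Assumption: points are (unix_seconds, request_path) pairs sorted by time,
    so a bucket can be flushed as soon as a point from a later bucket arrives.
    """
    current_bucket = None
    counts: Counter = Counter()
    for ts, path in points:
        bucket = ts - ts % BUCKET_SECONDS
        if current_bucket is not None and bucket != current_bucket:
            # The previous bucket is complete: emit it and free its state.
            yield current_bucket, counts.most_common(top_n)
            counts = Counter()
        current_bucket = bucket
        counts[path] += 1
    if current_bucket is not None:
        yield current_bucket, counts.most_common(top_n)
```

Memory use here scales with the number of distinct paths in a single bucket rather than across the whole query range, which is the property the proposal relies on. If points were not time-ordered, every open bucket would have to stay in memory until the input ends.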

jsternberg · 2016-08-25T16:25:47Z

I don't think this is possible just by the nature of how grouping works. There isn't a way to efficiently group by fields within the query engine without storing all of the returned points in memory and performing a massive amount of sorting. I think the underlying problem is the database index for tags and we are currently working on that in #7151.
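The cost described above can be illustrated with a small Python sketch. It is not how the InfluxDB query engine works; `group_by_field` and the dict-shaped points are hypothetical. It only shows why grouping on an unindexed value means materializing and sorting every returned point before the first group can be emitted.

```python
from itertools import groupby
from operator import itemgetter


def group_by_field(points, field):
    """Group dict-shaped points by the value stored under `field`.

    Unlike tag grouping, nothing pre-partitions the input by this value,
    so the whole result set is buffered and sorted first.
    """
    buffered = list(points)               # every point is held in memory
    buffered.sort(key=itemgetter(field))  # O(n log n) sort on the field value
    return {
        key: list(group)
        for key, group in groupby(buffered, key=itemgetter(field))
    }
```

A hash-based variant would avoid the sort but would still need state for every distinct field value seen so far, so the memory concern remains unless the input is bounded in some other way.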

@jwilder do you think this should be closed in favor of #7151?

jwilder · 2016-08-25T16:33:26Z

@jsternberg I don't see how #7151 is related to this. #7151 is about removing the in-memory index off of the heap.

jsternberg · 2016-08-25T16:37:59Z

The feature request mentioned that the current in-memory index causes problems and makes high-cardinality data impossible to use. #7151 is for making high-cardinality data perform better, so I figured that it would invalidate the need for this. I didn't notice that it also mentioned the new proposal, but I'm not sure it really matters since I don't know if we would be able to efficiently group by field data.

wladekb · 2016-09-16T08:52:48Z

This would also help us in a slightly different use case. We load raw pageload performance data along with metadata into a short-lived measurement. We then have a set of continuous queries that aggregate the data along different axes, e.g. one that generates a single aggregate, another that groups by country, and so on.

In the current InfluxDB version we need to know in advance which columns may be used for grouping so that they are loaded as tags. Moreover, starting to use a new column as a dimension requires us to modify the load process so that the column is written as a tag. At the same time, we want to keep the number of tags low to avoid generating an enormous number of series.

I could roughly summarize this use case as a mini-Hadoop, but with better response time and flexibility.
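As a rough illustration of that constraint, the Python sketch below shows a loader that has to decide up front which columns become tags. The names `TAG_COLUMNS`, `to_line_protocol`, and `format_field`, and the example columns, are hypothetical and not taken from the actual load process. The output follows the general shape of InfluxDB line protocol (`measurement,tags fields timestamp`), with escaping of spaces and commas left out for brevity.

```python
TAG_COLUMNS = {"country"}  # hypothetical: columns we expect to GROUP BY


def format_field(value):
    """Render a field value: integers get an `i` suffix, strings are quoted."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    return '"' + str(value).replace('"', '\\"') + '"'


def to_line_protocol(measurement, row, timestamp_ns, tag_columns=TAG_COLUMNS):
    """Split a row into tags and fields and render one line.

    Simplification: tag keys and values are not escaped here.
    """
    tags = sorted((k, v) for k, v in row.items() if k in tag_columns)
    fields = sorted((k, v) for k, v in row.items() if k not in tag_columns)
    tag_part = "".join(f",{k}={v}" for k, v in tags)
    field_part = ",".join(f"{k}={format_field(v)}" for k, v in fields)
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"
```

With this shape, grouping by a column that was loaded as a field (say `browser`) means changing `tag_columns` and reloading the data, and each column promoted to a tag multiplies the number of series.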

ryanmills · 2016-12-07T19:57:58Z

Given that we can't update tags, +1 on this so that we can GROUP BY fields at a later stage as the schema changes.

nathanielc · 2018-10-01T20:42:25Z

Added difficulty/high because the current mechanics around grouping leverage the index/cursors and there are no mechanisms to group by anything else. Flux is already capable of grouping by fields and time.

stale · 2019-07-23T23:33:59Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2019-07-31T00:29:20Z

This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions.

jwilder added the area/queries, kind/feature-request, and area/influxql labels Aug 25, 2016

clarkj mentioned this issue Feb 6, 2017, in "DISTINCT does not return correct values" #6615 (closed)

nathanielc added the flux/triaged label Feb 1, 2018

nathanielc added the difficulty/high label Oct 1, 2018

dgnorton added the 1.x label Jan 7, 2019

stale bot added the wontfix label Jul 23, 2019

stale bot closed this as completed Jul 31, 2019
