Feature: GROUP BY <field> #7200
Comments
I don't think this is possible just by the nature of how grouping works. There isn't a way to efficiently group by fields within the query engine without storing all of the returned points in memory and performing a massive amount of sorting. I think the underlying problem is the database index for tags and we are currently working on that in #7151. @jwilder do you think this should be closed in favor of #7151?
@jsternberg I don't see how #7151 is related to this. #7151 is about removing the in-memory index off of the heap.
The feature request mentioned how the current in-memory index causes problems and it makes high cardinality data impossible to use. #7151 is for making high cardinality data perform better so I figured that it would invalidate the need for this. I didn't notice that it also mentioned the new proposal, but I'm not sure it really matters since I don't know if we would be able to efficiently group by field data.
This would also help us in a slightly different use case. We load raw pageload performance data along with metadata into a short-lived measurement. We then have a set of continuous queries that aggregate the data along different axes, e.g. one that generates a single aggregate, another that groups by country, and so on. In the current InfluxDB version we need to know in advance which columns may be used for grouping so that they are loaded as tags. Moreover, starting to use a new column as a dimension requires us to modify the load process so that the column is written as a tag. At the same time we want to keep the number of tags low to avoid generating an enormous number of series. I could summarize this use case as a mini-Hadoop, but with better response time and flexibility.
Given that we can't update tags, +1 on this so that we can GROUP BY fields at a later stage as the schema changes
Added difficulty/high because the current mechanics around grouping leverage the index/cursors and there are no mechanisms to group by anything else. Flux is already capable of grouping by fields and time.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions.
I would like to propose the ability to GROUP BY fields.
Storing high cardinality data in tags currently blows up the database for a lot of people. Even with the new index proposal it could cause significant paging and performance degradation. Further, the domain may be unbounded and more efficiently stored and processed as field data.
I propose that InfluxDB support grouping data on field values during query processing. Even for unbounded, very high cardinality domains, the set of distinct values seen by any one query is bounded by the time window or time buckets, so stream processing techniques should handle this nicely.
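To make the "bounded by time bucket" argument concrete, here is a minimal sketch in Python rather than anything from the InfluxDB query engine. It assumes points arrive in time order, and the function name, the `Point` tuple layout, and the choice of a sum aggregate are all hypothetical. Under those assumptions only the currently open bucket's groups are held in memory.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical point layout: (unix_seconds, field_value_to_group_by, measured_value)
Point = Tuple[int, str, float]


def group_by_window_and_field(
    points: Iterable[Point], window_s: int = 60
) -> Iterator[Tuple[int, Dict[str, float]]]:
    """Yield (window_start, {field_value: sum}) for time-ordered points.

    Assumption: input is sorted by timestamp. Memory use is then limited to
    the distinct field values inside one window, not the whole domain.
    """
    current_start = None
    sums: Dict[str, float] = defaultdict(float)
    for ts, key, value in points:
        start = ts - ts % window_s  # align timestamp to its window boundary
        if current_start is not None and start != current_start:
            # The previous window is complete: emit it and free its groups.
            yield current_start, dict(sums)
            sums.clear()
        current_start = start
        sums[key] += value
    if current_start is not None:
        yield current_start, dict(sums)  # flush the last open window


points = [(0, "a", 1.0), (30, "b", 2.0), (59, "a", 3.0), (60, "a", 4.0)]
windows = list(group_by_window_and_field(points, window_s=60))
```

If the input is not time-ordered, this approach would need a sort or a map keyed by window first, which is the memory cost raised in the comments.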
Use Case
Analyzing access logs including the request path. The request path may include IDs and be technically unbounded. A person may wish to look at an hour's worth of requests grouped by
time(1m), request_path
to see the top requested paths by minute. Total request path cardinality would not exceed one hour's worth of points for a SELECT, or one minute's worth of points for a SELECT INTO. Backfilling this data with the path as a tag currently crashes InfluxDB.
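As a rough illustration of the result this use case is after, the following Python sketch counts requests per (minute, path) and keeps the top paths for each minute. It runs outside InfluxDB, and the function name and the `(unix_seconds, request_path)` input shape are assumptions made for the example.

```python
from collections import Counter, defaultdict


def top_paths_per_minute(requests, n=3):
    """Return {minute_start: [(path, count), ...]} with the n busiest paths.

    `requests` is assumed to be an iterable of (unix_seconds, request_path).
    """
    counts = defaultdict(Counter)
    for ts, path in requests:
        counts[ts - ts % 60][path] += 1  # bucket by minute, then by path
    return {minute: c.most_common(n) for minute, c in sorted(counts.items())}


requests = [(0, "/a/1"), (10, "/a/1"), (20, "/b/2"), (61, "/b/2")]
result = top_paths_per_minute(requests, n=3)
```

The number of counters here can never exceed the number of requests in the queried range, which matches the cardinality bound stated above.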