
[feature request] MIN(time), MAX(time), FIRST(time), LAST(time), query only time field #5793

Closed
ericremoreynolds opened this issue Feb 23, 2016 · 38 comments

Comments

@ericremoreynolds

These all return null when I tried them, but it would be very useful to be able to query for these.

Also "at least 1 non-time field must be queried" seems like a strange limitation.

@dx034

dx034 commented Mar 9, 2016

+1
I need to keep statistics to check data quality; min_time and max_time are crucial for that.

@jsternberg
Contributor

@ericremoreynolds can you expand? Can you give an example of finding the maximum time? I think the maximum time is just last() of some point. I think #5926 might be what you need.

@ericremoreynolds
Author

I guess you might want to do a query like

SELECT min(time) FROM whatever_measurement WHERE some_condition

without having to explicitly reference some non-time column in the select clause.

I don't know if that makes sense?

@jsternberg
Contributor

Wouldn't the minimum time just be the first time for a field? In situations where you don't have a GROUP BY time(...), you can do this to find the minimum time of a field.

SELECT first(value) FROM cpu

@ericremoreynolds
Author

ericremoreynolds commented Mar 15, 2016 via email

@jsternberg
Contributor

@pauldix what do you think about this? I've seen a few other issues where people don't really seem to care about which field they're selecting on. While it will be slower than a single column access, do we want to try and support this kind of query?

@dx034

dx034 commented Mar 15, 2016

@jsternberg Thanks for considering this. But why should the query be slower? In the worst case, a "SELECT MIN(dt) FROM cpu" would perform exactly the same internal work as "SELECT FIRST(value) FROM cpu", which also returns the timestamp. But since the value is not returned, it should theoretically even be (marginally) faster (fewer values to transfer and thus less encoding/decoding of results). It would also make the code more readable, since the query states exactly what we want to select.

So without knowing a lot about the internal handling of queries, I would guess that converting MIN(dt),MAX(dt) to equivalent/existing FIRST/LAST queries should not result in additional overhead or performance loss.

@pauldix
Member

pauldix commented Mar 15, 2016

@plinyGo because InfluxDB is a columnar database. So to do this you'd be hitting every field and then taking the aggregates against those. Because we don't store these statistics as a whole, you'd end up doing range scans for every field under that measurement.

@jsternberg doing this as a query would be a horrible experience because of this. The only tractable way I can think of to do this would be to keep these summary statistics, but we don't have anything on the roadmap for that yet.

I think it's difficult enough to implement that we'd really need to think about what the need is here and how we could engineer this to be performant.

@toddiuszho

I want to measure lag time itself, so having time, aggregates of time, and time arithmetic expressions in my SELECT clause all make sense. There is literally no other field I'm interested in. Completely different visualizations will concern themselves with the other fields. But my use case is visualizing the change in lag over time.

So how do I even get lag? Let's say I have several sensors as clients that report measured fields and their metadata tags once per minute. However, if there are memory, computational, or network latency contentions, my process puts backpressure on chiming in to Influx. The result for a sensor might look like reporting in at 1:02:03, 1:03:04, 1:06:11, 1:08:45, 1:18:12, 11:19:13, etc.

My console wants to visualize lag over time and alert when any sensor is behind by more than 15 minutes. A SingleStat in Grafana would be nice. Green when <10m, Orange when >=10m, and Red when >=15m. When my alarm goes off, a nice thing would be a graph to see if lag on that sensor is spiky vs. creeping up over time vs. how it fares vs. other sensors. This can help me differentiate between a troublesome sensor vs. network-to-Influx issues for all sensors.

What I don't want is the sensor itself querying Influx, doing the time logic, and then reporting those values into a different measurement to Influx. The lag detection has to be decoupled from the sensor itself. I can't detect a troublesome sensor if I'm depending on it to reliably report its own lag! I can have a dedicated client/process do this for all sensors on their behalf, but I'm having trouble justifying this when I feel Grafana or a Continuous Query should be able to do this.

What I need is: SELECT last(now() - time) FROM sensors WHERE sensor_id =~ /$sensor_id$/ AND time < now() - 4h

This lets me Template and Repeat on sensor_id.

I can't do SELECT last(some_field) FROM sensors ... because I want to visualize the value of the time difference (lag size/duration) across time, not the value of some_field. You can't cheat and declare victory just because there's a timestamp somewhere in the JSON result. I need the difference returned as a first-class citizen so tools can visualize on it.
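Until something like this is supported server-side, the lag check described above can be approximated client-side: query the last point per sensor, then compute now() minus its timestamp. A minimal sketch in Python (the thresholds mirror the SingleStat colours above; the timestamp format assumed is the RFC3339 form the InfluxDB HTTP API returns):

```python
from datetime import datetime, timezone

def lag_seconds(last_point_time, now=None):
    """Seconds between the last reported point and 'now'.

    last_point_time is an RFC3339 timestamp string as returned by the
    InfluxDB HTTP API, e.g. "2016-03-15T01:02:03Z".
    """
    ts = datetime.strptime(last_point_time, "%Y-%m-%dT%H:%M:%SZ")
    ts = ts.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - ts).total_seconds()

def lag_status(seconds):
    """Grafana-style thresholds: green < 10m, orange >= 10m, red >= 15m."""
    if seconds >= 15 * 60:
        return "red"
    if seconds >= 10 * 60:
        return "orange"
    return "green"
```

For example, a point last seen at 1:02:03 and checked at 1:14:03 gives a lag of 720 seconds, i.e. orange.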

@TaylanT

TaylanT commented Jul 14, 2016

+1

@dondublon

We really need this type of query. The inability to get the primary key is absurd.

@joelmarty

+1

@pauldix
Member

pauldix commented Dec 19, 2016

I'm going to close this out. There is no time without a corresponding field. Under the covers Influx is a columnar database where a column is:

measurement,tagset,field

# for example
cpu,host=serverA,region=west,usage_user

That combination represents a series or a column. Within that range are value, time tuples. This isn't like a row oriented database where time is stored as a column value per row. Time is stored for each series (or column).

Thus it makes no sense to grab the time without a field. If you want to get the first or last time, which is equivalent to min and max respectively, do it by asking for a specific field like this:

-- first value and time returned
select usage_user from cpu limit 1
-- last value and time returned
select usage_user from cpu order by time desc limit 1

You can also add a where time > ... and time < ... clause to find the first and last times in a given range. So you can already get the summary statistics you're looking for.

@pauldix pauldix closed this as completed Dec 19, 2016
@DaRoli

DaRoli commented Aug 10, 2017

Please reconsider the decision to close this issue:
How can I get the min(time) and max(time) if I query a time range with grouping, e.g. GROUP BY time(1d)?
The query result always uses the start of the day as the returned timestamp of the daily aggregated entries, so I have no way to know the time of the first and last items within each aggregation.
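As a stopgap, the first and last timestamps per day can be recovered client-side from the raw (un-grouped) points. A sketch in Python, assuming the timestamps come back as RFC3339 UTC strings, whose lexicographic order equals chronological order:

```python
from collections import defaultdict

def first_last_per_day(timestamps):
    """Map each calendar day to the (first, last) timestamps seen that
    day -- the information GROUP BY time(1d) truncates away.

    timestamps are RFC3339 UTC strings, e.g. "2017-08-10T06:01:00Z".
    """
    buckets = defaultdict(list)
    for ts in timestamps:
        buckets[ts[:10]].append(ts)  # "YYYY-MM-DD" prefix is the day
    # min/max on same-format RFC3339 UTC strings is chronological order
    return {day: (min(pts), max(pts)) for day, pts in buckets.items()}
```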

@mdaliyan

mdaliyan commented Apr 11, 2018

Is this issue still open? I need to know the period of time covered in a group by query

select min(time), max(time), sum(item_price) as price, sum(view) as view from statistics group by item_invoice

But it doesn't sum the price or return the max and min of time!

name: statistics
tags: item_invoice=
time min max price view
---- --- --- ----- ----
0             7000     83

name: statistics
tags: item_invoice=067a1302-bac4-42cb-409f-c047b14417c3
time min max price view
---- --- --- ----- ----
0              2500    7

name: statistics
tags: item_invoice=cb9ed298-e355-4f6f-52ba-41b15f4afca7
time min max price view
---- --- --- ----- ----
0               6500   4

Edit: I had to add a timestamp to the values to get min and max working! (not the best way to achieve a simple thing)

@rickalm

rickalm commented May 9, 2018

I have a similar issue relating to time: when doing a time-bounded query, I don't want the timestamp of the start of the timeframe; I want to select which timestamp (first/last) of the series gets reported.

select last(value) from table where time >= now() - 3h group by "host"

will always give me a time of now-3h. It's not showing me the first or last timestamp; it's repeating the now()-3h value for each host. When looking at the data programmatically, this does nothing to help understand the true timeframe the data represents.

@candlerb

@rickalm: the query you show works for me (influxdb 1.5.2, Ubuntu 16.04)

In the following example, I stopped collecting SNMP data from a particular host by IP address (10.12.255.1) and started collecting it by name instead (gw.example.com), so the "agent_host" tag changed.

> select last(ifHCInOctets) from interface where time >= now() - 3h group by agent_host
name: interface
tags: agent_host=10.12.255.1
time                 last
----                 ----
2018-05-26T14:35:40Z 41823387483

name: interface
tags: agent_host=gw.example.com
time                 last
----                 ----
2018-05-26T15:09:40Z 41827702059

name: interface
tags: agent_host=sw1.example.com
time                 last
----                 ----
2018-05-26T15:08:41Z 0

The switchover was made about 40 minutes ago. If I reduce the time range then that host drops out completely:

> select last(ifHCInOctets) from interface where time >= now() - 30m group by agent_host
name: interface
tags: agent_host=gw.example.com
time                 last
----                 ----
2018-05-26T15:09:40Z 41827702059

name: interface
tags: agent_host=sw1.example.com
time                 last
----                 ----
2018-05-26T15:08:41Z 20981708378

In your case, maybe there are additional time series which are messing up the query. I suggest you try group by * instead of group by "host"; this may show you additional time series that you weren't expecting. group by "host" will give you the last() entry across all time series which have the same "host" tag.

@cby016

cby016 commented Dec 11, 2018

+1

@Tonsofattraction

just use Clickhouse

@chuckination

Having the same issue as well with aggregate queries. Without being able to get the min or max of the timestamps (or even just the first or last of them), I would need to do the aggregation manually outside of influx to preserve that data.

As it is now, aggregate queries completely mangle the original timestamps:

> select last(time), count(count) as count from count_progress_list where time >= 1564689300000000000 and time <= 1564689600000000000
name: count_progress_list
-------------------------
time                    last    count
1564689300000000000             1281

@ccucumber

I have a problem of a similar nature.
I want to watch for the first datapoint occurrence in a time range, to calculate e.g. when the user switched the monitoring device on today.
The returned value could be the number of minutes between the start of the time range and the timestamp of the first captured datapoint.
A solution might involve ELAPSED(), but both of these seem to be impossible:

  • calling ELAPSED on a series having only one point
  • combining two single-point series (merging FIRST() of the series and FIRST() of the time range filled with zeros)

I think a lot of people have similar problems.

@JC-85

JC-85 commented Aug 13, 2019

I would also like to be able to work with the min and max of the timestamps of a query.

I use Influx to log statistics from machine lines in manufacturing, such as machine cycle time per minute. The KPIs are machine speed over time, total machine cycles per day and total run-time per day, with the results displayed on a Grafana dashboard.

Getting the total machine cycles is fairly easy, but I can't figure out a good way to display the time span between the first and last registrations per day. Maybe working directly with Influx isn't the best approach; I could pull the data into a SQL database and then do the calculations from there.
But I would much rather be able to do proper statistical analysis directly on the measurements.

@ccucumber

ccucumber commented Aug 13, 2019

I also make statistics in manufacturing :-)
I think the solution that follows the spirit of InfluxQL most closely is extending the ELAPSED() function, allowing calculation of the time elapsed not only from datapoint to datapoint, but also from the start of the time range to a datapoint and from a datapoint to the end of the time range.

A parameter could be used for this - one is already used to define the time unit, so why not add a second one? It might even be the default behaviour of ELAPSED() when a single data point is passed to it.
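Pending such an extension, the proposed start-of-range-to-first-point behaviour is easy to compute client-side once the first point's timestamp is known. A hypothetical helper, sketched in Python with epoch seconds:

```python
def elapsed_from_range_start(range_start, timestamps):
    """Minutes from the start of the queried time range to the first
    datapoint -- the single-point ELAPSED() behaviour proposed above.

    range_start and timestamps are epoch seconds.
    """
    if not timestamps:
        return None  # no datapoints fell inside the range
    return (min(timestamps) - range_start) / 60.0
```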

@thepycoder

+1

1 similar comment
@Kevin-Mc-Callister

+1

@amfasis

amfasis commented Nov 29, 2019

Use case: I want to be able to select the first and last timestamps per day of my solar panel measurements. The panels obviously don't produce at night, and thus the converter shuts off and stops sending data. For checking availability of the data through a cron-job I want to know at what time the solar panel switches on for each day in the year (and then using this to do some sinus-fitting).

I find it interesting behavior that the time is truncated to the start of the group, although I can see there are arguments for this. Maybe it would be possible to have two time columns? One for the start of the group and one for the actually selected data point (be it first(), last(), min(), max() or perhaps others).

For my use case I found a bit of a workaround, valid only when there is a data column which is monotonically increasing within the time group (for example the kWh produced that day): use the bottom(field_key, 1) or top(field_key, 1) functions. Sharing for those interested!
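For reference, the workaround might look like this in InfluxQL (a sketch; the measurement name solar and the increasing field energy_kwh are hypothetical):

```sql
-- bottom()/top() keep the selected point's own timestamp instead of
-- truncating it to the start of the GROUP BY time() bucket
SELECT bottom(energy_kwh, 1) FROM solar GROUP BY time(1d)  -- first point per day (switch-on time)
SELECT top(energy_kwh, 1) FROM solar GROUP BY time(1d)     -- last point per day (switch-off time)
```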

@let4be

let4be commented Jan 22, 2020

any update?...

@meyers007

I really think minimum stored time and maximum stored time (or first and last) are very much required for a time series database, without having to know the field name - it is quite frustrating that this feature doesn't get the HIGHEST priority. This is the reason I switched to Prometheus.

@rasoul707

any solution?

@loblab

loblab commented Aug 17, 2020

+1
Need first(time) ~ last(time) group by tag

@heinzlea-eca

+1

@Swantu

Swantu commented Feb 21, 2021

+1

@toddiuszho

> I'm going to close this out. There is no time without a corresponding field. Under the covers Influx is a columnar database where a column is:
>
> measurement,tagset,field
>
> # for example
> cpu,host=serverA,region=west,usage_user
>
> That combination represents a series or a column. Within that range are value, time tuples. This isn't like a row oriented database where time is stored as a column value per row. Time is stored for each series (or column).
>
> Thus it makes no sense to grab the time without a field. If you want to get the first or last time, which is equivalent to min and max respectively, do it by asking for a specific field like this:
>
> -- first value and time returned
> select usage_user from cpu limit 1
> -- last value and time returned
> select usage_user from cpu order by time desc limit 1
>
> You can also add a where time > ... and time < ... clause to find the first and last times in a given range. So you can already get the summary statistics you're looking for.

With my query SELECT last(now() - time) FROM sensors WHERE sensor_id =~ /$sensor_id$/ AND time < now() - 4h I am indeed using a non-time field to query with!! The WHERE already efficiently pinpointed exactly what data to fetch. Yes, even by column store standards. Fetching is done. I've already satisfied the "at least 1 non-time field must be queried" requirement. I'm asking in the SELECT clause to whittle the result set down some before sending back to the client.

@theGOTOguy

For anyone else who finds this in 2022, this is actually pretty straightforward if you are able to use Flux to query rather than InfluxQL. We have a number of IoT devices writing to buckets and we cannot rely on any particular device supporting a specific measurement. I can pull the timestamp of the most recent measurement with:

from(bucketID:"your_bucket_id_here") |> range(start: 0) |> sort(columns: ["_time"], desc: true) |> limit(n:1)

@max0x7ba

max0x7ba commented Apr 29, 2022

> For anyone else who finds this in 2022, this is actually pretty straightforward if you are able to use Flux to query rather than InfluxQL. We have a number of IoT devices writing to buckets and we cannot rely on any particular device supporting a specific measurement. I can pull the timestamp of the most recent measurement with:
>
> from(bucketID:"your_bucket_id_here") |> range(start: 0) |> sort(columns: ["_time"], desc: true) |> limit(n:1)

Your query reports a wrong value once the number of records becomes large enough; your range(start: 0) only hides the problem. Look up the group and last functions.
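A hedged sketch of what that might look like in Flux (the bucket name is a placeholder; group() with no arguments merges every series into one table, so last() then picks the single newest record):

```flux
from(bucket: "your_bucket")
  |> range(start: 0)
  |> group()                 // drop all grouping keys: one table across all series
  |> last(column: "_time")   // newest record overall; its _time is max(time)
```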

@DanaJomar

+1

1 similar comment
@Duocervisia

+1

@earwin

earwin commented Oct 18, 2024

If someone still cares about this in 2024, do yourself a favor, check out Clickhouse.
