Skip to content

Commit

Permalink
SQL: use calendar interval of 1y instead of fixed interval for groupi…
Browse files Browse the repository at this point in the history
…ng by YEAR and HISTOGRAMs (#47558)

(cherry picked from commit 55f5463)
  • Loading branch information
astefan committed Oct 9, 2019
1 parent 54c2aec commit 75a7daa
Show file tree
Hide file tree
Showing 9 changed files with 180 additions and 130 deletions.
10 changes: 10 additions & 0 deletions docs/reference/sql/functions/grouping.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,16 @@ the multiple of a day. E.g.: for `HISTOGRAM(CAST(birth_date AS DATE), INTERVAL '
actually used will be `INTERVAL '2' DAY`. If the interval specified is less than 1 day, e.g.:
`HISTOGRAM(CAST(birth_date AS DATE), INTERVAL '20' HOUR)` then the interval used will be `INTERVAL '1' DAY`.

[IMPORTANT]
All intervals specified for a date/time HISTOGRAM will use a <<search-aggregations-bucket-datehistogram-aggregation,fixed interval>>
in their `date_histogram` aggregation definition, with the notable exception of `INTERVAL '1' YEAR` where a calendar interval is used.
The choice for a calendar interval was made for having a more intuitive result for YEAR groupings. Calendar intervals consider a one year
bucket as the one starting on January 1st that specific year, whereas a fixed interval one-year-bucket considers one year as a number
of milliseconds (for example, `31536000000ms` corresponding to 365 days, 24 hours per day, 60 minutes per hour etc.). With fixed intervals,
the day of February 5th, 2019 for example, belongs to a bucket that starts on December 20th, 2018 and {es} (and implicitly {es-sql}) would
have returned the year 2018 for a date that's actually in 2019. With calendar interval this behavior is more intuitive, having the day of
February 5th, 2019 actually belonging to the 2019 year bucket.

[IMPORTANT]
Histogram in SQL cannot be applied applied on **TIME** type.
E.g.: `HISTOGRAM(CAST(birth_date AS TIME), INTERVAL '10' MINUTES)` is currently not supported.
103 changes: 51 additions & 52 deletions x-pack/plugin/sql/qa/src/main/resources/agg.csv-spec
Original file line number Diff line number Diff line change
Expand Up @@ -273,47 +273,46 @@ histogramDateTime
schema::h:ts|c:l
SELECT HISTOGRAM(birth_date, INTERVAL 1 YEAR) AS h, COUNT(*) as c FROM test_emp GROUP BY h;

h | c
--------------------+---------------
null |10
1951-04-11T00:00:00Z|1
1952-04-05T00:00:00Z|10
1953-03-31T00:00:00Z|10
1954-03-26T00:00:00Z|7
1955-03-21T00:00:00Z|4
1956-03-15T00:00:00Z|4
1957-03-10T00:00:00Z|6
1958-03-05T00:00:00Z|6
1959-02-28T00:00:00Z|9
1960-02-23T00:00:00Z|7
1961-02-17T00:00:00Z|8
1962-02-12T00:00:00Z|6
1963-02-07T00:00:00Z|7
1964-02-02T00:00:00Z|5

h | c
------------------------+---------------
null |10
1952-01-01T00:00:00.000Z|8
1953-01-01T00:00:00.000Z|11
1954-01-01T00:00:00.000Z|8
1955-01-01T00:00:00.000Z|4
1956-01-01T00:00:00.000Z|5
1957-01-01T00:00:00.000Z|4
1958-01-01T00:00:00.000Z|7
1959-01-01T00:00:00.000Z|9
1960-01-01T00:00:00.000Z|8
1961-01-01T00:00:00.000Z|8
1962-01-01T00:00:00.000Z|6
1963-01-01T00:00:00.000Z|7
1964-01-01T00:00:00.000Z|4
1965-01-01T00:00:00.000Z|1
;

histogramDateTimeWithCountAndOrder
schema::h:ts|c:l
SELECT HISTOGRAM(birth_date, INTERVAL 1 YEAR) AS h, COUNT(*) as c FROM test_emp GROUP BY h ORDER BY h DESC;

h | c
--------------------+---------------
1964-02-02T00:00:00Z|5
1963-02-07T00:00:00Z|7
1962-02-12T00:00:00Z|6
1961-02-17T00:00:00Z|8
1960-02-23T00:00:00Z|7
1959-02-28T00:00:00Z|9
1958-03-05T00:00:00Z|6
1957-03-10T00:00:00Z|6
1956-03-15T00:00:00Z|4
1955-03-21T00:00:00Z|4
1954-03-26T00:00:00Z|7
1953-03-31T00:00:00Z|10
1952-04-05T00:00:00Z|10
1951-04-11T00:00:00Z|1
null |10
h | c
------------------------+---------------
1965-01-01T00:00:00.000Z|1
1964-01-01T00:00:00.000Z|4
1963-01-01T00:00:00.000Z|7
1962-01-01T00:00:00.000Z|6
1961-01-01T00:00:00.000Z|8
1960-01-01T00:00:00.000Z|8
1959-01-01T00:00:00.000Z|9
1958-01-01T00:00:00.000Z|7
1957-01-01T00:00:00.000Z|4
1956-01-01T00:00:00.000Z|5
1955-01-01T00:00:00.000Z|4
1954-01-01T00:00:00.000Z|8
1953-01-01T00:00:00.000Z|11
1952-01-01T00:00:00.000Z|8
null |10
;

histogramDateTimeWithMonthOnTop
Expand Down Expand Up @@ -369,23 +368,23 @@ histogramGroupByWithoutAlias
schema::h:ts|c:l
SELECT HISTOGRAM(birth_date, INTERVAL 1 YEAR) AS h, COUNT(*) as c FROM test_emp GROUP BY HISTOGRAM(birth_date, INTERVAL 1 YEAR) ORDER BY h DESC;

h | c
--------------------+---------------
1964-02-02T00:00:00Z|5
1963-02-07T00:00:00Z|7
1962-02-12T00:00:00Z|6
1961-02-17T00:00:00Z|8
1960-02-23T00:00:00Z|7
1959-02-28T00:00:00Z|9
1958-03-05T00:00:00Z|6
1957-03-10T00:00:00Z|6
1956-03-15T00:00:00Z|4
1955-03-21T00:00:00Z|4
1954-03-26T00:00:00Z|7
1953-03-31T00:00:00Z|10
1952-04-05T00:00:00Z|10
1951-04-11T00:00:00Z|1
null |10
h | c
------------------------+---------------
1965-01-01T00:00:00.000Z|1
1964-01-01T00:00:00.000Z|4
1963-01-01T00:00:00.000Z|7
1962-01-01T00:00:00.000Z|6
1961-01-01T00:00:00.000Z|8
1960-01-01T00:00:00.000Z|8
1959-01-01T00:00:00.000Z|9
1958-01-01T00:00:00.000Z|7
1957-01-01T00:00:00.000Z|4
1956-01-01T00:00:00.000Z|5
1955-01-01T00:00:00.000Z|4
1954-01-01T00:00:00.000Z|8
1953-01-01T00:00:00.000Z|11
1952-01-01T00:00:00.000Z|8
null |10
;

countAll
Expand Down
34 changes: 17 additions & 17 deletions x-pack/plugin/sql/qa/src/main/resources/docs/docs.csv-spec
Original file line number Diff line number Diff line change
Expand Up @@ -811,23 +811,23 @@ schema::h:ts|c:l
SELECT HISTOGRAM(birth_date, INTERVAL 1 YEAR) AS h, COUNT(*) AS c FROM emp GROUP BY h;


h | c
--------------------+---------------
null |10
1951-04-11T00:00:00Z|1
1952-04-05T00:00:00Z|10
1953-03-31T00:00:00Z|10
1954-03-26T00:00:00Z|7
1955-03-21T00:00:00Z|4
1956-03-15T00:00:00Z|4
1957-03-10T00:00:00Z|6
1958-03-05T00:00:00Z|6
1959-02-28T00:00:00Z|9
1960-02-23T00:00:00Z|7
1961-02-17T00:00:00Z|8
1962-02-12T00:00:00Z|6
1963-02-07T00:00:00Z|7
1964-02-02T00:00:00Z|5
h | c
------------------------+---------------
null |10
1952-01-01T00:00:00.000Z|8
1953-01-01T00:00:00.000Z|11
1954-01-01T00:00:00.000Z|8
1955-01-01T00:00:00.000Z|4
1956-01-01T00:00:00.000Z|5
1957-01-01T00:00:00.000Z|4
1958-01-01T00:00:00.000Z|7
1959-01-01T00:00:00.000Z|9
1960-01-01T00:00:00.000Z|8
1961-01-01T00:00:00.000Z|8
1962-01-01T00:00:00.000Z|6
1963-01-01T00:00:00.000Z|7
1964-01-01T00:00:00.000Z|4
1965-01-01T00:00:00.000Z|1

// end::histogramDateTime
;
Expand Down
36 changes: 18 additions & 18 deletions x-pack/plugin/sql/qa/src/main/resources/math.csv-spec
Original file line number Diff line number Diff line change
Expand Up @@ -101,31 +101,31 @@ SELECT MIN(salary) mi, MAX(salary) ma, YEAR(hire_date) year, ROUND(AVG(languages

mi:i | ma:i | year:i |ROUND(AVG(languages), 1):d|TRUNCATE(AVG(languages), 1):d| COUNT(*):l
---------------+---------------+---------------+--------------------------+-----------------------------+---------------
25324 |70011 |1986 |3.0 |3.0 |15
25945 |73578 |1987 |2.9 |2.8 |9
25976 |74970 |1988 |3.0 |3.0 |13
31120 |71165 |1989 |3.1 |3.0 |12
30404 |58715 |1992 |3.0 |3.0 |3
35742 |67492 |1993 |2.8 |2.7 |4
45656 |45656 |1995 |3.0 |3.0 |1
25324 |70011 |1987 |3.0 |3.0 |15
25945 |73578 |1988 |2.9 |2.8 |9
25976 |74970 |1989 |3.0 |3.0 |13
31120 |71165 |1990 |3.1 |3.0 |12
30404 |58715 |1993 |3.0 |3.0 |3
35742 |67492 |1994 |2.8 |2.7 |4
45656 |45656 |1996 |3.0 |3.0 |1
;

minMaxRoundWithHavingRound
SELECT MIN(salary) mi, MAX(salary) ma, YEAR(hire_date) year, ROUND(AVG(languages),1), COUNT(*) FROM test_emp GROUP BY YEAR(hire_date) HAVING ROUND(AVG(languages), 1) > 2.5 ORDER BY YEAR(hire_date);

mi:i | ma:i | year:i |ROUND(AVG(languages),1):d| COUNT(*):l
---------------+---------------+---------------+-------------------------+---------------
26436 |74999 |1984 |3.1 |11
31897 |61805 |1985 |3.5 |11
25324 |70011 |1986 |3.0 |15
25945 |73578 |1987 |2.9 |9
25976 |74970 |1988 |3.0 |13
31120 |71165 |1989 |3.1 |12
32568 |65030 |1990 |3.3 |6
27215 |60781 |1991 |4.1 |8
30404 |58715 |1992 |3.0 |3
35742 |67492 |1993 |2.8 |4
45656 |45656 |1995 |3.0 |1
26436 |74999 |1985 |3.1 |11
31897 |61805 |1986 |3.5 |11
25324 |70011 |1987 |3.0 |15
25945 |73578 |1988 |2.9 |9
25976 |74970 |1989 |3.0 |13
31120 |71165 |1990 |3.1 |12
32568 |65030 |1991 |3.3 |6
27215 |60781 |1992 |4.1 |8
30404 |58715 |1993 |3.0 |3
35742 |67492 |1994 |2.8 |4
45656 |45656 |1996 |3.0 |1
;

groupByAndOrderByTruncateWithPositiveParameter
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,13 @@ public abstract class DateTimeHistogramFunction extends DateTimeFunction {
}

/**
* used for aggregration (date histogram)
* used for aggregation (date histogram)
*/
public abstract long interval();
public long fixedInterval() {
return -1;
}

public String calendarInterval() {
return null;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -5,20 +5,20 @@
*/
package org.elasticsearch.xpack.sql.expression.function.scalar.datetime;

import org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramInterval;
import org.elasticsearch.xpack.sql.expression.Expression;
import org.elasticsearch.xpack.sql.expression.function.scalar.datetime.DateTimeProcessor.DateTimeExtractor;
import org.elasticsearch.xpack.sql.tree.Source;
import org.elasticsearch.xpack.sql.tree.NodeInfo.NodeCtor2;
import org.elasticsearch.xpack.sql.tree.Source;

import java.time.ZoneId;
import java.util.concurrent.TimeUnit;

/**
* Extract the year from a datetime.
*/
public class Year extends DateTimeHistogramFunction {

private static long YEAR_IN_MILLIS = TimeUnit.DAYS.toMillis(1) * 365L;
public static String YEAR_INTERVAL = DateHistogramInterval.YEAR.toString();

public Year(Source source, Expression field, ZoneId zoneId) {
super(source, field, zoneId, DateTimeExtractor.YEAR);
Expand All @@ -45,7 +45,7 @@ public Expression orderBy() {
}

@Override
public long interval() {
return YEAR_IN_MILLIS;
public String calendarInterval() {
return YEAR_INTERVAL;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -41,9 +41,11 @@
import org.elasticsearch.xpack.sql.expression.function.scalar.ScalarFunction;
import org.elasticsearch.xpack.sql.expression.function.scalar.datetime.DateTimeFunction;
import org.elasticsearch.xpack.sql.expression.function.scalar.datetime.DateTimeHistogramFunction;
import org.elasticsearch.xpack.sql.expression.function.scalar.datetime.Year;
import org.elasticsearch.xpack.sql.expression.function.scalar.geo.GeoShape;
import org.elasticsearch.xpack.sql.expression.function.scalar.geo.StDistance;
import org.elasticsearch.xpack.sql.expression.gen.script.ScriptTemplate;
import org.elasticsearch.xpack.sql.expression.literal.IntervalYearMonth;
import org.elasticsearch.xpack.sql.expression.literal.Intervals;
import org.elasticsearch.xpack.sql.expression.predicate.Range;
import org.elasticsearch.xpack.sql.expression.predicate.fulltext.MatchQueryPredicate;
Expand Down Expand Up @@ -109,6 +111,7 @@
import org.elasticsearch.xpack.sql.util.ReflectionUtils;

import java.time.OffsetTime;
import java.time.Period;
import java.time.ZonedDateTime;
import java.util.Arrays;
import java.util.LinkedHashMap;
Expand Down Expand Up @@ -279,7 +282,11 @@ static GroupingContext groupBy(List<? extends Expression> groupings) {
// dates are handled differently because of date histograms
if (exp instanceof DateTimeHistogramFunction) {
DateTimeHistogramFunction dthf = (DateTimeHistogramFunction) exp;
key = new GroupByDateHistogram(aggId, nameOf(exp), dthf.interval(), dthf.zoneId());
if (dthf.calendarInterval() != null) {
key = new GroupByDateHistogram(aggId, nameOf(exp), dthf.calendarInterval(), dthf.zoneId());
} else {
key = new GroupByDateHistogram(aggId, nameOf(exp), dthf.fixedInterval(), dthf.zoneId());
}
}
// all other scalar functions become a script
else if (exp instanceof ScalarFunction) {
Expand All @@ -294,19 +301,33 @@ else if (exp instanceof GroupingFunction) {

// date histogram
if (h.dataType().isDateBased()) {
long intervalAsMillis = Intervals.inMillis(h.interval());

// When the histogram in SQL is applied on DATE type instead of DATETIME, the interval
// specified is truncated to the multiple of a day. If the interval specified is less
// than 1 day, then the interval used will be `INTERVAL '1' DAY`.
if (h.dataType() == DATE) {
intervalAsMillis = DateUtils.minDayInterval(intervalAsMillis);
}

if (field instanceof FieldAttribute) {
key = new GroupByDateHistogram(aggId, nameOf(field), intervalAsMillis, h.zoneId());
} else if (field instanceof Function) {
key = new GroupByDateHistogram(aggId, ((Function) field).asScript(), intervalAsMillis, h.zoneId());
Object value = h.interval().value();
if (value instanceof IntervalYearMonth
&& ((IntervalYearMonth) value).interval().equals(Period.of(1, 0, 0))) {
String calendarInterval = Year.YEAR_INTERVAL;

// When the histogram is `INTERVAL '1' YEAR`, the interval used in the ES date_histogram will be
// a calendar_interval with value "1y". All other intervals will be fixed_intervals expressed in ms.
if (field instanceof FieldAttribute) {
key = new GroupByDateHistogram(aggId, nameOf(field), calendarInterval, h.zoneId());
} else if (field instanceof Function) {
key = new GroupByDateHistogram(aggId, ((Function) field).asScript(), calendarInterval, h.zoneId());
}
} else {
long intervalAsMillis = Intervals.inMillis(h.interval());

// When the histogram in SQL is applied on DATE type instead of DATETIME, the interval
// specified is truncated to the multiple of a day. If the interval specified is less
// than 1 day, then the interval used will be `INTERVAL '1' DAY`.
if (h.dataType() == DATE) {
intervalAsMillis = DateUtils.minDayInterval(intervalAsMillis);
}

if (field instanceof FieldAttribute) {
key = new GroupByDateHistogram(aggId, nameOf(field), intervalAsMillis, h.zoneId());
} else if (field instanceof Function) {
key = new GroupByDateHistogram(aggId, ((Function) field).asScript(), intervalAsMillis, h.zoneId());
}
}
}
// numeric histogram
Expand Down
Loading

0 comments on commit 75a7daa

Please sign in to comment.