From 4b6cfb376339efb2c65580f5bb7893fc1fa3fdc2 Mon Sep 17 00:00:00 2001 From: Sue Gallagher <36747279+Sue-Gallagher@users.noreply.github.com> Date: Wed, 31 Oct 2018 07:16:36 -0700 Subject: [PATCH] [DOCS] Add info on calendar vs fixed interval. (#31638) Extensive edit to add additional information on the difference between calendar intervals and fixed-length intervals. --- .../bucket/datehistogram-aggregation.asciidoc | 247 +++++++++++++----- 1 file changed, 185 insertions(+), 62 deletions(-) diff --git a/docs/reference/aggregations/bucket/datehistogram-aggregation.asciidoc b/docs/reference/aggregations/bucket/datehistogram-aggregation.asciidoc index 1d185e80f4f96..514528da0d0bd 100644 --- a/docs/reference/aggregations/bucket/datehistogram-aggregation.asciidoc +++ b/docs/reference/aggregations/bucket/datehistogram-aggregation.asciidoc @@ -1,12 +1,129 @@ [[search-aggregations-bucket-datehistogram-aggregation]] === Date Histogram Aggregation -A multi-bucket aggregation similar to the <> except it can -only be applied on date values. Since dates are represented in Elasticsearch internally as long values, it is possible -to use the normal `histogram` on dates as well, though accuracy will be compromised. The reason for this is in the fact -that time based intervals are not fixed (think of leap years and on the number of days in a month). For this reason, -we need special support for time based data. From a functionality perspective, this histogram supports the same features -as the normal <>. The main difference is that the interval can be specified by date/time expressions. +This multi-bucket aggregation is similar to the normal +<>, but it can +only be used with date values. Because dates are represented internally in +Elasticsearch as long values, it is possible, but not as accurate, to use the +normal `histogram` on dates as well. The main difference in the two APIs is +that here the interval can be specified using date/time expressions. 
Time-based +data requires special support because time-based intervals are not always a +fixed length. + +==== Setting intervals + +There seems to be no limit to the creativity we humans apply to setting our +clocks and calendars. We've invented leap years and leap seconds, standard and +daylight savings times, and timezone offsets of 30 or 45 minutes rather than a +full hour. While these creations help keep us in sync with the cosmos and our +environment, they can make specifying time intervals accurately a real challenge. +The only universal truth our researchers have yet to disprove is that a +millisecond is always the same duration, and a second is always 1000 milliseconds. +Beyond that, things get complicated. + +Generally speaking, when you specify a single time unit, such as 1 hour or 1 day, you +are working with a _calendar interval_, but multiples, such as 6 hours or 3 days, are +_fixed-length intervals_. + +For example, a specification of 1 day (1d) from now is a calendar interval that +means "at +this exact time tomorrow" no matter the length of the day. A change to or from +daylight savings time that results in a 23 or 25 hour day is compensated for and the +specification of "this exact time tomorrow" is maintained. But if you specify 2 or +more days, each day must be of the same fixed duration (24 hours). In this case, if +the specified interval includes the change to or from daylight savings time, the +interval will end an hour sooner or later than you expect. + +There are similar differences to consider when you specify single versus multiple +minutes or hours. Multiple time periods longer than a day are not supported. + +Here are the valid time specifications and their meanings: + +milliseconds (ms) :: +Fixed length interval; supports multiples. + +seconds (s) :: +1000 milliseconds; fixed length interval (except for the last second of a +minute that contains a leap-second, which is 2000ms long); supports multiples. 
+ +minutes (m) :: +All minutes begin at 00 seconds. + +* One minute (1m) is the interval between 00 seconds of the first minute and 00 +seconds of the following minute in the specified timezone, compensating for any +intervening leap seconds, so that the number of minutes and seconds past the +hour is the same at the start and end. +* Multiple minutes (__n__m) are intervals of exactly 60x1000=60,000 milliseconds +each. + +hours (h) :: +All hours begin at 00 minutes and 00 seconds. + +* One hour (1h) is the interval between 00:00 minutes of the first hour and 00:00 +minutes of the following hour in the specified timezone, compensating for any +intervening leap seconds, so that the number of minutes and seconds past the hour +is the same at the start and end. +* Multiple hours (__n__h) are intervals of exactly 60x60x1000=3,600,000 milliseconds +each. + +days (d) :: +All days begin at the earliest possible time, which is usually 00:00:00 +(midnight). + +* One day (1d) is the interval between the start of the day and the start of +the following day in the specified timezone, compensating for any intervening +time changes. +* Multiple days (__n__d) are intervals of exactly 24x60x60x1000=86,400,000 +milliseconds each. + +weeks (w) :: + +* One week (1w) is the interval between the start day_of_week:hour:minute:second +and the same day of the week and time of the following week in the specified +timezone. +* Multiple weeks (__n__w) are not supported. + +months (M) :: + +* One month (1M) is the interval between the start day of the month and time of +day and the same day of the month and time of the following month in the specified +timezone, so that the day of the month and time of day are the same at the start +and end. +* Multiple months (__n__M) are not supported. 
+ +quarters (q) :: + +* One quarter (1q) is the interval between the start day of the month and +time of day and the same day of the month and time of day three months later, +so that the day of the month and time of day are the same at the start and end. + +* Multiple quarters (__n__q) are not supported. + +years (y) :: + +* One year (1y) is the interval between the start day of the month and time of +day and the same day of the month and time of day the following year in the +specified timezone, so that the date and time are the same at the start and end. + +* Multiple years (__n__y) are not supported. + +NOTE: +In all cases, when the specified end time does not exist, the actual end time is +the closest available time after the specified end. + +Widely distributed applications must also consider vagaries such as countries that +start and stop daylight savings time at 12:01 A.M., so end up with one minute of +Sunday followed by an additional 59 minutes of Saturday once a year, and countries +that decide to move across the international date line. Situations like +that can make irregular timezone offsets seem easy. + +As always, rigorous testing, especially around time-change events, will ensure +that your time interval specification is +what you intend it to be. + +WARNING: +To avoid unexpected results, all connected servers and clients must sync to a +reliable network time service. + +==== Examples Requesting bucket intervals of a month. @@ -27,13 +144,11 @@ POST /sales/_search?size=0 // CONSOLE // TEST[setup:sales] -Available expressions for interval: `year` (`1y`), `quarter` (`1q`), `month` (`1M`), `week` (`1w`), -`day` (`1d`), `hour` (`1h`), `minute` (`1m`), `second` (`1s`) - -Time values can also be specified via abbreviations supported by <> parsing. -Note that fractional time values are not supported, but you can address this by shifting to another -time unit (e.g., `1.5h` could instead be specified as `90m`). 
Also note that time intervals larger than -days do not support arbitrary values but can only be one unit large (e.g. `1y` is valid, `2y` is not). +You can also specify time values using abbreviations supported by +<> parsing. +Note that fractional time values are not supported, but you can address this by +shifting to another +time unit (e.g., `1.5h` could instead be specified as `90m`). [source,js] -------------------------------------------------- @@ -52,15 +167,16 @@ POST /sales/_search?size=0 // CONSOLE // TEST[setup:sales] -==== Keys +===== Keys Internally, a date is represented as a 64 bit number representing a timestamp -in milliseconds-since-the-epoch. These timestamps are returned as the bucket -++key++s. The `key_as_string` is the same timestamp converted to a formatted -date string using the format specified with the `format` parameter: +in milliseconds-since-the-epoch (01/01/1970 midnight UTC). These timestamps are +returned as the ++key++ name of the bucket. The `key_as_string` is the same +timestamp converted to a formatted +date string using the `format` parameter specification: -TIP: If no `format` is specified, then it will use the first date -<> specified in the field mapping. +TIP: If you don't specify `format`, the first date +<> specified in the field mapping is used. [source,js] -------------------------------------------------- @@ -113,15 +229,15 @@ Response: -------------------------------------------------- // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/] -==== Time Zone +===== Timezone Date-times are stored in Elasticsearch in UTC. By default, all bucketing and -rounding is also done in UTC. The `time_zone` parameter can be used to indicate -that bucketing should use a different time zone. +rounding is also done in UTC. Use the `time_zone` parameter to indicate +that bucketing should use a different timezone. -Time zones may either be specified as an ISO 8601 UTC offset (e.g. 
`+01:00` or -`-08:00`) or as a timezone id, an identifier used in the TZ database like -`America/Los_Angeles`. +You can specify timezones as either an ISO 8601 UTC offset (e.g. `+01:00` or +`-08:00`) or a timezone ID as specified in the IANA timezone database, +such as `America/Los_Angeles`. Consider the following example: @@ -151,7 +267,7 @@ GET my_index/_search?size=0 --------------------------------- // CONSOLE -UTC is used if no time zone is specified, which would result in both of these +If you don't specify a timezone, UTC is used. This would result in both of these documents being placed into the same day bucket, which starts at midnight UTC on 1 October 2015: @@ -174,8 +290,8 @@ on 1 October 2015: --------------------------------- // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/] -If a `time_zone` of `-01:00` is specified, then midnight starts at one hour before -midnight UTC: +If you specify a `time_zone` of `-01:00`, midnight in that timezone is one hour +after midnight UTC: [source,js] --------------------------------- @@ -223,28 +339,27 @@ second document falls into the bucket for 1 October 2015: // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/] <1> The `key_as_string` value represents midnight on each day - in the specified time zone. + in the specified timezone. WARNING: When using time zones that follow DST (daylight savings time) changes, buckets close to the moment when those changes happen can have slightly different -sizes than would be expected from the used `interval`. +sizes than you would expect from the used `interval`. For example, consider a DST start in the `CET` time zone: on 27 March 2016 at 2am, -clocks were turned forward 1 hour to 3am local time. When using `day` as `interval`, +clocks were turned forward 1 hour to 3am local time. 
If you use `day` as `interval`, the bucket covering that day will only hold data for 23 hours instead of the usual -24 hours for other buckets. The same is true for shorter intervals like e.g. 12h. -Here, we will have only a 11h bucket on the morning of 27 March when the DST shift +24 hours for other buckets. The same is true for shorter intervals, like 12h, +where you'll have only an 11h bucket on the morning of 27 March when the DST shift happens. +===== Offset -==== Offset - -The `offset` parameter is used to change the start value of each bucket by the +Use the `offset` parameter to change the start value of each bucket by the specified positive (`+`) or negative offset (`-`) duration, such as `1h` for an hour, or `1d` for a day. See <> for more possible time duration options. -For instance, when using an interval of `day`, each bucket runs from midnight -to midnight. Setting the `offset` parameter to `+6h` would change each bucket +For example, when using an interval of `day`, each bucket runs from midnight +to midnight. Setting the `offset` parameter to `+6h` changes each bucket to run from 6am to 6am: [source,js] @@ -301,12 +416,13 @@ documents into buckets starting at 6am: ----------------------------- // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/] -NOTE: The start `offset` of each bucket is calculated after the `time_zone` +NOTE: The start `offset` of each bucket is calculated after `time_zone` adjustments have been made. 
-==== Keyed Response +===== Keyed Response -Setting the `keyed` flag to `true` will associate a unique string key with each bucket and return the ranges as a hash rather than an array: +Setting the `keyed` flag to `true` associates a unique string key with each +bucket and returns the ranges as a hash rather than an array: [source,js] -------------------------------------------------- @@ -358,20 +474,25 @@ Response: -------------------------------------------------- // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/] -==== Scripts +===== Scripts -Like with the normal <>, both document level scripts and -value level scripts are supported. It is also possible to control the order of the returned buckets using the `order` -settings and filter the returned buckets based on a `min_doc_count` setting (by default all buckets between the first -bucket that matches documents and the last one are returned). This histogram also supports the `extended_bounds` -setting, which enables extending the bounds of the histogram beyond the data itself (to read more on why you'd want to -do that please refer to the explanation <>). +As with the normal <>, +both document-level scripts and +value-level scripts are supported. You can control the order of the returned +buckets using the `order` +settings and filter the returned buckets based on a `min_doc_count` setting +(by default all buckets between the first +bucket that matches documents and the last one are returned). This histogram +also supports the `extended_bounds` +setting, which enables extending the bounds of the histogram beyond the data +itself. For more information, see +<>. -==== Missing value +===== Missing value -The `missing` parameter defines how documents that are missing a value should be treated. -By default they will be ignored but it is also possible to treat them as if they -had a value. 
+The `missing` parameter defines how to treat documents that are missing a value. +By default, they are ignored, but it is also possible to treat them as if they +have a value. [source,js] -------------------------------------------------- @@ -391,20 +512,22 @@ POST /sales/_search?size=0 // CONSOLE // TEST[setup:sales] -<1> Documents without a value in the `publish_date` field will fall into the same bucket as documents that have the value `2000-01-01`. +<1> Documents without a value in the `publish_date` field will fall into the +same bucket as documents that have the value `2000-01-01`. -==== Order +===== Order -By default the returned buckets are sorted by their `key` ascending, though the order behaviour can be controlled using -the `order` setting. Supports the same `order` functionality as the <>. +By default, the returned buckets are sorted by their `key` ascending, but you can +control the order using +the `order` setting. This setting supports the same `order` functionality as +<>. deprecated[6.0.0, Use `_key` instead of `_time` to order buckets by their dates/keys] -==== Use of a script to aggregate by day of the week +===== Using a script to aggregate by day of the week -There are some cases where date histogram can't help us, like for example, when we need -to aggregate the results by day of the week. -In this case to overcome the problem, we can use a script that returns the day of the week: +When you need to aggregate the results by day of the week, use a script that +returns the day of the week: [source,js] @@ -452,5 +575,5 @@ Response: -------------------------------------------------- // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/] -The response will contain all the buckets having as key the relative day of -the week: 1 for Monday, 2 for Tuesday... 7 for Sunday. +The response will contain all the buckets having the relative day of +the week as the key: 1 for Monday, 2 for Tuesday... 
7 for Sunday.
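
Note on the calendar versus fixed-length distinction this patch documents: it can be checked outside Elasticsearch with a few lines of Python. The sketch below is only an illustration and does not use any Elasticsearch API. It assumes Python 3.9+ with system timezone data, and it uses `Europe/Berlin` as a stand-in for the `CET` zone in the DST example. The helper names `calendar_day_hours` and `fixed_24h_end` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # standard library in Python 3.9+; needs system tz data


def calendar_day_hours(day_start: datetime, tz: ZoneInfo) -> float:
    """Elapsed hours from local midnight to the next local midnight (like a `1d` calendar interval).

    `day_start` is a naive datetime holding the local wall-clock start of the day.
    """
    start = day_start.replace(tzinfo=tz)
    # Wall-clock arithmetic: same time of day on the next calendar day.
    end = (day_start + timedelta(days=1)).replace(tzinfo=tz)
    # Convert both ends to UTC so the difference is real elapsed time.
    elapsed = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)
    return elapsed / timedelta(hours=1)


def fixed_24h_end(day_start: datetime, tz: ZoneInfo) -> datetime:
    """Local time reached after exactly 24 x 60 x 60 x 1000 ms (a fixed-length interval)."""
    start_utc = day_start.replace(tzinfo=tz).astimezone(timezone.utc)
    return (start_utc + timedelta(hours=24)).astimezone(tz)


if __name__ == "__main__":
    berlin = ZoneInfo("Europe/Berlin")  # assumption: representative zone for the CET example
    print(calendar_day_hours(datetime(2016, 3, 26), berlin))  # ordinary day
    print(calendar_day_hours(datetime(2016, 3, 27), berlin))  # day of the DST start
    print(fixed_24h_end(datetime(2016, 3, 27), berlin))       # fixed 24h from local midnight
```

On the DST start day, the calendar day is one hour shorter than usual, while the fixed 24-hour interval overshoots the next local midnight by an hour. This matches the 23-hour bucket described in the DST warning.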