Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: inconsistent results for YEAR method #40162

Closed
astefan opened this issue Mar 18, 2019 · 13 comments · Fixed by #47558
Closed

SQL: inconsistent results for YEAR method #40162

astefan opened this issue Mar 18, 2019 · 13 comments · Fixed by #47558
Assignees

Comments

@astefan
Copy link
Contributor

astefan commented Mar 18, 2019

For the following test data:

{"index":{}}
{"certified":"2019-02-21T10:44:11.857Z"}
{"index":{}}
{"certified":"2018-11-30T10:44:11.857Z"}
{"index":{}}
{"certified":"2019-01-14T10:44:11.857Z"}
{"index":{}}
{"certified":"2019-03-15T10:44:11.857Z"}
{"index":{}}
{"certified":"2018-07-31T10:44:11.857Z"}
{"index":{}}
{"certified":"2018-08-21T15:44:11.857Z"}
{"index":{}}
{"certified":"2019-03-11T15:44:11.857Z"}

and query SELECT count(*), YEAR(certified) year FROM certification WHERE certified > 0 GROUP BY year we get:

   count(*)    |     year      
---------------+---------------
3              |2017           
4              |2018           

But for a slightly modified query (by adding another grouping): SELECT count(*), YEAR(certified) year FROM certification WHERE certified > 0 GROUP BY year, certified we get:

   count(*)    |     year      
---------------+---------------
1              |2018           
1              |2018           
1              |2018           
1              |2019           
1              |2019           
1              |2019           
1              |2019           
@astefan astefan added the :Analytics/SQL SQL querying label Mar 18, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

@matriv
Copy link
Contributor

matriv commented Mar 24, 2019

@astefan I don't get why this is an issue, it's expected behaviour.
You're adding the certified column which has unique value for all 7 entries.
To clarify the results you can add the certified to the select list: SELECT count(*), YEAR(certified) year, certified FROM test WHERE certified > 0 GROUP BY year, certified you get:

   count(*)    |     year      |       certified
---------------+---------------+------------------------
1              |2018           |2018-07-31T10:44:11.857Z
1              |2018           |2018-08-21T15:44:11.857Z
1              |2018           |2018-11-30T10:44:11.857Z
1              |2019           |2019-01-14T10:44:11.857Z
1              |2019           |2019-02-21T10:44:11.857Z
1              |2019           |2019-03-11T15:44:11.857Z
1              |2019           |2019-03-15T10:44:11.857Z

which is the expected result.

@astefan
Copy link
Contributor Author

astefan commented Mar 25, 2019

Look carefully at the first and second tests results for YEAR. In the first one you have 2017, while in the second you don't but you have 2019. It's about how intervals work in ES and @costin mentioned this is the correct behavior, but imo it's an unexpected behavior from end user point of view and we'll probably hear about this some more in the future.

@matriv
Copy link
Contributor

matriv commented Mar 25, 2019

Sorry, completely missed that. and I now I remember the discussion on this.
Imho, the YEAR() should behave the same way, regardless where and how it's used. If it's in the GROUP BY as the only grouping key it shouldn't change behavior and transform into an INTERVAL 1 year histogram. It should still extract the year as a number and use it for the grouping via a painless script as usual.

@astefan
Copy link
Contributor Author

astefan commented Jun 24, 2019

One more thing I'd like to mention, not to lose track of it: the HISTOGRAM using a 1 YEAR interval it's creating a date_histogram with 31104000000ms (from SELECT HISTOGRAM(date, INTERVAL 1 YEAR) AS h, COUNT(*) AS c FROM test GROUP BY h) while a GROUP BY YEAR(date) is creating a date_histogram with 31536000000ms. The former is computed as 12 months having 30 days and then coming up with a milliseconds number, while the latter uses 365 days year and then reaching the milliseconds number. While I understand why it's like this, I think it's small inconsistency in how we approach these scenarios.

@sophiec20
Copy link
Contributor

From an end-user perspective, I am finding the YEAR method kinda strange. My SQL background expects this to be the calendar year of the data i.e. datepart(date,'yyyy').

The following results were returned for the same data. Results B and D (which both use calendar_interval date histogram) is what I would expect to be returned. (I don't quite understand results A and C.)

[A] Using SQL

GET _sql?format=txt
{
  "query": "select YEAR(\"@timestamp\"), count(*) from \"gallery-*\" group by YEAR(\"@timestamp\")"
}

YEAR("@timestamp")|   count(*)    
------------------+---------------       
2017              |174004         
2018              |25442794  

[B] Using calendar_interval agg

GET gallery-*/_search?track_total_hits=true
{
  "size": 0,
  "aggs": {
    "time": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "1y"
      }
    }
  }
}

# results
{
  "took" : 27,
  "timed_out" : false,
  "_shards" : {
    "total" : 29,
    "successful" : 29,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 25616799,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "time" : {
      "buckets" : [
        {
          "key_as_string" : "2018-01-01T00:00:00.000Z",
          "key" : 1514764800000,
          "doc_count" : 240539
        },
        {
          "key_as_string" : "2019-01-01T00:00:00.000Z",
          "key" : 1546300800000,
          "doc_count" : 25376259
        }
      ]
    }
  }
}

[C] Using fixed_interval aggs

GET gallery-*/_search?track_total_hits=true
{
  "size": 0,
  "aggs": {
    "time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "365d"
      }
    }
  }
}

# results
{
  "took" : 2225,
  "timed_out" : false,
  "_shards" : {
    "total" : 29,
    "successful" : 29,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 25616799,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "time" : {
      "buckets" : [
        {
          "key_as_string" : "2017-12-20T00:00:00.000Z",
          "key" : 1513728000000,
          "doc_count" : 174004
        },
        {
          "key_as_string" : "2018-12-20T00:00:00.000Z",
          "key" : 1545264000000,
          "doc_count" : 25442794
        }
      ]
    }
  }
}

[D] Using Kibana (note it is not possible to select histogram with a date field)
image

Found in 7.3.0 "build_hash" : "f8fd432", "build_date" : "2019-07-03T15:05:06.452272Z"

@sophiec20
Copy link
Contributor

looks like this will probably be resolved by #43922

@astefan
Copy link
Contributor Author

astefan commented Jul 4, 2019

@sophiec20 SQL uses, at the moment, an interval expressed in milliseconds to represent a year. And this interval is a fixed one, not counting leap years. If we take into consideration #43922, it's basically a fixed_interval.

So, the way it works now (with fixed_interval) in SQL is, imagine time starting on January 1st, 1970 and start adding to this date the number of milliseconds in a 365 days year - 31536000000ms. And every time you add that number of millis a bucket is created. 49th bucket starts on Dec 20th, 2018 and all the dates from Dec 20th, 2018 until Dec 20th, 2019 fall in the 2018 bucket. Same goes for all the dates from Dec 20th, 2017 to Dec 20th, 2018 - the will fall in the 2017 bucket.

#43922 will not fix this. The actual fix is to switch from using a 31536000000ms interval to a 1y calendar_interval one.

@astefan
Copy link
Contributor Author

astefan commented Aug 29, 2019

@solidpulse
Copy link

Yeah same issue here. After recommending ES for a new project it was a bit embarrassing to see basic queries returning wrong results lol.

but anyway, is this the best way to solve it for now ?
HISTOGRAM(year(timestamp), 1)

is there an expected patch/release date for this fix?

@skiod
Copy link

skiod commented Oct 3, 2019

@avierr

HISTOGRAM(year(timestamp), 1)

can replace YEAR ?

@solidpulse
Copy link

@avierr

HISTOGRAM(year(timestamp), 1)

can replace YEAR ?

Atleast I get a similar result, but instead of 2019, I get 2019.0

@astefan
Copy link
Contributor Author

astefan commented Oct 9, 2019

master (8.0.0): 55f5463
7.x (7.5.0): 75a7daa

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants