[SIP-15] Transparent and Consistent Time Intervals #6360
Comments
Here's one more example where the current behavior can be surprising and undesired: We built an aggregation pipeline where one day's data is aggregated to |
+1 on relative expressions showing what they evaluate to instantly in the control. About the inclusive
That way:
@mistercrunch I don't believe that suggestion would work as it wouldn't be apparent to the user which of the time intervals were being enforced and thus could potentially add more confusion. I suspect we need to do something similar to the dashboard redesign rollout and banner the charts appropriately. |
@john-bodley I think it could be made very apparent to the user, perhaps in the control's label itself. For instance the label could show the |
I was hoping to get feedback on this SIP as I feel this is a fairly egregious error and there is merit in addressing it sooner rather than later. Additionally at Airbnb we're in the process of migrating away from the Druid REST API in favor of Druid SQL and hence would like to remedy this problem prior to the migration (to ensure consistency).
Note @mistercrunch I feel that the suggestion you provided still doesn't completely alleviate the problem as it doesn't address the fact that we're mixing dates with timestamps and using lexicographical ordering which means that for SQLAlchemy datasources the time period is a mix of (1) and (4) depending on the underlying temporal data types. Additionally I wonder if introducing a mechanism for having the option to use either
This may be a bitter pill to swallow, but given the complexity of the problem (unknown user intent, column datatypes, etc.) the only possible feasible solution may be to make a (somewhat severe) breaking change in an upcoming release which:
Note it would then be up to each institution to send out a PSA to their users informing them of the breaking change and inform chart owners how to modify their charts if necessary. |
Regarding next steps I was planning on polling a number of users (mostly producers) at Airbnb via a questionnaire to gauge their understanding on how they think dates/times work with the hope that this can help inform us regarding how to remedy the situation. |
Here's some findings from a survey which was given to a group of data scientists at Airbnb. The results are based on 46 responses. Q1. Imagine you had data for the entire month of January (encoded as dates, e.g. 2019-01-01). Which of the following intervals make most sense?
Q2. Imagine you had data for the entire month of January (encoded as times, e.g., 2019-01-01 01:23:45.678). Which of the following intervals make the most sense?
Q3. Are you aware of any discrepancies with how Superset filters dates or times for SQL databases (Hive, Presto, etc.)?
Q4. Could you please outline which discrepancies you are aware of?
Q5. Are you aware of any discrepancies with how Superset filters dates or times for Druid datasources?
Q6. Could you please outline which discrepancies you are aware of?
Q7. Currently Superset handles all temporal data as time. Should Superset support both date and time types?
TL;DR
After researching other BI tools and surveying content creators we propose to address SIP-15 by making all temporal intervals be
Temporal intervals
Whilst researching other BI tools it was somewhat difficult to determine how they handled intervals, given the lack of documentation. To the best of my knowledge it seems that Looker uses
People generally think about dates (discrete) as having an inclusive start/end, i.e., if one were to read a sign "Sale ends Saturday" one would expect that the sale would run through close of business on Saturday as opposed to Friday. Times (continuous) however are different and are often thought of as having an exclusive end, i.e., a meeting scheduled from 9 - 10 am really is
Given that we foresee time becoming more prevalent I believe it would be a mistake to make date intervals
For more detail see this post. Returning to the inclusive dates this can actually be problematic for leap years, i.e.,
Note this decision is not in agreement with the majority of the survey respondents, however it is worth noting that there was support for the
We suggest the following changes:
Implementation
Changing the SQL connector from
† This results in additional code logic but it seems like the complexity is worthwhile given the impact of the change and the benefit in being able to see a visual diff of the change. Note this change will mostly impact charts where the time isn't explicitly shown (bar, pie, etc.). Given this mostly impacts instances where the granularity represents a rollup there may be merit in having a tooltip/alert on the chart to indicate that a period contains 6 or 8 days of data say as opposed to 7 (assuming one was aggregating at the weekly level). cc: @betodealmeida @graceguo-supercat @michellethomas @mistercrunch @soboko @sylvia-tomiyama @vylc @xtinec |
Motivation
Some of the core tenets of any BI tool are consistency and accuracy, however we have seen instances where the time interval provides misleading or potentially incorrect results due to the nuances in how Superset currently i) uses different logic depending on the connector, and ii) mixes datetime resolutions.
The time-interval conundrum
A datetime or timestamp (time for short) interval is defined by a start and end time and whether the limits are inclusive ([, ]) or exclusive ((, )). This leads to four possible time interval definitions:
(1) [start, end]
(2) [start, end)
(3) (start, end)
(4) (start, end]
The underlying issue is it is not apparent to the user in the UI when they specify a time range which definition Superset is invoking. Sadly the answer depends on the connector (SQLAlchemy or Druid REST API) and in the case of SQLAlchemy engines the underlying structure of the datasource. This leads to several issues:
The latter is especially concerning for chart types which obfuscate the temporal component or for time grains which use aggregation and thus potentially provide incorrect results, i.e., a user may think they are aggregating a week's worth of data (seven days) whereas the reality is they are only aggregating six or eight days due to the inclusion/exclusion logic.
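To make the six/seven/eight day discrepancy concrete, here is a minimal Python sketch that applies each of the four interval definitions to one row per day. The dates and the helper name `count_days` are illustrative assumptions, not anything taken from Superset.

```python
from datetime import date, timedelta
import operator

# The four interval definitions, in the order (1)-(4) used in the text.
DEFINITIONS = {
    "[start, end]": (operator.ge, operator.le),
    "[start, end)": (operator.ge, operator.lt),
    "(start, end)": (operator.gt, operator.lt),
    "(start, end]": (operator.gt, operator.le),
}


def count_days(days, start, end, definition):
    """Count the days that fall inside the interval under one definition."""
    lower, upper = DEFINITIONS[definition]
    return sum(1 for d in days if lower(d, start) and upper(d, end))


# Hypothetical "one week" range: the user picks 2018-01-01 to 2018-01-08.
start = date(2018, 1, 1)
end = start + timedelta(days=7)

# One row per day, 2018-01-01 through 2018-01-08 inclusive.
days = [start + timedelta(days=i) for i in range(8)]

counts = {name: count_days(days, start, end, name) for name in DEFINITIONS}
for name, n in counts.items():
    print(f"{name}: {n} days")
```

Only the half-open definitions yield seven days; the fully closed one yields eight and the fully open one six.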
Druid REST API
The Druid REST API uses the [start, end) definition (2) for time intervals. This is not explicitly mentioned in their documentation, though it is defined here.
SQLAlchemy
The SQLAlchemy engines use the [start, end] definition (1), i.e., the time filter limits are defined via >= and <= conditions respectively. Note however that, unbeknown to the user, the filter may behave like (start, end] (4). Why? The reason is the potential mixing of dates and timestamps combined with lexicographical ordering when evaluating clauses, i.e., assume that the time column ds is defined as a date string, then a filter of the form,
for a set of ds (datestamp) values of [2018-01-01, 2018-01-02] results in,
and
respectively. Due to the lexicographical order the [start actually acts like (start which is probably not what the user expected.
Note this is especially problematic for relative time periods such as Last week (which is relative to today) if your time column is a datestamp, as in most cases the window would only contain six (rather than seven) days of data. Why? Because the [start, end] behaves like (start, end] and the data associated with the end date doesn't exist yet. Additionally making the end limit inclusive is actually misleading as the times are supposed to be relative to today which implies exclusive of today.
Proposed Change
I propose there are two things we need to solve:
Consistency
Which of the four definitions makes the most sense? I propose Druid's definition of [start, end) (2) makes the most sense as it is guaranteed to capture the entire time interval regardless of the time resolution in the data, i.e., for SQL a 24 hour interval would be expressed as:
The reason not to opt for [start, end] (1) is that this 24 hour interval could potentially be expressed as:
however that assumes that the finest granularity at which the time column is defined is seconds. In the case of milliseconds it wouldn't capture most of the last second in the 24 hour period. Also the [start, end) definition ensures that adjacent time periods neither overlap nor leave gaps, i.e.,
[2018-01-01 00:00:00, 2018-01-02 00:00:00)
[2018-01-02 00:00:00, 2018-01-03 00:00:00)
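The two arguments above can be checked with a short Python sketch. It assumes the closed form of the 24 hour interval ends at 23:59:59 (the original query is not reproduced here), and the event timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

day_start = datetime(2018, 1, 1)
next_day = day_start + timedelta(days=1)
# Assumed upper bound of the closed [start, end] form at second resolution.
last_second = datetime(2018, 1, 1, 23, 59, 59)

# A millisecond-resolution event in the last second of the day (23:59:59.500).
event = datetime(2018, 1, 1, 23, 59, 59, 500000)

in_closed = day_start <= event <= last_second  # misses the event
in_half_open = day_start <= event < next_day   # captures the event

# Adjacent periods: an event exactly on the shared boundary should be
# counted once, not zero times and not twice.
periods = [(day_start, next_day), (next_day, next_day + timedelta(days=1))]
boundary_event = next_day
half_open_hits = sum(1 for s, e in periods if s <= boundary_event < e)
closed_hits = sum(1 for s, e in periods if s <= boundary_event <= e)

print(in_closed, in_half_open, half_open_hits, closed_hits)
```

With half-open periods the boundary event lands in exactly one period, whereas closed periods count it in both.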
Secondly most relative time periods (Last day, Last week, etc.) are from today at 12:00:00 am (exceptions include things like Last 24 hours which is relative to now). What's important is that these are implicitly exclusive of the reference time, i.e., we are looking at a previous period, hence why end) really makes the most sense.
Finally how do we address the lexicographical issue caused by the mixing of date and timestamp strings? Given that we explicitly define the time as time (rather than date) we should enforce that all time columns are cast/converted to a timestamp, i.e., in the case of Presto either,
(preferred) or
work.
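The lexicographical problem and the effect of casting can be reproduced outside of SQL. The following Python sketch compares date strings against timestamp strings, then repeats the filter after parsing the dates into timestamps. The bound values are illustrative assumptions, and Python's `datetime.strptime` merely stands in for whatever cast the database engine provides.

```python
from datetime import datetime

# Time column stored as date strings, as in the ds example.
ds_values = ["2018-01-01", "2018-01-02"]

# Filter bounds rendered as timestamp strings (assumed values).
start, end = "2018-01-01 00:00:00", "2018-01-02 00:00:00"

# String comparison: "2018-01-01" is a prefix of the start bound, so it sorts
# before it and is excluded. [start, end] silently behaves like (start, end].
string_matches = [ds for ds in ds_values if start <= ds <= end]

# Casting both sides to timestamps restores the intended semantics.
start_ts = datetime.strptime(start, "%Y-%m-%d %H:%M:%S")
end_ts = datetime.strptime(end, "%Y-%m-%d %H:%M:%S")


def to_ts(ds):
    """Parse a date string as a timestamp at midnight."""
    return datetime.strptime(ds, "%Y-%m-%d")


closed_matches = [ds for ds in ds_values if start_ts <= to_ts(ds) <= end_ts]
half_open_matches = [ds for ds in ds_values if start_ts <= to_ts(ds) < end_ts]

print(string_matches, closed_matches, half_open_matches)
```

Once both sides are timestamps, the [start, end) filter returns exactly the one day the range spans.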
Transparency
I sense we need to improve the UI to explicitly call out:
The interval definition
I sense a tooltip here would suffice, simply mentioning that the start is inclusive and the end is exclusive.
The relative time
Both defaults and custom time periods are relative to some time. The custom time mentions "Relative to today" however it isn't clear if today means a time (now, 12:00:00 am, etc.) or a date. Furthermore defaults make no mention of what the reference time is.
In most instances it's today at 12:00:00 am and there seems to be merit in explicitly calling this out. Additionally there may be merit in having an asterisk (or similar) when a relative time period is chosen, i.e., Last 7 days * to help ground the reference.
Note the mocks below are not correct when referring to relative periods where the time unit is less than a day, i.e., for any quantity of second, minute, or hour (say 48 hours) the reference time is now and thus the text should update according to the unit selected.
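As a rough sketch of how a relative period could be resolved and surfaced to the user, the Python below picks the reference time by unit (now for sub-day units, today at 12:00:00 am otherwise) and renders the resulting [start, end) interval as text. The function names `resolve_relative_range` and `describe` are hypothetical and do not correspond to Superset's actual implementation.

```python
from datetime import datetime, timedelta

# Units whose reference time is "now" rather than today at 12:00:00 am.
SUB_DAY_UNITS = {"seconds", "minutes", "hours"}


def resolve_relative_range(quantity, unit, now):
    """Resolve 'Last <quantity> <unit>' to a (start, end) pair, read as [start, end)."""
    if unit in SUB_DAY_UNITS:
        reference = now
    else:
        reference = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return reference - timedelta(**{unit: quantity}), reference


def describe(start, end):
    """Text a tooltip could show: inclusive start, exclusive end."""
    return f"{start.isoformat(sep=' ')} <= time < {end.isoformat(sep=' ')}"


now = datetime(2018, 1, 8, 15, 30)  # fixed 'now' so the output is reproducible
last_7_days = resolve_relative_range(7, "days", now)
last_48_hours = resolve_relative_range(48, "hours", now)

print(describe(*last_7_days))
print(describe(*last_48_hours))
```

Showing the resolved bounds next to the control would make both the reference time and the inclusive/exclusive limits visible without the user needing to know the rules.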
Concerns
I sense there are potentially three major concerns:
Migrations
If you asked a producer of a chart which of the four time interval definitions was being adhered to you would get the full gamut of responses, i.e., it's not evident to them exactly what the current logic is and thus it's not evident to us how we would migrate time intervals which used an absolute time for either the start or end. I sense the only solution here is to realize that this is a breaking change which, though challenging, provides more transparency and consistency in the future. An institution would probably want to inform their customers of such a change via a PSA or similar.
Performance
At Airbnb our primary SQLAlchemy engine is Presto where the underlying table is partitioned by datestamp (denoted as ds). One initial concern I had was that enforcing the time column to represent a timestamp, using a combination of Presto's date and time functions, would require a full table scan, i.e., the query planner would not be able to deduce which partitions to use, which would not be performant.
Running an EXPLAIN on the following query,
results in a query plan consisting of a filter for only the 2018-01-01 ds partition,
which means the Presto engine can correctly deduce which partitions to scan. Note I'm unsure if this holds true for all engines.
Dates vs. Datetimes (Timestamps)
Is there merit in differentiating between dates (2018-01-01) and datetimes (2018-01-01 00:00:00)? Dates are discrete whereas timestamps are continuous and thus the perception of the interval may differ. Additionally when we think about date intervals we normally think of [start, end], rather than [start, end). For example Q1 is defined as 1 January – 31 March which is inclusive of both the start and end date, i.e., [01/01, 03/31] rather than [01/01, 04/01).
One could argue that Druid is correct by using the [start, end) definition as it deals with timestamps whereas SQLAlchemy datasources, which are probably weighted towards dates, are correct in using the [start, end] definition (excluding the issue with lexicographical ordering). There may be merit in adding explicit support for both dates and datetimes (Tableau supports both) which would require additional UI changes.
New or Changed Public Interfaces
Added clarity to the time range widget.
New dependencies
None.
Migration Plan and Compatibility
There are no planned migrations, however this would be a breaking change.
Rejected Alternatives
None.
to: @betodealmeida @fabianmenges @graceguo-supercat @jeffreythewang @kristw @michellethomas @mistercrunch @timifasubaa @williaster