Skip to content

Variable precision datetime types

Martin Traverso edited this page May 15, 2020 · 7 revisions

Variable-precision datetime types

Goals

  • Support variable precision for the following datetime types
    • TIMESTAMP(p) WITH TIME ZONE
    • TIMESTAMP(p) WITHOUT TIME ZONE
    • TIME(p) WITH TIME ZONE
    • TIME(p) WITHOUT TIME ZONE
  • Maintain backward compatibility for the existing encoding to minimize disruption and changes to existing connectors
  • Maximize performance for common values of p (e.g., the default of p = 6 in the SQL spec, common values of p in connectors)
  • Eventually, bring Presto default in line with SQL standard (p = 6 for TIMESTAMP types and p = 0 for TIME types)

Current state

Current default precision in Presto is 3

The SQL specification defines

  • For TIMESTAMP(p) (both variants), default precision is 6
  • For TIME(p) (both variants), default precision is 0

The range of values for TIMESTAMP is 0001-01-01 to 9999-12-31 => ~10000 * 365 days (+/- leap days)

TODO: survey of common values of p in other DBs and connectors

Assumptions

  • Usages of p other than 0, 3, 6, 9 are unlikely in practice.

Open questions

  • What should be the maximum precision for each type?
  • Should the maximum be the same for all types?

Analysis

TIMESTAMP(p) WITHOUT TIME ZONE

p bits
0 39
1 42
2 45
3 49
4 52
5 55
6 59
7 62
8 65
9 69
10 72
11 75
12 79

Covering the whole range up to p = 12 requires 79 bits. For efficient handling of commonly used values of p (e.g., the default p = 3 in Presto or the default p = 6 in the SQL spec), a dual representation is beneficial, similar to the short vs long decimal treatment.

  • Short encoding: p <= 3
    • Single 64-bit long.
    • Stores units of 10^-3 seconds. Compatible with the current representation when p = 3.
  • Medium encoding 3 < p <= 6
    • Single 64-bit long
    • Stores units of 10^-6 seconds. Theoretically, it can store units of 10^-7 seconds, but for cognitive simplicity, pick a common number (p = 6 = microseconds). Usages of p = 7 are unlikely, so no need to optimize for that case.
  • Long encoding p > 6
    • 64-bit long + 16/32-bits (depending on the maximum precision we want to support)
    • First long stores units of 10^-6 seconds. Second value stores the fractional part beyond that.
    • Alternative: 10 or 12 byte binary value.
      • Option 1: maintain the split of units of 10^-6 seconds vs fractional part. Easier to extract each part efficiently for manipulation
      • Option 2: encode units of 10^-(max p) seconds. Slightly more space efficient, but requires doing decimal division to extract parts for processing.

TIMESTAMP(p) WITH TIME ZONE

Size of TIMESTAMP(p) WITHOUT TIME ZONE + 12 bits for TZ id.

p bits
0 51
1 54
2 57
3 61
4 64
5 67
6 71
7 74
8 77
9 81
10 84
11 87
12 91

TIME(p) WITHOUT TIME ZONE

TODO

TIME(p) WITHOUT TIME ZONE

TODO

Miscellaneous

  • Convert built-in datetime functions
    • CURRENT_TIMESTAMP(p)
    • LOCALTIMESTAMP(p)
    • CURRENT_TIME(p)
    • LOCALTIME(p)
  • Convert types to parametric types
    • TimestampType
    • TimestampWithTimeZoneType
    • TimeType
    • TimeWithTimeZoneType
  • How to introduce parametric types in a backward-compatible way?
  • Add casts from T(p1) to T(p2)
    • Truncation/rounding
    • Store assignment rules (see 9.2.3).xvi).2).b) in SQL:2016)
  • Update casts from T1(p1) to T2(p2)
  • Support for p != 3 in connectors
  • Usages of x instanceof TimestampType
  • Usages of x.equals(TimestampType.TIMESTAMP)
  • Usages of x == TimestampType.TIMESTAMP

Rejected options

  • Scale-aware representation: store the number of 10^-p seconds. This form makes the representation compatible with the current representation when p = 3, but:
    • requires rescaling on cast between different p
    • requires knowing p to interpret values
    • harder to debug