Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: include precision information in PrecisionTimestamp and PrecisionTimestampTZ literals #659

Merged
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 9 additions & 4 deletions proto/substrait/algebra.proto
Original file line number Diff line number Diff line change
Expand Up @@ -814,10 +814,8 @@ message Expression {
VarChar var_char = 22;
bytes fixed_binary = 23;
Decimal decimal = 24;
// If the precision is 6 or less then this is the microseconds since the UNIX epoch
// If the precision is more than 6 then this is the nanoseconds since the UNIX epoch
uint64 precision_timestamp = 34;
uint64 precision_timestamp_tz = 35;
PrecisionTimestamp precision_timestamp = 34;
PrecisionTimestamp precision_timestamp_tz = 35;
Struct struct = 25;
Map map = 26;
// Timestamp in units of microseconds since the UNIX epoch.
Expand Down Expand Up @@ -859,6 +857,13 @@ message Expression {
int32 scale = 3;
}

message PrecisionTimestamp {
// Sub-second precision, 0 means the value given is in seconds, 3 is milliseconds, 6 microseconds, 9 is nanoseconds
int32 precision = 1;
// Time passed since 1970-01-01 00:00:00.000000 in UTC for PrecisionTimestampTZ and unspecified timezone for PrecisionTimestamp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Time passed since 1970-01-01 00:00:00.000000 in UTC for PrecisionTimestampTZ and unspecified timezone for PrecisionTimestamp
// Time passed since 1970-01-01 00:00:00.000000 in UTC

Sorry, one small clarification. Timestamps are always in UTC, regardless of whether or not they are Timestamp or TimestampTZ.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm is that right? I think the docs for Timestamp and TimestampTZ disagree, also that’s just the difference between the TZ version and the non-TZ version. The TZ version doesn’t carry any timezone with it, but it counts time since utc epoch so it specifies a definite point in time - while the non-TZ timestamp doesn’t, and is only useful relative to another no -TZ timestamp.

At least that’s the way I understand them, and it aligns with Spark and DataFusion timestamp types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while the non-TZ timestamp doesn’t, and is only useful relative to another no -TZ timestamp.

My understanding is that you can still do certain things to a non-TZ timestamp:

  • Display the wall-clock time represented by the timestamp (e.g. 10PM on Thursday, July 11th, 2024)
  • Assign a timezone to it to convert it to a -TZ timestamp

In order for a consumer to do either of those things there needs to be some meaning behind the literal beyond "just some random number that has appropriate relative distance from other values in this column"

In pretty much all cases I'm aware of the non-TZ timstamps are still stored as UTC instants where the wall clock (in a UTC time zone) would display the desired value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In pretty much all cases I'm aware of the non-TZ timstamps are still stored as UTC instants where the wall clock (in a UTC time zone) would display the desired value.

I checked Spark, DataFusion, Parquet and Postgres docs, and as far as I understand them they all say the non-TZ timestamp bears no notion of timezone at all (ie. it's time passed since some epoch, not UTC epoch), while the yes-TZ timestamp is specifically time passed since UTC epoch.

My understanding is that you can still do certain things to a non-TZ timestamp:

  • Display the wall-clock time represented by the timestamp (e.g. 10PM on Thursday, July 11th, 2024)
  • Assign a timezone to it to convert it to a -TZ timestamp

Yes, the non-TZ timestamp does indicate wall clock time - but it will represent the same wall clock time of e.g. 10PM on Thursday, July 11th, 2024 no matter in which timezone you look at it. And you can convert it to a -TZ timestamp by providing information about which timezone's wall clock time it corresponds to.

In order for a consumer to do either of those things there needs to be some meaning behind the literal beyond "just some random number that has appropriate relative distance from other values in this column"

Well yes, it's not just a random number, but time passed since 1.1.1970.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well yes, it's not just a random number, but time passed since 1.1.1970.

Ok. So if a producer encodes the literal "10PM on Thursday, July 11th, 2024" as 1720735200 (seconds since 1.1.1970 in UTC) and my consumer encodes the literal "10PM on Thursday, July 11th, 2024" as 1720760400 (seconds since 1.1.1970 in PT) then I think we are going to have a problem.

Often people say "seconds since the UNIX epoch" but the UNIX epoch is defined as 1.1.1970 UTC and not 1.1.1970 in some arbitrary time zone.

the non-TZ timestamp bears no notion of timezone at all

Correct. We are supposed to think of these values semantically as a wall clock time. However, we still need to agree on how we encode wall clock times into a proto literal. If we represent wall clock times as ISO-8601 strings (with no TZ part) then it would be fine. There would be no need to mention time zones. However, if we choose to encode wall clock times as a u64 then we must mention how that works. The way that works is by encoding "the wall clock time X" as "seconds since the epoch (UTC) where a wall clock in a UTC time zone would display X"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't necessitate specifying UTC. You can as well say "seconds since the epoch (in arbitrary timezone TZ) where a wall clock in the same timezone TZ would display X". That's what I assume is the point of the no-TZ timestamp :)

Very well then. You are a consumer. I have sent you a plan with a PrecisionTimestamp literal. The proto is:

{
  precision: 0,
  value: 1720794104
}

Can you please format this as a string of the form MONTH/DAY/YEAR HOURS:MINUTES:SECONDS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very well then. You are a consumer. I have sent you a plan with a PrecisionTimestamp literal. The proto is:

{
  precision: 0,
  value: 1720794104
}

Can you please format this as a string of the form MONTH/DAY/YEAR HOURS:MINUTES:SECONDS?

I'm not sure I understand where you're getting with this, but I believe that'd be July 12, 2024 14:21:44 (well, with a slightly different format string).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that answer is wrong. The correct answer is July 12, 2024 7:21:44 AM.

I believe I followed the encoding rules correctly. 1720794104 is the number of seconds that have passed since 1.1.1970 PT.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think I see my mistake. I'm thinking we are interpreting this as

"The time shown on a wall clock in UTC after X seconds have passed since Y" (in which case Y's timezone matters)

You are perhaps interpreting it as

"The time shown on a wall clock in timezone Z after X seconds have passed since Y" (in which case Y's time zone doesn't matter as long as Z matches it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right in how I interpret it 👍

uint64 value = 2;
}

message Map {
message KeyValue {
Literal key = 1;
Expand Down
4 changes: 3 additions & 1 deletion proto/substrait/type.proto
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ message Type {
FixedBinary fixed_binary = 23;
Decimal decimal = 24;
PrecisionTimestamp precision_timestamp = 33;
PrecisionTimestampTZ precision_timestamp_tz = 34;
PrecisionTimestampTZ precision_timestamp_tz = 34; // value is since UNIX epoch in UTC

Struct struct = 25;
List list = 27;
Expand Down Expand Up @@ -164,13 +164,15 @@ message Type {
}

message PrecisionTimestamp {
// Sub-second precision, 0 means the value given is in seconds, 3 is milliseconds, 6 microsecods, 9 is nanoseconds
// Defaults to 6
int32 precision = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to these changes, but how does Defaults to 6 work here given that default value if precision is not set will be 0, and we can't distinguish between

  • 0 was set to indicate seconds.
  • 0 was not set intentionally

Should we remove this entirely?
@westonpace

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I did mean to remove that, as I don't think it works, like you say - 666aa60. Hope that's better!

(The commit did remove your approvals though @vbarua @westonpace )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I agree that a "default" can't exist here. The default of 6 is really "if you see a Timestamp or TimestampTz type then you can treat it as a PrecisionTimestamp or PrecisionTimestampTz type where the precision = 6.

uint32 type_variation_reference = 2;
Nullability nullability = 3;
}

message PrecisionTimestampTZ {
// Sub-second precision, 0 means the value given is in seconds, 3 is milliseconds, 6 microsecods, 9 is nanoseconds
// Defaults to 6
int32 precision = 1;
uint32 type_variation_reference = 2;
Expand Down
Loading