Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propose change timestamp casting with timezone to without timezones (also parsing of timestamps without timezones) #5827

Open
alamb opened this issue May 31, 2024 · 15 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented May 31, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This is in the context of implementing date_bin for timestamps with timezones: apache/datafusion#10602

I made #5826 to document the behavior of casting timestamps and I found it very confusing. Specifically when you cast from Timestamp(None) to Timestamp(Some(tz)) and then back to Timetamp(None) the underlying timestamp values are changed as shown in this example

use arrow_array::Int64Array;
use arrow_array::types::{TimestampSecondType};
use arrow_cast::{cast, display};
use arrow_array::cast::AsArray;
use arrow_schema::{DataType, TimeUnit};
let data_type  = DataType::Timestamp(TimeUnit::Second, None);
let data_type_tz = DataType::Timestamp(TimeUnit::Second, Some("-05:00".into()));
let a = Int64Array::from(vec![1_000_000_000, 2_000_000_000, 3_000_000_000]);
let b = cast(&a, &data_type).unwrap(); // cast to timestamp without timezone
let b = b.as_primitive::<TimestampSecondType>(); // downcast to result type
assert_eq!(2_000_000_000, b.value(1)); // values are still the same

// Convert timestamps without a timezone to timestamps with a timezone
let c = cast(&b, &data_type_tz).unwrap();
let c = c.as_primitive::<TimestampSecondType>(); // downcast to result type
assert_eq!(2_000_018_000, c.value(1)); // value has been adjusted by offset

// Convert from timestamp with timezone back to timestamp without timezone
let d = cast(&c, &data_type).unwrap();
let d = d.as_primitive::<TimestampSecondType>(); // downcast to result type
assert_eq!(2_000_018_000, d.value(1)); // <---- **** THIS VALUE IS DIFFERENT THAN IT WAS INITITALLY
assert_eq!("2033-05-18T08:33:20", display::array_value_to_string(&d, 1).unwrap());

Thus I wanted to discuss if we should change the behavior to make it less surprising or if there was a reason to leave the current behavior

Describe the solution you'd like

I propose making casting timestamp with a timezone to timestamp without a timezone do the inverse of casting timestamp withpit a timezone to timestamp with a timezone

This would mean the final value of d in the above example is 2_000_000_000, not 2_000_018_000

Describe alternatives you've considered
Leave existing behavior

Additional context

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label May 31, 2024
@tustvold
Copy link
Contributor

tustvold commented Jun 1, 2024

This makes sense to me, the arrow specification doesn't specify either way but I think this would be less surprising

@tustvold
Copy link
Contributor

tustvold commented Jun 1, 2024

It turns out I actually flagged this on the original PR that altered the timezone casting logic - #4201 (comment)

The argument that convinced me in the end is that when parsing from a string we convert to UTC

let array = StringArray::from(vec!["2010-01-01T00:00:00.123456+08:00"]);
let data_type = DataType::Timestamp(TimeUnit::Nanosecond, None);
let cast = cast(&array, &data_type).unwrap();
let value = cast.as_primitive::<TimestampNanosecondType>().value_as_datetime(0).unwrap();
assert_eq!(value.to_string(), "2009-12-31 16:00:00.123456");

But perhaps we want to change this also?

@alamb
Copy link
Contributor Author

alamb commented Jun 2, 2024

The argument that convinced me in the end is that when parsing from a string we convert to UTC

I don't quite follow this example.

Given 2010-01-01T00:00:00.123456+08:00 as input, I (naively) would expect that when I cast that to a timestamp without timezone it would look like 2010-01-01T00:00:00.123456, (same thing, no timezone)

It seems in your example that it is cast to 2009-12-31 16:00:00.123456 (likely because the underlying timestamp value was not changed)

Thus this seems like the same behavior and that it would be changed with the proposal.

But I may be missing something

@tustvold
Copy link
Contributor

tustvold commented Jun 2, 2024

Right so you're also advocating changing the parsing behaviour, i.e. casting from strings to timestamps, and not just between timestamps.

I agree these should be kept consistent, hence why I raised it

@alamb alamb changed the title Discussion: Change timestamp casting with timezone to without timezones Discussion: Change timestamp casting with timezone to without timezones (also parsing of timestamps without timezones) Jun 3, 2024
@alamb alamb changed the title Discussion: Change timestamp casting with timezone to without timezones (also parsing of timestamps without timezones) Propose change timestamp casting with timezone to without timezones (also parsing of timestamps without timezones) Jul 3, 2024
@alamb
Copy link
Contributor Author

alamb commented Jul 3, 2024

I posted this proposal to the dev list https://lists.apache.org/thread/b6hhsthy9pqhwmjjkox2lbt4qz9zvlvw and on slack/discord to raise awareness

@joellubi
Copy link
Member

joellubi commented Jul 3, 2024

Thanks for raising this. I think part of the issue is that timestamps without a timezone specifically omit semantics that could be used to convert to/from a specific instant. Given these semantics are lost on conversion, I wouldn't expect the values to roundtrip unless some additional information provided when going from tz-naive to tz-aware.

When casting from tz-aware to tz-naive timestamps, absent any other information, it makes sense to me that physical values are unchanged and the tz is stripped. That's what it currently does. Going the other way leaves more room for interpretation as pointed out by the docs.

Currently casting the other way converts physical values to be UTC-normalized. Perhaps this behavior would be better reserved for a dedicated "conversion" function that takes some additional parameter(s) to specify semantics. Then casting could be treated as a much simpler operation that just adds/strips the specified tz without changing physical values. That would allow your example to roundtrip.

@findepi
Copy link
Member

findepi commented Jul 3, 2024

I propose making casting timestamp with a timezone to timestamp without a timezone do the inverse of casting timestamp withpit a timezone to timestamp with a timezone

in SQL specification

  • casting from timestamp with time zone to timestamp (without time zone) discards zone, and retains year/month/day/hour/minute/second fields
  • casting from timestamp (without zone) to timestamp with time zone retains year/month/day/hour/minute/second fields and adds session time zone

So in standard SQL, one cast is not exactly inverse of the other.

@westonpace
Copy link
Member

I agree with @alamb 's proposal.

I think the question is what the value value=1262332800, unit=seconds, tz="america/los_angeles" represents?

  • Option 1: This is Friday, January 1, 2010 8:00:00 AM in Los Angeles
  • Option 2: This is Friday, January 1, 2010 12:00:00 AM in Los Angeles

Personally, I think option 1 is the most rational choice. The value represents the UTC instant (always UTC, only one choice) and the tz tells you how functions should interpret the value (e.g. how to convert it to a string if that is desired).

Option 2 is strange. The value represents "a wall clock time" and the tz tells you the "timezone at which the wall clock time should be interpreted". The way a wall clock time is stored is "a timestamp (there may be multiple) where a wall clock in the UTC time zone would show the desired wall clock time".

If we pick option 1 then the only thing casting does is change how we want to interpret the value in cases where the time zone matters:

  • Unit extraction (e.g. calculating the day of week)
  • Conversion to string (e.g. to ISO 8601 format)
  • Etc.

The value should never change when casting.

@tustvold
Copy link
Contributor

tustvold commented Jul 3, 2024

I think the question is what the value value=1262332800, unit=seconds, tz="america/los_angeles" represents?

I don't think this is under question, as this is specified by the arrow format specification - https://github.com/apache/arrow/blob/main/format/Schema.fbs#L276.

The question instead concerns how casting between timestamps should behave where one or other lacks a timezone. The specification has the following to say about this:

/// However, if a Timestamp column has no timezone value, changing it to a
/// non-empty value requires to think about the desired semantics.
/// One possibility is to assume that the original timestamp values are
/// relative to the epoch of the timezone being set; timestamp values should
/// then adjusted to the Unix epoch (for example, changing the timezone from
/// empty to "Europe/Paris" would require converting the timestamp values
/// from "Europe/Paris" to "UTC", which seems counter-intuitive but is
/// nevertheless correct).

Which is consistent with what we implement

let a = StringArray::from_iter_values([
    "2033-05-18T08:33:20",
    "2033-05-18T08:33:20Z",
    "2033-05-18T08:33:20 +01:00",
]);

let no_timezone = cast(&a, &DataType::Timestamp(TimeUnit::Nanosecond, None)).unwrap();
let back = cast(&no_timezone, &DataType::Utf8).unwrap();
assert_eq!(
    back.as_string::<i32>(),
    &StringArray::from_iter_values([
        "2033-05-18T08:33:20",
        "2033-05-18T08:33:20",
        "2033-05-18T07:33:20" <---- this issue is proposing changing this to 2033-05-18T08:33:20
    ])
);

let with_timezone = cast(
    &a,
    &DataType::Timestamp(TimeUnit::Nanosecond, Some("+01:00".into())),
)
.unwrap();
let back = cast(&with_timezone, &DataType::Utf8).unwrap();
assert_eq!(
    back.as_string::<i32>(),
    &StringArray::from_iter_values([
        "2033-05-18T08:33:20+01:00",
        "2033-05-18T09:33:20+01:00",
        "2033-05-18T08:33:20+01:00",
    ])
)

Where the specification is ambiguous is what to do when going in the reverse direction, i.e. from a timestamp with a timezone to one without a timezone. Currently we use the time in the UTC epoch. i.e. 2033-05-18T08:33:20 +01:00 -> 2033-05-18T07:33:20. As @alamb rightly points out this is not an inverse of the opposite direction, where 2033-05-18T08:33:20 -> 2033-05-18T08:33:20 +01:00, as recommended by the arrow specification.

The proposal, as far as I understand it, is for 2033-05-18T08:33:20 +01:00 -> 2033-05-18T08:33:20, making the operations "round-trip".

The value should never change when casting.

The arrow format explicitly calls out that the value should change

Edit: It is perhaps worth noting that in the case of a timestamp on the daylight savings boundary, taking the local timestamp instead of UTC as we do currently loses information, and casting back to that timezone will yield an ambiguous timestamp error. I am not sure if this matters.

Edit edit: If we do opt to change as proposed, the old behaviour could be obtained by first converting to the UTC epoch. This would be a strictly metadata operation, as the values are already UTC when a timezone is present. The proposed behaviour is therefore strictly more expressive.

@findepi
Copy link
Member

findepi commented Jul 3, 2024

The proposal, as far as I understand it, is for 2033-05-18T08:33:20 +01:00 -> 2033-05-18T08:33:20,

if indeed proposed so, then it would be in line with SQL spec and so in line with what a few other engines do.

@alamb
Copy link
Contributor Author

alamb commented Jul 15, 2024

An update here is that we think we have a workaround for our usecase that we have added to DataFusion (a to_local_time function): apache/datafusion#11401

I am not sure there is consensus that this change of behavior is desired (though it is not clear to me that there is consensus it would not be desired either)

Edit: It is perhaps worth noting that in the case of a timestamp on the daylight savings boundary, taking the local timestamp instead of UTC as we do currently loses information, and casting back to that timezone will yield an ambiguous timestamp error. I am not sure if this matters.

This worries me as it sounds like it means it would be impossible to make timestamps roundtripping in all cases

It isn't clear to me that "never error but doesn't round trip" to "round trips but sometimes errors" is an improvement in functionality

@Omega359
Copy link
Contributor

Postgresql has what I would call 'sane' behaviour in the face of ambiguous timestamps. While I think that PG's has some really odd behaviour with multiple 'at time zone's (see this comment) I think in this case the direction they went with is a solid defensible one. If arrow-rs was to emulate that behaviour then round trips wouldn't error but may possibly result in a different time.

Has anyone looked into what the other arrow implementations do to handle this ambiguous portion of the spec?

@devanbenz
Copy link

devanbenz commented Sep 12, 2024

In lieu of apache/datafusion#12218 I'm going to bring this thread back from the dead 👻.

The TL;DR is that we change nothing since we are in line with the current Apache Arrow expectations and datafusion functions similar to a different system that uses Arrow--Clickhouse it is consistent. This proposal is under the assumption that we desire to be completely in line with the current Apache Arrow specification though.

The problem and potential slated proposal

There currently is a misalignment with how casting from Timestamp('time/zone') -> Timestamp(None) functions. There is an expectation that this cast should effectively "retain" the timezone, for example:

2033-05-18T08:33:20+01:00 -> 2033-05-18T08:33:20

As of right now the arrow spec here has indicated that the current state of casting should be like so:

2033-05-18T08:33:20+01:00 -> 2033-05-18T07:33:20

Currently there is a proposal that when casting a ISO style timestamp with a timezone to the Timestamp data type it should not adjust the actual date time.

@tustvold shared this code snippet that outlines the expectation/proposal pretty well.

let a = StringArray::from_iter_values([
    "2033-05-18T08:33:20",
    "2033-05-18T08:33:20Z",
    "2033-05-18T08:33:20 +01:00",
]);

let no_timezone = cast(&a, &DataType::Timestamp(TimeUnit::Nanosecond, None)).unwrap();
let back = cast(&no_timezone, &DataType::Utf8).unwrap();
assert_eq!(
    back.as_string::<i32>(),
    &StringArray::from_iter_values([
        "2033-05-18T08:33:20",
        "2033-05-18T08:33:20",
        "2033-05-18T07:33:20" <---- this issue is proposing changing this to 2033-05-18T08:33:20
    ])
);

External research

Per @Omega359's comment

Has anyone looked into what the other arrow implementations do to handle this ambiguous portion of the spec?

Currently when running the following query in Clickhouse which uses arrow the outcome is similar to how arrow-rs is treating this cast (clickhouse appears to handle conversions within their cast using ::timestamp('timezone') syntax -- please check the issue I linked below for more details):

In clickhouse (arrow):

:) SELECT toTimeZone('2024-09-10 11:45:19'::timestamp, 'America/Denver');

   ┌─toTimeZone(CAST('2024-09-10 11:45:19', 'timestamp'), 'America/Denver')─┐
1. │                                                    2024-09-10 10:45:19 │
   └────────────────────────────────────────────────────────────────────────┘

:) SELECT toTimeZone('2024-09-10 11:45:19'::timestamp, 'America/Denver')::timestamp;

   ┌─CAST(toTimeZone(CAST('2024-09-10 11:45:19', 'timestamp'), 'America/Denver'), 'timestamp')─┐
1. │                                                                       2024-09-10 11:45:19 │
   └───────────────────────────────────────────────────────────────────────────────────────────┘

In datafusion (arrow-rs):

> select ('2024-09-10 11:45:19' AT TIME ZONE 'America/Denver');
+-----------------------------+
| Utf8("2024-09-10 11:45:19") |
+-----------------------------+
| 2024-09-10T11:45:19-06:00   |
+-----------------------------+
1 row(s) fetched.
Elapsed 0.048 seconds.

> select ('2024-09-10 11:45:19' AT TIME ZONE 'America/Denver')::timestamp;
+-----------------------------+
| Utf8("2024-09-10 11:45:19") |
+-----------------------------+
| 2024-09-10T17:45:19         |
+-----------------------------+
1 row(s) fetched.
Elapsed 0.006 seconds.

Note: Datafusion uses UTC where clickhouse appears to use local time.

When taking a look at Postgres I see the opposite functionality:

postgres=# SELECT ('2024-09-10 11:45:19' AT TIME ZONE 'America/Denver');
      timezone
---------------------
 2024-09-10 10:45:19
(1 row)

postgres=# SELECT ('2024-09-10 11:45:19' AT TIME ZONE 'America/Denver')::timestamp;
      timezone
---------------------
 2024-09-10 10:45:19
(1 row)

The timezone is effectively retained.


I filed a ticket within clickhouse to ask about this functionality: ClickHouse/ClickHouse#69512

I was met with the following comment:

To clarify the current behaviour: The types DateTime <> DateTime('America/Denver'). It's not stripping, it kinda cast to another type.

There are a few tickets I've seen scattered throughout Arrow's issues that sort of relate to this functionality (as far as I'm aware):

Most of them are closed with no indication that this will be adjusted.

Proposal

Even though I personally think that the way Postgres does this is "more correct" that is just a matter of opinion. For consistency sake it would be best to keep the functionality as is so it is in line with Arrow's spec. If someone were to be a consumer of a different product using arrow and migrating over/using a product that uses arrow-rs it would likely make the most sense to have the same functionality.

I've also added a proposal to the original issue that led me to this thread: apache/datafusion#12218

@findepi
Copy link
Member

findepi commented Sep 13, 2024

There currently is a misalignment with how casting from Timestamp('time/zone') -> Timestamp(None) functions. There is an expectation that this cast should effectively "retain" the timezone, for example:

2033-05-18T08:33:20+01:00 -> 2033-05-18T08:33:20

from SQL spec perspective. this is "the correct way" .

the way Postgres does this is "more correct"

BTW I've been heavily involved in timestamp-related work in the Trino project. Timestamps is one of very few cases where following PostgreSQL is not the right thing, because PostgreSQL misses real support for zones:

For timestamp with time zone, the internally stored value is always in UTC (Universal Coordinated Time, traditionally known as Greenwich Mean Time, GMT). An input value that has an explicit time zone specified is converted to UTC using the appropriate offset for that time zone.1

Footnotes

  1. https://www.postgresql.org/docs/current/datatype-datetime.html

@devanbenz
Copy link

devanbenz commented Sep 13, 2024

There currently is a misalignment with how casting from Timestamp('time/zone') -> Timestamp(None) functions. There is an expectation that this cast should effectively "retain" the timezone, for example:
2033-05-18T08:33:20+01:00 -> 2033-05-18T08:33:20

from SQL spec perspective. this is "the correct way" .

Ah yeah, the only reason I think that it's valid to effectively "do nothing" is that arrow-rs is currently in line with whats expected of arrow. If its the case that arrow-rs should diverge from arrow then I'm under the opinion that it should be adjusted to the SQL spec. I have no power to make a final decision --just my two cents :P

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants