Propose change timestamp casting with timezone to without timezones (also parsing of timestamps without timezones) #5827
Comments
This makes sense to me, the arrow specification doesn't specify either way but I think this would be less surprising |
It turns out I actually flagged this on the original PR that altered the timezone casting logic: #4201 (comment). The argument that convinced me in the end is that when parsing from a string we convert to UTC
But perhaps we want to change this also? |
I don't quite follow this example. Given It seems in your example that it is cast to Thus this seems like the same behavior and that it would be changed with the proposal. But I may be missing something |
Right so you're also advocating changing the parsing behaviour, i.e. casting from strings to timestamps, and not just between timestamps. I agree these should be kept consistent, hence why I raised it |
I posted this proposal to the dev list https://lists.apache.org/thread/b6hhsthy9pqhwmjjkox2lbt4qz9zvlvw and on slack/discord to raise awareness |
Thanks for raising this. I think part of the issue is that timestamps without a timezone specifically omit semantics that could be used to convert to/from a specific instant. Given these semantics are lost on conversion, I wouldn't expect the values to roundtrip unless some additional information is provided when going from tz-naive to tz-aware. When casting from tz-aware to tz-naive timestamps, absent any other information, it makes sense to me that physical values are unchanged and the tz is stripped. That's what it currently does. Going the other way leaves more room for interpretation, as pointed out by the docs. Currently casting the other way converts physical values to be UTC-normalized. Perhaps this behavior would be better reserved for a dedicated "conversion" function that takes some additional parameter(s) to specify semantics. Then casting could be treated as a much simpler operation that just adds/strips the specified tz without changing physical values. That would allow your example to roundtrip. |
in SQL specification
So in standard SQL, one cast is not exactly inverse of the other. |
I agree with @alamb 's proposal. I think the question is what the value
Personally, I think option 1 is the most rational choice. The Option 2 is strange. The If we pick option 1 then the only thing casting does is change how we want to interpret the value in cases where the time zone matters:
The value should never change when casting. |
I don't think this is under question, as this is specified by the arrow format specification - https://github.com/apache/arrow/blob/main/format/Schema.fbs#L276. The question instead concerns how casting between timestamps should behave where one or other lacks a timezone. The specification has the following to say about this:
/// However, if a Timestamp column has no timezone value, changing it to a
/// non-empty value requires to think about the desired semantics.
/// One possibility is to assume that the original timestamp values are
/// relative to the epoch of the timezone being set; timestamp values should
/// then adjusted to the Unix epoch (for example, changing the timezone from
/// empty to "Europe/Paris" would require converting the timestamp values
/// from "Europe/Paris" to "UTC", which seems counter-intuitive but is
/// nevertheless correct).
Which is consistent with what we implement:
let a = StringArray::from_iter_values([
    "2033-05-18T08:33:20",
    "2033-05-18T08:33:20Z",
    "2033-05-18T08:33:20 +01:00",
]);
// Cast strings to a timestamp without a timezone, then back to strings
let no_timezone = cast(&a, &DataType::Timestamp(TimeUnit::Nanosecond, None)).unwrap();
let back = cast(&no_timezone, &DataType::Utf8).unwrap();
assert_eq!(
    back.as_string::<i32>(),
    &StringArray::from_iter_values([
        "2033-05-18T08:33:20",
        "2033-05-18T08:33:20",
        "2033-05-18T07:33:20" // <-- this issue is proposing changing this to 2033-05-18T08:33:20
    ])
);
// Cast the same strings to a timestamp with a +01:00 timezone, then back to strings
let with_timezone = cast(
    &a,
    &DataType::Timestamp(TimeUnit::Nanosecond, Some("+01:00".into())),
)
.unwrap();
let back = cast(&with_timezone, &DataType::Utf8).unwrap();
assert_eq!(
    back.as_string::<i32>(),
    &StringArray::from_iter_values([
        "2033-05-18T08:33:20+01:00",
        "2033-05-18T09:33:20+01:00",
        "2033-05-18T08:33:20+01:00",
    ])
);
Where the specification is ambiguous is what to do when going in the reverse direction, i.e. from a timestamp with a timezone to one without a timezone. Currently we use the time in the UTC epoch. i.e. The proposal, as far as I understand it, is for
The arrow format explicitly calls out that the value should change.
Edit: It is perhaps worth noting that in the case of a timestamp on the daylight savings boundary, taking the local timestamp instead of UTC as we do currently loses information, and casting back to that timezone will yield an ambiguous timestamp error. I am not sure if this matters.
Edit edit: If we do opt to change as proposed, the old behaviour could be obtained by first converting to the UTC epoch. This would be a strictly metadata operation, as the values are already UTC when a timezone is present. The proposed behaviour is therefore strictly more expressive. |
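The "strictly more expressive" argument can be sketched with plain integers. This is a minimal model, not arrow-rs code: the `Ts` struct and all three function names are hypothetical, timezones are reduced to fixed offsets in seconds, and values are assumed to be in seconds.

```rust
/// A timestamp value. `value` is seconds since the Unix epoch (UTC) when
/// `offset` is `Some`, and wall-clock seconds when `offset` is `None`.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Ts {
    value: i64,
    offset: Option<i64>, // fixed offset east of UTC, in seconds
}

/// Metadata-only timezone change: the stored UTC value is untouched.
fn with_zone(ts: Ts, offset: i64) -> Ts {
    assert!(ts.offset.is_some(), "only valid for timezone-aware values");
    Ts { value: ts.value, offset: Some(offset) }
}

/// Proposed cast to "no timezone": keep the local wall-clock reading.
fn strip_zone_proposed(ts: Ts) -> Ts {
    let offset = ts.offset.expect("timezone-aware input");
    Ts { value: ts.value + offset, offset: None }
}

/// Current cast to "no timezone" (keep the UTC value), expressed as
/// "retag to UTC, then apply the proposed cast".
fn strip_zone_current(ts: Ts) -> Ts {
    strip_zone_proposed(with_zone(ts, 0))
}
```

For a value of 2_000_014_400 tagged with a +01:00 offset (3_600 seconds), `strip_zone_proposed` yields 2_000_018_000 while `strip_zone_current` yields 2_000_014_400, so under these assumptions the current behaviour stays reachable after the change.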
if indeed proposed so, then it would be in line with SQL spec and so in line with what a few other engines do. |
An update here is that we think we have a workaround for our usecase that we have added to DataFusion (a I am not sure there is consensus that this change of behavior is desired (though it is not clear to me that there is consensus it would not be desired either)
This worries me as it sounds like it means it would be impossible to make timestamps roundtrip in all cases. It isn't clear to me that going from "never errors but doesn't round trip" to "round trips but sometimes errors" is an improvement in functionality |
PostgreSQL has what I would call 'sane' behaviour in the face of ambiguous timestamps. While I think that PG has some really odd behaviour with multiple 'at time zone's (see this comment), I think in this case the direction they went with is a solid, defensible one. If arrow-rs were to emulate that behaviour then round trips wouldn't error but may possibly result in a different time. Has anyone looked into what the other arrow implementations do to handle this ambiguous portion of the spec? |
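To make the ambiguity concrete, here is a toy model of a daylight-saving "fall back" transition. Everything in it is an assumption for illustration: the zone, the constants and the function names are hypothetical, and it does not reflect how arrow-rs or PostgreSQL resolve such cases.

```rust
/// Toy zone (hypothetical): the offset is +7200 s before the UTC instant
/// `SWITCH_UTC` and +3600 s from then on, i.e. clocks fall back one hour.
const SWITCH_UTC: i64 = 100_000;
const OFFSET_BEFORE: i64 = 7_200;
const OFFSET_AFTER: i64 = 3_600;

/// UTC instant -> local wall-clock reading. Always well defined.
fn utc_to_local(utc: i64) -> i64 {
    if utc < SWITCH_UTC {
        utc + OFFSET_BEFORE
    } else {
        utc + OFFSET_AFTER
    }
}

/// Local wall-clock reading -> every UTC instant that displays as it.
/// Two results mean the local time is ambiguous.
fn local_to_utc_candidates(local: i64) -> Vec<i64> {
    let mut candidates = Vec::new();
    for offset in [OFFSET_BEFORE, OFFSET_AFTER] {
        let utc = local - offset;
        // Keep the candidate only if that offset was really in effect then.
        if utc_to_local(utc) == local {
            candidates.push(utc);
        }
    }
    candidates
}
```

In this model the UTC instants 98_200 and 101_800 both display as local time 105_400, so once the timezone is dropped the original instant cannot be recovered without a tie-breaking rule. An implementation either reports an error or picks one candidate, which is the "errors" versus "may result in a different time" trade-off discussed above.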
In light of apache/datafusion#12218 I'm going to bring this thread back from the dead 👻. The TL;DR is that we change nothing, since we are in line with the current Apache Arrow expectations and DataFusion behaves consistently with ClickHouse, another system that uses Arrow. This proposal assumes that we want to be completely in line with the current Apache Arrow specification, though.
The problem and potential slated proposal
There currently is a misalignment with how casting from
As of right now the arrow spec here has indicated that the current state of casting should be like so:
Currently there is a proposal that when casting an ISO style timestamp with a timezone to the @tustvold shared this code snippet that outlines the expectation/proposal pretty well.
External research
Per @Omega359's comment
Currently when running the following query in In clickhouse (arrow):
In datafusion (arrow-rs):
Note: Datafusion uses UTC where clickhouse appears to use local time. When taking a look at Postgres I see the opposite functionality:
The timezone is effectively retained. I filed a ticket within clickhouse to ask about this functionality: ClickHouse/ClickHouse#69512 I was met with the following comment:
There are a few tickets I've seen scattered throughout Arrow's issues that sort of relate to this functionality (as far as I'm aware):
Most of them are closed with no indication that this will be adjusted.
Proposal
Even though I personally think that the way Postgres does this is "more correct", that is just a matter of opinion. For consistency's sake it would be best to keep the functionality as is so it is in line with Arrow's spec. If someone using a different Arrow-based product were to migrate to a product that uses arrow-rs, it would make the most sense for both to behave the same way. I've also added a proposal to the original issue that led me to this thread: apache/datafusion#12218 |
From the SQL spec perspective, this is "the correct way".
BTW I've been heavily involved in timestamp-related work in the Trino project. Timestamps are one of very few cases where following PostgreSQL is not the right thing, because PostgreSQL lacks real support for zones: For timestamp with time zone, the internally stored value is always in UTC (Universal Coordinated Time, traditionally known as Greenwich Mean Time, GMT). An input value that has an explicit time zone specified is converted to UTC using the appropriate offset for that time zone. |
Ah yeah, the only reason I think that it's valid to effectively "do nothing" is that arrow-rs is currently in line with what's expected of arrow. If it's the case that arrow-rs should diverge from arrow then I'm of the opinion that it should be adjusted to the SQL spec. I have no power to make a final decision, just my two cents :P |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This is in the context of implementing date_bin for timestamps with timezones: apache/datafusion#10602
I made #5826 to document the behavior of casting timestamps and I found it very confusing. Specifically, when you cast from Timestamp(None) to Timestamp(Some(tz)) and then back to Timestamp(None), the underlying timestamp values are changed as shown in this example
Thus I wanted to discuss if we should change the behavior to make it less surprising or if there was a reason to leave the current behavior
Describe the solution you'd like
I propose making casting timestamp with a timezone to timestamp without a timezone do the inverse of casting timestamp without a timezone to timestamp with a timezone.
This would mean the final value of d in the above example is 2_000_000_000, not 2_000_018_000
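The round trip can be sketched with plain integer arithmetic. This is a minimal model rather than arrow-rs code: the function names are hypothetical, the values are assumed to be in seconds, and the -05:00 offset is an assumption chosen because it matches the 18_000 difference between the two values quoted above.

```rust
/// Assumed fixed offset of -05:00, in seconds east of UTC.
const OFFSET_SECS: i64 = -5 * 3600;

/// Timestamp(None) -> Timestamp(Some(tz)): treat the stored value as local
/// wall-clock time in `tz` and shift it to the UTC epoch.
fn naive_to_tz(local: i64, offset_secs: i64) -> i64 {
    local - offset_secs
}

/// Current Timestamp(Some(tz)) -> Timestamp(None): drop the timezone and
/// keep the UTC value unchanged.
fn tz_to_naive_current(utc: i64, _offset_secs: i64) -> i64 {
    utc
}

/// Proposed Timestamp(Some(tz)) -> Timestamp(None): return the local
/// wall-clock value, which is the inverse of `naive_to_tz`.
fn tz_to_naive_proposed(utc: i64, offset_secs: i64) -> i64 {
    utc + offset_secs
}
```

Starting from 2_000_000_000, `naive_to_tz` gives 2_000_018_000. The current cast back leaves that value as is, while the proposed cast returns 2_000_000_000, so the round trip is lossless for a fixed offset.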
Describe alternatives you've considered
Leave existing behavior
Additional context