Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cast Kernel Ignores Timezone #1936

Closed
Tracked by #3148
tustvold opened this issue Jun 24, 2022 · 16 comments · Fixed by #4034 or #4201
Closed
Tracked by #3148

Cast Kernel Ignores Timezone #1936

tustvold opened this issue Jun 24, 2022 · 16 comments · Fixed by #4034 or #4201
Assignees
Labels

Comments

@tustvold
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The beginnings of timezone support were added in #824, however, this is currently ignored by the cast kernel

Describe the solution you'd like

Timezones should be correctly handled by the cast kernel

Describe alternatives you've considered

We could not support timezones

Additional context

Noticed whilst investigating #1932

@doki23
Copy link
Contributor

doki23 commented Sep 18, 2022

I'd like to work on this. And I think fmt of timestamp array cannot ignore timezone too.
Could you assign it to me? @tustvold

@tustvold tustvold self-assigned this Sep 18, 2022
@tustvold
Copy link
Contributor Author

I would recommend writing up the expected behaviour first, as timezone handling is notoriously messy, and once we have consensus we can move forward with implementing that.

FYI @avantgardnerio @waitingkuo

@waitingkuo
Copy link
Contributor

waitingkuo commented Sep 19, 2022

@tustvold thank you for pinging me, i'm working on these things now as well.

@doki23 it would be great if you could help ❤️
You could find some other related issues here apache/datafusion#3148 (search timezone)

some hints that might help

1

(Timestamp(_, _), Int64) => cast_array_data::<Int64Type>(array, to_type.clone()),

to make casting function consider timezone, we have to fix the second _ identifier and check whether it's None or Some

2

pub(crate) fn as_datetime<T: ArrowPrimitiveType>(v: i64) -> Option<NaiveDateTime> {

while using fmt to print, we first convert it to NaiveDateTime (from chrono-rs) which contains no timezone info so that you could only see timestamp without timezone

@waitingkuo
Copy link
Contributor

@doki23 are you planning to draft the proposal?

@doki23
Copy link
Contributor

doki23 commented Sep 21, 2022

@doki23 are you planning to draft the proposal?

sure, plz wait me hours

@doki23
Copy link
Contributor

doki23 commented Sep 21, 2022

We consider tz only if from_type and to_type both needs it. For example, ignoring tz is ok when we cast ts to i64, because i64 array doesn't care about timezone.

So, there're 2 situations:

  1. ts is from_type, and the to_type needs tz.
  2. ts is to_type, and the from_type has tz.

I noticed that timestamp array is always treat as PrimitiveArray. We can't get tz from ArrowPrimitiveType::DATA_TYPE, because there're only utc ts definitions like:

make_type!(
    TimestampSecondType,
    i64,
    DataType::Timestamp(TimeUnit::Second, None)
);

So, we may need a TimestampBuilder in place of TimestampSecondArray::from_xxx which can realize the timezone. And it has a valid ArrayData::data_type.

@waitingkuo
Copy link
Contributor

what's expected behavior for casting timestamp with timezone to timestamp without time zone?

e.g. if the timestamp with timezone is 1970-01-01T08:00:00+08:00 (note that the timestamp underline is 0), what's the casted result? 1970-01-01T08:00:00 (underline timestamp as 28800000000000) or 1970-01-01T00:00:00 (underline timestamp as 0)?

i recommend that listing these ambiguous cases and your proposed behavior so we could discuss

btw i tested it on pyarrow, it simply changes the datatype but not change the underline timestamp (1970-01-01T08:00:00+08:00 becomes 1970-01-01T00:00:00

@avantgardnerio
Copy link
Contributor

what's expected behavior for casting timestamp with timezone to timestamp without time zone?

The intuitive answer would be to convert it to UTC. I think postgres effectively casts it to server (local) time.

@tustvold
Copy link
Contributor Author

I believe this section of the arrow schema definition is relevant - https://github.com/apache/arrow-rs/blob/master/format/Schema.fbs#L280

In particular

One possibility is to assume that the original timestamp values are relative to the epoch of the timezone being set; timestamp values should then adjusted to the Unix epoch

Given this is the only possibility enumerated in the schema, I feel this is probably the one we should follow unless people feel strongly otherwise. My 2 cents is that anything relying on the local timezone of the system is best avoided if at all possible, it just feels fragile and error-prone.

@avantgardnerio
Copy link
Contributor

local timezone of the system

Yes - these RecordBatches could be part of Flights, yes? In which case the whole point is to send them around to different computers that may be in different timezones, so it kind of forces our hand here.

And if we are doing it this way in arrow where we don't have the luxury of following postgres, maybe this is also where we break postgres compatibility in DataFusion. Just because postgres did it wrong doesn't mean we should follow...

@doki23
Copy link
Contributor

doki23 commented Sep 22, 2022

what's expected behavior for casting timestamp with timezone to timestamp without time zone?

Tz has no affect to the value of timestamp, it's just used for display.

@tustvold
Copy link
Contributor Author

what's expected behavior for casting timestamp with timezone to timestamp without time zone?

The specification states

/// However, if a Timestamp column has no timezone value, changing it to a
/// non-empty value requires to think about the desired semantics.
/// One possibility is to assume that the original timestamp values are
/// relative to the epoch of the timezone being set; timestamp values should
/// then adjusted to the Unix epoch (for example, changing the timezone from
/// empty to "Europe/Paris" would require converting the timestamp values
/// from "Europe/Paris" to "UTC", which seems counter-intuitive but is
/// nevertheless correct).

As stated above, given this is the only possibility enumerated I think we should follow this. The inverse operation, i.e. removing a timezone, I would therefore expect to do the reverse i.e. 1970-01-01 01:00:01 +01:00 would become 1970-01-01 01:00:01. This is consistent with both postgres and chrono.

@waitingkuo
Copy link
Contributor

thank you @tustvold , didn't aware this spec before
This looks good to me

btw, i think with_timezone_opt already covered another possibility - keep the underline i64 value and change the metadata

@tustvold
Copy link
Contributor Author

Yeah let's do the "hard" operation in the cast kernel, and if people don't like it, they can perform a metadata-only cast usingwith_timezone_opt 👍

@tustvold
Copy link
Contributor Author

tustvold commented Mar 2, 2023

#3794 is related to this, and implements the necessary machinery for timezone aware parsing of strings

@tustvold
Copy link
Contributor Author

Reopening as there are still issues around casting between timestamps with timezones #4201

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants