-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify type signatures using TypeSignatureClass
for mixed type function signature
#13372
Conversation
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
|
||
#[derive(Debug, Clone, Eq, PartialOrd)] | ||
pub enum TypeSignatureClass { | ||
Timestamp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may need to treat timestamp and timestamp with zone separately 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless there is function that treat timestamp differently based on timezone, otherwise it is simpler to treat them equivalently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think some functions currently define that they take a UTC timestamp or ANY timestamp via https://docs.rs/datafusion/latest/datafusion/logical_expr/constant.TIMEZONE_WILDCARD.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also feel like we should have another variant for timestamp with time zone
. IMO, they are different types.
Timestamp(timeunit, None)
fortimestamp without time zone
.Timestamp(timeunit, Some(TIMEZONE_WILDCARD)
fortimestamp with time zone
.
They can't interact with each other before applying some casting or coercion.
For example,
datafusion/datafusion/functions/src/datetime/date_bin.rs
Lines 59 to 68 in 91670e2
Exact(vec![ | |
DataType::Interval(MonthDayNano), | |
Timestamp(array_type, None), | |
Timestamp(Nanosecond, None), | |
]), | |
Exact(vec![ | |
DataType::Interval(MonthDayNano), | |
Timestamp(array_type, Some(TIMEZONE_WILDCARD.into())), | |
Timestamp(Nanosecond, Some(TIMEZONE_WILDCARD.into())), | |
]), |
The date_bin
function accepts two Timestamp
arguments. However, if we try to simplify here, we may write something like
TypeSignature::Coercible(vec![
TypeSignatureClass::Interval,
TypeSignatureClass::Timstamp,
TypeSignatureClass::Timstamp,
]),
It means we can accept the SQL like date_bin(INTERVAL 1 HOUR, timestamp_without_timezone_col, timestamp_with_timezone_col)
but I guess it's not correct. (no match the original signature)
If we have a class for timestamp with time zone
, we can write
TypeSianture::one_of([
TypeSignature::Coercible(vec![
TypeSignatureClass::Interval,
TypeSignatureClass::Timstamp,
TypeSignatureClass::Timstamp,
]),
TypeSignature::Coercible(vec![
TypeSignatureClass::Interval,
TypeSignatureClass::Timstamp_with_time_zone,
TypeSignatureClass::Timstamp_with_time_zone,
]),
])
It's more close to the original signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about managing timestamps within the invoke function? Handling specific cases like Timestamp_with_time_zone or TIMEZONE_WILDCARD adds unnecessary complexity to the function's signature without providing much benefit. Instead, why not define a high-level Timestamp in the signature and handle the finer details elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can refer to the behavior of Postgres
or DuckDB
. I listed the signature of the timestamp function age
from their system catalogs for reference.
Postgres
test=# SELECT proname, proargtypes FROM pg_proc WHERE proname = 'age';
proname | proargtypes
---------+-------------
age | 28
age | 1114
age | 1184
age | 1114 1114
age | 1184 1184
(5 rows)
test=# SELECT oid, typname FROM pg_type WHERE oid IN (28, 1184, 1114);
oid | typname
------+-------------
28 | xid
1114 | timestamp
1184 | timestamptz
(3 rows)
DuckDB
D SELECT function_name, parameter_types
FROM duckdb_functions()
WHERE function_name = 'age';
┌───────────────┬──────────────────────────────────────────────────────┐
│ function_name │ parameter_types │
│ varchar │ varchar[] │
├───────────────┼──────────────────────────────────────────────────────┤
│ age │ [TIMESTAMP] │
│ age │ [TIMESTAMP, TIMESTAMP] │
│ age │ [TIMESTAMP WITH TIME ZONE, TIMESTAMP WITH TIME ZONE] │
│ age │ [TIMESTAMP WITH TIME ZONE] │
└───────────────┴──────────────────────────────────────────────────────┘
If you try to input an unsupported type value, you may encounter an error like the following:
D SELECT age('2001-01-01 18:00:00');
Binder Error: Could not choose a best candidate function for the function call "age(STRING_LITERAL)". In order to select one, please add explicit type casts.
Candidate functions:
age(TIMESTAMP WITH TIME ZONE) -> INTERVAL
age(TIMESTAMP) -> INTERVAL
LINE 1: SELECT age('2001-01-01 18:00:00');
Both systems treat TIMESTAMP
and TIMESTAMP WITH TIME ZONE
as distinct types in the high level.
The advantage of separating these types is that it allows for stricter type-checking when matching a function's signature. This reduces the likelihood of developers failing to correctly handle type checks when implementing timestamp functions.
Honestly, I'm not sure how complex it would become if we separated them 🤔 . If it requires significant effort, I'm fine with keeping the current design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both systems treat
TIMESTAMP
andTIMESTAMP WITH TIME ZONE
as distinct types in the high level.
that's a good point, they are quite different: only the latter denotes point in time. the former denotes "wall date/time" with no zone information, so does not denote any particular point in time.
their similarity is deceptive and source of many bugs
This reduces the likelihood of developers failing to correctly handle type checks when implementing timestamp functions.
the timestamp and timestamp tz values are different arrow types, so the function implementor needs to handle them separately anyway
The point is, some functions will be applicable to one of these types but not the other.
for example, a (hypothetical) to_unix_timestamp(timestamp_tz) -> Int64
function should operate on point in time value, so it should accept the timestamp_tz.
Note that in SQL, timestamp is coercible to timestamp_tz, so such function is still going to be callable with timestamp value, but that's not something function implementor should be concerned about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My solution so far is that we use TypeSignatureClass::Timestamp
if we don't care about it has timezone or not. Fallback to TypeSignatureClass::Native()
if we need to tell the difference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My solution so far is that we use TypeSignature::Timestamp if we don't care about it has timezone or not. Fallback to
TypeSignature::Native()
if we need to tell the difference
Agreed. Handling Timestamps using TypeSignatureClass::Native
is a good idea. I still think we should treat them as different types.
@@ -112,8 +114,9 @@ pub enum TypeSignature { | |||
/// For example, `Coercible(vec![logical_float64()])` accepts | |||
/// arguments like `vec![DataType::Int32]` or `vec![DataType::Float32]` | |||
/// since i32 and f32 can be casted to f64 | |||
Coercible(Vec<LogicalTypeRef>), | |||
/// Fixed number of arguments of arbitrary types, number should be larger than 0 | |||
Coercible(Vec<TypeSignatureClass>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
---- | ||
12.12345678 | ||
|
||
query R | ||
SELECT date_part('millisecond', '2020-09-08T12:00:12.12345678+00:00') | ||
SELECT date_part('millisecond', timestamp '2020-09-08T12:00:12.12345678+00:00') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change existing test queries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because the valid type should be timestamp but not string.
This is the result in Postgres
postgres=# SELECT date_part('millisecond', timestamp '2020-09-08T12:00:12.12345678+00:00');
date_part
-----------
12123.457
(1 row)
postgres=# SELECT date_part('millisecond', '2020-09-08T12:00:12.12345678+00:00');
ERROR: function date_part(unknown, unknown) is not unique
LINE 1: SELECT date_part('millisecond', '2020-09-08T12:00:12.1234567...
^
HINT: Could not choose a best candidate function. You might need to add explicit type casts.
|
||
impl Display for TypeSignatureClass { | ||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { | ||
write!(f, "TypeSignatureClass::{self:?}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Display looks more verbose than Debug. Typically it's the other way around.
the produced err msg looks a bit longish, but i don't know how to make it more readabile. thoughts?
Internal error: Expect TypeSignatureClass::Native\(LogicalType\(Native\(Int64\), Int64\)\) but received Float64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it would be great to have a special type signature class display
Maybe something like (any int)
or Integer
or (any timestamp)
for Timestamp 🤔
@jayzhan211 is this one ready for review by a committer ? I am having trouble keeping track of everything that is going on and am trying to prioritize reviews where it would be most helpful. If you have any suggestions, please let me know |
Yes, this is
I think we should. I will ping directly if I think it should be prioritized |
- Added logic to prevent unnecessary casting of string types in `native.rs`. - Introduced `Comparable` variant in `TypeSignature` to define coercion rules for comparisons. - Updated imports in `functions.rs` and `signature.rs` for better organization. - Modified `date_part.rs` to improve handling of timestamp extraction and fixed query tests in `expr.slt`. - Added `datafusion-macros` dependency in `Cargo.toml` and `Cargo.lock`. These changes improve type handling and ensure more accurate function behavior in SQL expressions.
Exact(vec![ | ||
Utf8, | ||
Timestamp(Nanosecond, Some(TIMEZONE_WILDCARD.into())), | ||
TypeSignature::Coercible(vec![ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplified signature is the main goal of this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jayzhan211 -- this is looking quite good. I am worried about the regression of implicit casting from string to timestamps as well as the fact that I think this may not work for timestamps without timezones (seems like there is a gap in testing for datepart -- I will make a PR to add some more tests)
|
||
#[derive(Debug, Clone, Eq, PartialOrd)] | ||
pub enum TypeSignatureClass { | ||
Timestamp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think some functions currently define that they take a UTC timestamp or ANY timestamp via https://docs.rs/datafusion/latest/datafusion/logical_expr/constant.TIMEZONE_WILDCARD.html
|
||
impl Display for TypeSignatureClass { | ||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { | ||
write!(f, "TypeSignatureClass::{self:?}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it would be great to have a special type signature class display
Maybe something like (any int)
or Integer
or (any timestamp)
for Timestamp 🤔
.map(|logical_type| match logical_type { | ||
TypeSignatureClass::Native(l) => get_data_types(l.native()), | ||
TypeSignatureClass::Timestamp => { | ||
vec![DataType::Timestamp(TimeUnit::Nanosecond, None)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should use WILDCARD to match any timestamp (not just timestamps without timezones)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually used for displaying the signature information for error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to provide both timestamp with time zone
and timestamp without time zone
here. It's used to provide the possible type for the information_schema routine and parameters table.
vec![DataType::Timestamp(TimeUnit::Nanosecond, None), DataType::Timestamp(TimeUnit::Nanosecond, Some(TIMEZONE_WILDCARD.into()))]
@@ -245,6 +245,8 @@ impl LogicalType for NativeType { | |||
(Self::FixedSizeBinary(size), _) => FixedSizeBinary(*size), | |||
(Self::String, LargeBinary) => LargeUtf8, | |||
(Self::String, BinaryView) => Utf8View, | |||
// We don't cast to another kind of string type if the origin one is already a string type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
|
||
query I | ||
# Second argument should not be string, failed in postgres too. | ||
query error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this feels like a regression to me (even if it fails in postgres). I think automatically coercing to a string is important for our users in InfluxDB.
…trings. Update SQL logic tests to reflect changes in timestamp handling, including expected outputs for queries involving nanoseconds and seconds.
TypeSignatureClass::Timestamp if logical_type == NativeType::String => { | ||
Ok(DataType::Timestamp(TimeUnit::Nanosecond, None)) | ||
} | ||
TypeSignatureClass::Timestamp if logical_type.is_timestamp() => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timestamp is matched whatever the timezone it has
…d maintainability. Update the `TypeSignatureClass` documentation to clarify its purpose in function signatures, particularly regarding coercible types. This change enhances the handling of implicit casting from strings to timestamps.
…ctions. Updated expected outputs for `date_part` and `extract` functions to reflect proper behavior with nanoseconds and seconds. This change improves the accuracy of test cases in the `expr.slt` file.
…fication. Updated the logic to include an additional DataType for timestamps with a timezone wildcard, improving flexibility in timestamp operations.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jayzhan211 -- I know this has been a long process. Sorry for the slower pace but I think as DataFusion attempts to be more stable, more attention is warranted for potential changes.
Other than the string coercion (I left a question) this PR looks good to me given your comment on #13732 (comment)
❤️ Thanks again
query I | ||
SELECT date_part('second', '2020-09-08T12:00:12.12345678+00:00') | ||
---- | ||
12 | ||
|
||
query I | ||
SELECT date_part('millisecond', '2020-09-08T12:00:12.12345678+00:00') | ||
SELECT date_part('second', timestamp '2020-09-08T12:00:12.12345678+00:00') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these test changes still needed?
From my perspective as long as this PR doesn't change the ability to use a string as an argument to a function that requires a timestamp, it is a good change to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't merge the code from the main
I'm also thinking about the comment above about whether to differentiate timestamp timezone or not. Given that we already have NativeType::Timestamp to differentiate with tz or without cases, I still don't think it is a good idea to add another abstraction in TypeSignatureClass. For date_bin that takes 2 args, I think using |
… not_impl_err import for better error handling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love it -- thank you @jayzhan211 for pushing this forward and thank you for putting up with the long review cycle
@@ -560,7 +560,7 @@ select repeat('-1.2', arrow_cast(3, 'Int32')); | |||
---- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lack of test changes look great to me ❤️
TypeSignatureClass
for mixed type function signatureTypeSignatureClass
for mixed type function signature
Thanks @alamb @goldmedal @findepi |
Which issue does this PR close?
Part of #13301
Closes #.
Rationale for this change
Some functions require mixed type signature such as (string, integer) or (string, timestamp).
Time related type is added in this PR as the first step.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?