-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify type signatures using TypeSignatureClass
for mixed type function signature
#13372
Changes from 20 commits
2ea2c27
664edaa
fc99216
2eac9f8
dd3fb7f
e114c86
f04aed5
3b8030c
6b1e08a
afa23df
1b2a3fd
45d417f
1e43c90
afe48d1
e666e0d
13fb7ed
f141e89
e89520e
7920421
4a7404d
7400429
3647bb5
830b2d5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Original file line number | Diff line number | Diff line change | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -18,8 +18,10 @@ | |||||||||||||||||||||
//! Signature module contains foundational types that are used to represent signatures, types, | ||||||||||||||||||||||
//! and return types of functions in DataFusion. | ||||||||||||||||||||||
|
||||||||||||||||||||||
use std::fmt::Display; | ||||||||||||||||||||||
|
||||||||||||||||||||||
use crate::type_coercion::aggregates::NUMERICS; | ||||||||||||||||||||||
use arrow::datatypes::DataType; | ||||||||||||||||||||||
use arrow::datatypes::{DataType, IntervalUnit, TimeUnit}; | ||||||||||||||||||||||
use datafusion_common::types::{LogicalTypeRef, NativeType}; | ||||||||||||||||||||||
use itertools::Itertools; | ||||||||||||||||||||||
|
||||||||||||||||||||||
|
@@ -112,7 +114,7 @@ pub enum TypeSignature { | |||||||||||||||||||||
/// For example, `Coercible(vec![logical_float64()])` accepts | ||||||||||||||||||||||
/// arguments like `vec![DataType::Int32]` or `vec![DataType::Float32]` | ||||||||||||||||||||||
/// since i32 and f32 can be casted to f64 | ||||||||||||||||||||||
Coercible(Vec<LogicalTypeRef>), | ||||||||||||||||||||||
Coercible(Vec<TypeSignatureClass>), | ||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nice! |
||||||||||||||||||||||
/// The arguments will be coerced to a single type based on the comparison rules. | ||||||||||||||||||||||
/// For example, i32 and i64 has coerced type Int64. | ||||||||||||||||||||||
/// | ||||||||||||||||||||||
|
@@ -154,6 +156,33 @@ impl TypeSignature { | |||||||||||||||||||||
} | ||||||||||||||||||||||
} | ||||||||||||||||||||||
|
||||||||||||||||||||||
/// Represents the class of types that can be used in a function signature. | ||||||||||||||||||||||
/// | ||||||||||||||||||||||
/// This is used to specify what types are valid for function arguments in a more flexible way than | ||||||||||||||||||||||
/// just listing specific DataTypes. For example, TypeSignatureClass::Timestamp matches any timestamp | ||||||||||||||||||||||
/// type regardless of timezone or precision. | ||||||||||||||||||||||
/// | ||||||||||||||||||||||
/// Used primarily with TypeSignature::Coercible to define function signatures that can accept | ||||||||||||||||||||||
/// arguments that can be coerced to a particular class of types. | ||||||||||||||||||||||
#[derive(Debug, Clone, Eq, PartialEq, PartialOrd, Hash)] | ||||||||||||||||||||||
pub enum TypeSignatureClass { | ||||||||||||||||||||||
alamb marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||||
Timestamp, | ||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We may need to treat timestamp and timestamp with zone separately 🤔 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Unless there is function that treat timestamp differently based on timezone, otherwise it is simpler to treat them equivalently. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think some functions currently define that they take a UTC timestamp or ANY timestamp via https://docs.rs/datafusion/latest/datafusion/logical_expr/constant.TIMEZONE_WILDCARD.html There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I also feel like we should have another variant for
They can't interact with each other before applying some casting or coercion. For example, datafusion/datafusion/functions/src/datetime/date_bin.rs Lines 59 to 68 in 91670e2
The
It means we can accept the SQL like If we have a class for TypeSianture::one_of([
TypeSignature::Coercible(vec![
TypeSignatureClass::Interval,
TypeSignatureClass::Timstamp,
TypeSignatureClass::Timstamp,
]),
TypeSignature::Coercible(vec![
TypeSignatureClass::Interval,
TypeSignatureClass::Timstamp_with_time_zone,
TypeSignatureClass::Timstamp_with_time_zone,
]),
]) It's more close to the original signature. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How about managing timestamps within the invoke function? Handling specific cases like Timestamp_with_time_zone or TIMEZONE_WILDCARD adds unnecessary complexity to the function's signature without providing much benefit. Instead, why not define a high-level Timestamp in the signature and handle the finer details elsewhere? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe we can refer to the behavior of Postgrestest=# SELECT proname, proargtypes FROM pg_proc WHERE proname = 'age';
proname | proargtypes
---------+-------------
age | 28
age | 1114
age | 1184
age | 1114 1114
age | 1184 1184
(5 rows)
test=# SELECT oid, typname FROM pg_type WHERE oid IN (28, 1184, 1114);
oid | typname
------+-------------
28 | xid
1114 | timestamp
1184 | timestamptz
(3 rows) DuckDBD SELECT function_name, parameter_types
FROM duckdb_functions()
WHERE function_name = 'age';
┌───────────────┬──────────────────────────────────────────────────────┐
│ function_name │ parameter_types │
│ varchar │ varchar[] │
├───────────────┼──────────────────────────────────────────────────────┤
│ age │ [TIMESTAMP] │
│ age │ [TIMESTAMP, TIMESTAMP] │
│ age │ [TIMESTAMP WITH TIME ZONE, TIMESTAMP WITH TIME ZONE] │
│ age │ [TIMESTAMP WITH TIME ZONE] │
└───────────────┴──────────────────────────────────────────────────────┘ If you try to input an unsupported type value, you may encounter an error like the following: D SELECT age('2001-01-01 18:00:00');
Binder Error: Could not choose a best candidate function for the function call "age(STRING_LITERAL)". In order to select one, please add explicit type casts.
Candidate functions:
age(TIMESTAMP WITH TIME ZONE) -> INTERVAL
age(TIMESTAMP) -> INTERVAL
LINE 1: SELECT age('2001-01-01 18:00:00'); Both systems treat The advantage of separating these types is that it allows for stricter type-checking when matching a function's signature. This reduces the likelihood of developers failing to correctly handle type checks when implementing timestamp functions. Honestly, I'm not sure how complex it would become if we separated them 🤔 . If it requires significant effort, I'm fine with keeping the current design. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
that's a good point, they are quite different: only the latter denotes point in time. the former denotes "wall date/time" with no zone information, so does not denote any particular point in time. their similarity is deceptive and source of many bugs
the timestamp and timestamp tz values are different arrow types, so the function implementor needs to handle them separately anyway The point is, some functions will be applicable to one of these types but not the other. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My solution so far is that we use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Agreed. Handling Timestamps using |
||||||||||||||||||||||
Date, | ||||||||||||||||||||||
Time, | ||||||||||||||||||||||
Interval, | ||||||||||||||||||||||
Duration, | ||||||||||||||||||||||
Native(LogicalTypeRef), | ||||||||||||||||||||||
// TODO: | ||||||||||||||||||||||
// Numeric | ||||||||||||||||||||||
// Integer | ||||||||||||||||||||||
} | ||||||||||||||||||||||
|
||||||||||||||||||||||
impl Display for TypeSignatureClass { | ||||||||||||||||||||||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { | ||||||||||||||||||||||
write!(f, "TypeSignatureClass::{self:?}") | ||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Display looks more verbose than Debug. Typically it's the other way around. the produced err msg looks a bit longish, but i don't know how to make it more readabile. thoughts?
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree it would be great to have a special type signature class display Maybe something like |
||||||||||||||||||||||
} | ||||||||||||||||||||||
} | ||||||||||||||||||||||
|
||||||||||||||||||||||
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Hash)] | ||||||||||||||||||||||
pub enum ArrayFunctionSignature { | ||||||||||||||||||||||
/// Specialized Signature for ArrayAppend and similar functions | ||||||||||||||||||||||
|
@@ -180,7 +209,7 @@ pub enum ArrayFunctionSignature { | |||||||||||||||||||||
MapArray, | ||||||||||||||||||||||
} | ||||||||||||||||||||||
|
||||||||||||||||||||||
impl std::fmt::Display for ArrayFunctionSignature { | ||||||||||||||||||||||
impl Display for ArrayFunctionSignature { | ||||||||||||||||||||||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { | ||||||||||||||||||||||
match self { | ||||||||||||||||||||||
ArrayFunctionSignature::ArrayAndElement => { | ||||||||||||||||||||||
|
@@ -255,7 +284,7 @@ impl TypeSignature { | |||||||||||||||||||||
} | ||||||||||||||||||||||
|
||||||||||||||||||||||
/// Helper function to join types with specified delimiter. | ||||||||||||||||||||||
pub fn join_types<T: std::fmt::Display>(types: &[T], delimiter: &str) -> String { | ||||||||||||||||||||||
pub fn join_types<T: Display>(types: &[T], delimiter: &str) -> String { | ||||||||||||||||||||||
types | ||||||||||||||||||||||
.iter() | ||||||||||||||||||||||
.map(|t| t.to_string()) | ||||||||||||||||||||||
|
@@ -290,7 +319,30 @@ impl TypeSignature { | |||||||||||||||||||||
.collect(), | ||||||||||||||||||||||
TypeSignature::Coercible(types) => types | ||||||||||||||||||||||
.iter() | ||||||||||||||||||||||
.map(|logical_type| get_data_types(logical_type.native())) | ||||||||||||||||||||||
.map(|logical_type| match logical_type { | ||||||||||||||||||||||
TypeSignatureClass::Native(l) => get_data_types(l.native()), | ||||||||||||||||||||||
TypeSignatureClass::Timestamp => { | ||||||||||||||||||||||
vec![ | ||||||||||||||||||||||
DataType::Timestamp(TimeUnit::Nanosecond, None), | ||||||||||||||||||||||
DataType::Timestamp( | ||||||||||||||||||||||
TimeUnit::Nanosecond, | ||||||||||||||||||||||
Some(TIMEZONE_WILDCARD.into()), | ||||||||||||||||||||||
), | ||||||||||||||||||||||
] | ||||||||||||||||||||||
} | ||||||||||||||||||||||
TypeSignatureClass::Date => { | ||||||||||||||||||||||
vec![DataType::Date64] | ||||||||||||||||||||||
} | ||||||||||||||||||||||
TypeSignatureClass::Time => { | ||||||||||||||||||||||
vec![DataType::Time64(TimeUnit::Nanosecond)] | ||||||||||||||||||||||
} | ||||||||||||||||||||||
TypeSignatureClass::Interval => { | ||||||||||||||||||||||
vec![DataType::Interval(IntervalUnit::DayTime)] | ||||||||||||||||||||||
} | ||||||||||||||||||||||
TypeSignatureClass::Duration => { | ||||||||||||||||||||||
vec![DataType::Duration(TimeUnit::Nanosecond)] | ||||||||||||||||||||||
} | ||||||||||||||||||||||
}) | ||||||||||||||||||||||
.multi_cartesian_product() | ||||||||||||||||||||||
.collect(), | ||||||||||||||||||||||
TypeSignature::Variadic(types) => types | ||||||||||||||||||||||
|
@@ -424,7 +476,10 @@ impl Signature { | |||||||||||||||||||||
} | ||||||||||||||||||||||
} | ||||||||||||||||||||||
/// Target coerce types in order | ||||||||||||||||||||||
pub fn coercible(target_types: Vec<LogicalTypeRef>, volatility: Volatility) -> Self { | ||||||||||||||||||||||
pub fn coercible( | ||||||||||||||||||||||
target_types: Vec<TypeSignatureClass>, | ||||||||||||||||||||||
volatility: Volatility, | ||||||||||||||||||||||
) -> Self { | ||||||||||||||||||||||
Self { | ||||||||||||||||||||||
type_signature: TypeSignature::Coercible(target_types), | ||||||||||||||||||||||
volatility, | ||||||||||||||||||||||
|
@@ -618,8 +673,10 @@ mod tests { | |||||||||||||||||||||
] | ||||||||||||||||||||||
); | ||||||||||||||||||||||
|
||||||||||||||||||||||
let type_signature = | ||||||||||||||||||||||
TypeSignature::Coercible(vec![logical_string(), logical_int64()]); | ||||||||||||||||||||||
let type_signature = TypeSignature::Coercible(vec![ | ||||||||||||||||||||||
TypeSignatureClass::Native(logical_string()), | ||||||||||||||||||||||
TypeSignatureClass::Native(logical_int64()), | ||||||||||||||||||||||
]); | ||||||||||||||||||||||
let possible_types = type_signature.get_possible_types(); | ||||||||||||||||||||||
assert_eq!( | ||||||||||||||||||||||
possible_types, | ||||||||||||||||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -22,14 +22,18 @@ use arrow::{ | |
datatypes::{DataType, TimeUnit}, | ||
}; | ||
use datafusion_common::{ | ||
exec_err, internal_datafusion_err, internal_err, plan_err, | ||
exec_err, internal_datafusion_err, internal_err, not_impl_err, plan_err, | ||
types::{LogicalType, NativeType}, | ||
utils::{coerced_fixed_size_list_to_list, list_ndims}, | ||
Result, | ||
}; | ||
use datafusion_expr_common::{ | ||
signature::{ArrayFunctionSignature, FIXED_SIZE_LIST_WILDCARD, TIMEZONE_WILDCARD}, | ||
type_coercion::binary::{comparison_coercion_numeric, string_coercion}, | ||
signature::{ | ||
ArrayFunctionSignature, TypeSignatureClass, FIXED_SIZE_LIST_WILDCARD, | ||
TIMEZONE_WILDCARD, | ||
}, | ||
type_coercion::binary::comparison_coercion_numeric, | ||
type_coercion::binary::string_coercion, | ||
}; | ||
use std::sync::Arc; | ||
|
||
|
@@ -568,35 +572,65 @@ fn get_valid_types( | |
// Make sure the corresponding test is covered | ||
// If this function becomes COMPLEX, create another new signature! | ||
fn can_coerce_to( | ||
logical_type: &NativeType, | ||
target_type: &NativeType, | ||
) -> bool { | ||
if logical_type == target_type { | ||
return true; | ||
} | ||
current_type: &DataType, | ||
target_type_class: &TypeSignatureClass, | ||
) -> Result<DataType> { | ||
let logical_type: NativeType = current_type.into(); | ||
|
||
if logical_type == &NativeType::Null { | ||
return true; | ||
} | ||
match target_type_class { | ||
TypeSignatureClass::Native(native_type) => { | ||
let target_type = native_type.native(); | ||
if &logical_type == target_type { | ||
return target_type.default_cast_for(current_type); | ||
} | ||
|
||
if target_type.is_integer() && logical_type.is_integer() { | ||
return true; | ||
} | ||
if logical_type == NativeType::Null { | ||
return target_type.default_cast_for(current_type); | ||
} | ||
|
||
if target_type.is_integer() && logical_type.is_integer() { | ||
return target_type.default_cast_for(current_type); | ||
} | ||
|
||
false | ||
internal_err!( | ||
"Expect {} but received {}", | ||
target_type_class, | ||
current_type | ||
) | ||
} | ||
// Not consistent with Postgres and DuckDB but to avoid regression we implicit cast string to timestamp | ||
TypeSignatureClass::Timestamp | ||
if logical_type == NativeType::String => | ||
{ | ||
Ok(DataType::Timestamp(TimeUnit::Nanosecond, None)) | ||
} | ||
TypeSignatureClass::Timestamp if logical_type.is_timestamp() => { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. timestamp is matched whatever the timezone it has |
||
Ok(current_type.to_owned()) | ||
} | ||
TypeSignatureClass::Date if logical_type.is_date() => { | ||
Ok(current_type.to_owned()) | ||
} | ||
TypeSignatureClass::Time if logical_type.is_time() => { | ||
Ok(current_type.to_owned()) | ||
} | ||
TypeSignatureClass::Interval if logical_type.is_interval() => { | ||
Ok(current_type.to_owned()) | ||
} | ||
TypeSignatureClass::Duration if logical_type.is_duration() => { | ||
Ok(current_type.to_owned()) | ||
} | ||
_ => { | ||
not_impl_err!("Got logical_type: {logical_type} with target_type_class: {target_type_class}") | ||
} | ||
} | ||
} | ||
|
||
let mut new_types = Vec::with_capacity(current_types.len()); | ||
for (current_type, target_type) in | ||
for (current_type, target_type_class) in | ||
current_types.iter().zip(target_types.iter()) | ||
{ | ||
let logical_type: NativeType = current_type.into(); | ||
let target_logical_type = target_type.native(); | ||
if can_coerce_to(&logical_type, target_logical_type) { | ||
let target_type = | ||
target_logical_type.default_cast_for(current_type)?; | ||
new_types.push(target_type); | ||
} | ||
let target_type = can_coerce_to(current_type, target_type_class)?; | ||
new_types.push(target_type); | ||
} | ||
|
||
vec![new_types] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯