Simplify type signatures using TypeSignatureClass for mixed type function signature #13372

Merged: 23 commits, Dec 14, 2024
Changes from 20 commits
2ea2c27
add type sig class
jayzhan211 Nov 12, 2024
664edaa
timestamp
jayzhan211 Nov 12, 2024
fc99216
date part
jayzhan211 Nov 12, 2024
2eac9f8
fmt
jayzhan211 Nov 12, 2024
dd3fb7f
taplo format
jayzhan211 Nov 12, 2024
e114c86
tpch test
jayzhan211 Nov 12, 2024
f04aed5
msrc issue
jayzhan211 Nov 12, 2024
3b8030c
msrc issue
jayzhan211 Nov 12, 2024
6b1e08a
explicit hash
jayzhan211 Nov 12, 2024
afa23df
Merge branch 'main' of github.com:apache/datafusion into type-class
jayzhan211 Nov 13, 2024
1b2a3fd
Enhance type coercion and function signatures
jayzhan211 Dec 8, 2024
45d417f
Merge branch 'main' of github.com:apache/datafusion into type-class
jayzhan211 Dec 8, 2024
1e43c90
fix comment
jayzhan211 Dec 8, 2024
afe48d1
fix signature
jayzhan211 Dec 8, 2024
e666e0d
Merge branch 'main' of github.com:apache/datafusion into type-class
jayzhan211 Dec 9, 2024
13fb7ed
fix test
jayzhan211 Dec 9, 2024
f141e89
Enhance type coercion for timestamps to allow implicit casting from s…
jayzhan211 Dec 10, 2024
e89520e
Refactor type coercion logic for timestamps to improve readability an…
jayzhan211 Dec 10, 2024
7920421
Fix SQL logic tests to correct query error handling for timestamp fun…
jayzhan211 Dec 11, 2024
4a7404d
Enhance timestamp handling in TypeSignature to support timezone speci…
jayzhan211 Dec 11, 2024
7400429
Merge branch 'main' of github.com:apache/datafusion into type-class
jayzhan211 Dec 12, 2024
3647bb5
Merge branch 'main' of github.com:apache/datafusion into type-class
jayzhan211 Dec 12, 2024
830b2d5
Refactor date_part function: remove redundant imports and add missing…
jayzhan211 Dec 13, 2024
1 change: 1 addition & 0 deletions datafusion-cli/Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

27 changes: 27 additions & 0 deletions datafusion/common/src/types/native.rs
@@ -245,6 +245,8 @@ impl LogicalType for NativeType {
(Self::FixedSizeBinary(size), _) => FixedSizeBinary(*size),
(Self::String, LargeBinary) => LargeUtf8,
(Self::String, BinaryView) => Utf8View,
+// We don't cast to another kind of string type if the origin one is already a string type
Contributor

💯

+(Self::String, Utf8 | LargeUtf8 | Utf8View) => origin.to_owned(),
(Self::String, data_type) if can_cast_types(data_type, &Utf8View) => Utf8View,
(Self::String, data_type) if can_cast_types(data_type, &LargeUtf8) => {
LargeUtf8
@@ -433,4 +435,29 @@ impl NativeType {
UInt8 | UInt16 | UInt32 | UInt64 | Int8 | Int16 | Int32 | Int64
)
}
+
+#[inline]
+pub fn is_timestamp(&self) -> bool {
+matches!(self, NativeType::Timestamp(_, _))
+}
+
+#[inline]
+pub fn is_date(&self) -> bool {
+matches!(self, NativeType::Date)
+}
+
+#[inline]
+pub fn is_time(&self) -> bool {
+matches!(self, NativeType::Time(_))
+}
+
+#[inline]
+pub fn is_interval(&self) -> bool {
+matches!(self, NativeType::Interval(_))
+}
+
+#[inline]
+pub fn is_duration(&self) -> bool {
+matches!(self, NativeType::Duration(_))
+}
}
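
The new predicates above ignore the payload of each variant, so any timestamp matches whatever its unit or timezone. A minimal self-contained sketch of the same pattern follows; `NativeKind` and `Unit` are hypothetical stand-ins for DataFusion's `NativeType` and Arrow's `TimeUnit`, not the real definitions.

```rust
// Hypothetical, simplified stand-ins; the real types live in datafusion_common and arrow.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Unit {
    Second,
    Nanosecond,
}

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum NativeKind {
    Int64,
    Date,
    Time(Unit),
    Timestamp(Unit, Option<String>),
    Duration(Unit),
}

#[allow(dead_code)]
impl NativeKind {
    /// True for any timestamp: the unit and the optional timezone are ignored.
    #[inline]
    fn is_timestamp(&self) -> bool {
        matches!(self, NativeKind::Timestamp(_, _))
    }

    /// True only for the date variant.
    #[inline]
    fn is_date(&self) -> bool {
        matches!(self, NativeKind::Date)
    }

    /// True for any duration, whatever its unit.
    #[inline]
    fn is_duration(&self) -> bool {
        matches!(self, NativeKind::Duration(_))
    }
}
```

With this shape, `NativeKind::Timestamp(Unit::Second, Some("UTC".to_string())).is_timestamp()` and the zone-less variant both return `true`.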
73 changes: 65 additions & 8 deletions datafusion/expr-common/src/signature.rs
@@ -18,8 +18,10 @@
//! Signature module contains foundational types that are used to represent signatures, types,
//! and return types of functions in DataFusion.

+use std::fmt::Display;
+
use crate::type_coercion::aggregates::NUMERICS;
-use arrow::datatypes::DataType;
+use arrow::datatypes::{DataType, IntervalUnit, TimeUnit};
use datafusion_common::types::{LogicalTypeRef, NativeType};
use itertools::Itertools;

@@ -112,7 +114,7 @@ pub enum TypeSignature {
/// For example, `Coercible(vec![logical_float64()])` accepts
/// arguments like `vec![DataType::Int32]` or `vec![DataType::Float32]`
/// since i32 and f32 can be casted to f64
-Coercible(Vec<LogicalTypeRef>),
+Coercible(Vec<TypeSignatureClass>),
Member

nice!

/// The arguments will be coerced to a single type based on the comparison rules.
/// For example, i32 and i64 has coerced type Int64.
///
@@ -154,6 +156,33 @@ impl TypeSignature {
}
}

+/// Represents the class of types that can be used in a function signature.
+///
+/// This is used to specify what types are valid for function arguments in a more flexible way than
+/// just listing specific DataTypes. For example, TypeSignatureClass::Timestamp matches any timestamp
+/// type regardless of timezone or precision.
+///
+/// Used primarily with TypeSignature::Coercible to define function signatures that can accept
+/// arguments that can be coerced to a particular class of types.
+#[derive(Debug, Clone, Eq, PartialEq, PartialOrd, Hash)]
+pub enum TypeSignatureClass {
+Timestamp,
Member

We may need to treat timestamp and timestamp with zone separately 🤔

Contributor Author

Unless there is a function that treats timestamps differently based on timezone, it is simpler to treat them equivalently.

Contributor

I think some functions currently define that they take a UTC timestamp or ANY timestamp via https://docs.rs/datafusion/latest/datafusion/logical_expr/constant.TIMEZONE_WILDCARD.html

Contributor

I also feel like we should have another variant for timestamp with time zone. IMO, they are different types.

  • Timestamp(timeunit, None) for timestamp without time zone.
  • Timestamp(timeunit, Some(TIMEZONE_WILDCARD)) for timestamp with time zone.

They can't interact with each other before applying some casting or coercion.

For example,

Exact(vec![
DataType::Interval(MonthDayNano),
Timestamp(array_type, None),
Timestamp(Nanosecond, None),
]),
Exact(vec![
DataType::Interval(MonthDayNano),
Timestamp(array_type, Some(TIMEZONE_WILDCARD.into())),
Timestamp(Nanosecond, Some(TIMEZONE_WILDCARD.into())),
]),

The date_bin function accepts two Timestamp arguments. However, if we try to simplify here, we may write something like

  TypeSignature::Coercible(vec![
      TypeSignatureClass::Interval,
      TypeSignatureClass::Timestamp,
      TypeSignatureClass::Timestamp,
  ]),

This means we would accept SQL like date_bin(INTERVAL 1 HOUR, timestamp_without_timezone_col, timestamp_with_timezone_col), but I guess that's not correct (it doesn't match the original signature).

If we have a class for timestamp with time zone, we can write

TypeSignature::OneOf(vec![
  TypeSignature::Coercible(vec![
      TypeSignatureClass::Interval,
      TypeSignatureClass::Timestamp,
      TypeSignatureClass::Timestamp,
  ]),
  TypeSignature::Coercible(vec![
      TypeSignatureClass::Interval,
      TypeSignatureClass::Timestamp_with_time_zone,
      TypeSignatureClass::Timestamp_with_time_zone,
  ]),
])

That's closer to the original signature.
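
To make the concern in this comment concrete, here is a small self-contained sketch. It does not use the DataFusion API: `Arg`, `Class`, `class_accepts`, and `one_of_accepts` are hypothetical names, and `Class::AnyTimestamp` models the single collapsed timestamp class while `Class::Timestamp` / `Class::TimestampTz` model the proposed split.

```rust
// Hypothetical model of argument types and signature classes (not DataFusion types).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Arg {
    Interval,
    Timestamp,
    TimestampTz,
}

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Class {
    Interval,
    AnyTimestamp, // collapsed class: with or without time zone
    Timestamp,    // without time zone only
    TimestampTz,  // with time zone only
}

/// Does a single argument satisfy a single class?
fn class_accepts(class: Class, arg: Arg) -> bool {
    match (class, arg) {
        (Class::Interval, Arg::Interval) => true,
        (Class::AnyTimestamp, Arg::Timestamp | Arg::TimestampTz) => true,
        (Class::Timestamp, Arg::Timestamp) => true,
        (Class::TimestampTz, Arg::TimestampTz) => true,
        _ => false,
    }
}

/// A call matches if at least one alternative accepts every argument positionally.
fn one_of_accepts(alternatives: &[Vec<Class>], args: &[Arg]) -> bool {
    alternatives.iter().any(|sig| {
        sig.len() == args.len()
            && sig.iter().zip(args).all(|(class, arg)| class_accepts(*class, *arg))
    })
}
```

Under these assumptions, the collapsed signature `[Interval, AnyTimestamp, AnyTimestamp]` accepts the mixed call `[Interval, Timestamp, TimestampTz]`, while the two-alternative split signature rejects it and still accepts the homogeneous calls.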

Contributor Author

How about managing timestamps within the invoke function? Handling specific cases like Timestamp_with_time_zone or TIMEZONE_WILDCARD adds unnecessary complexity to the function's signature without providing much benefit. Instead, why not define a high-level Timestamp in the signature and handle the finer details elsewhere?

Contributor

Maybe we can refer to the behavior of Postgres or DuckDB. I listed the signature of the timestamp function age from their system catalogs for reference.

Postgres

test=# SELECT proname, proargtypes FROM pg_proc WHERE proname = 'age';
 proname | proargtypes 
---------+-------------
 age     | 28
 age     | 1114
 age     | 1184
 age     | 1114 1114
 age     | 1184 1184
(5 rows)

test=# SELECT oid, typname FROM pg_type WHERE oid IN (28, 1184, 1114);
 oid  |   typname   
------+-------------
   28 | xid
 1114 | timestamp
 1184 | timestamptz
(3 rows)

DuckDB

D SELECT function_name, parameter_types 
  FROM duckdb_functions() 
 WHERE function_name = 'age';
┌───────────────┬──────────────────────────────────────────────────────┐
│ function_name │                   parameter_types                    │
│    varchar    │                      varchar[]                       │
├───────────────┼──────────────────────────────────────────────────────┤
│ age           │ [TIMESTAMP]                                          │
│ age           │ [TIMESTAMP, TIMESTAMP]                               │
│ age           │ [TIMESTAMP WITH TIME ZONE, TIMESTAMP WITH TIME ZONE] │
│ age           │ [TIMESTAMP WITH TIME ZONE]                           │
└───────────────┴──────────────────────────────────────────────────────┘

If you try to input an unsupported type value, you may encounter an error like the following:

D SELECT age('2001-01-01 18:00:00');
Binder Error: Could not choose a best candidate function for the function call "age(STRING_LITERAL)". In order to select one, please add explicit type casts.
    Candidate functions:
    age(TIMESTAMP WITH TIME ZONE) -> INTERVAL
    age(TIMESTAMP) -> INTERVAL

LINE 1: SELECT age('2001-01-01 18:00:00');

Both systems treat TIMESTAMP and TIMESTAMP WITH TIME ZONE as distinct types at a high level.

The advantage of separating these types is that it allows for stricter type-checking when matching a function's signature. This reduces the likelihood of developers failing to correctly handle type checks when implementing timestamp functions.

Honestly, I'm not sure how complex it would become if we separated them 🤔 . If it requires significant effort, I'm fine with keeping the current design.

Member

> Both systems treat TIMESTAMP and TIMESTAMP WITH TIME ZONE as distinct types at a high level.

That's a good point. They are quite different: only the latter denotes a point in time. The former denotes a "wall date/time" with no zone information, so it does not denote any particular point in time.

Their similarity is deceptive and a source of many bugs.

> This reduces the likelihood of developers failing to correctly handle type checks when implementing timestamp functions.

The timestamp and timestamp tz values are different Arrow types, so the function implementor needs to handle them separately anyway.

The point is that some functions will be applicable to one of these types but not the other.
For example, a (hypothetical) to_unix_timestamp(timestamp_tz) -> Int64 function should operate on a point-in-time value, so it should accept timestamp_tz.
Note that in SQL, timestamp is coercible to timestamp_tz, so such a function is still callable with a timestamp value, but that's not something the function implementor should be concerned about.

Contributor Author

jayzhan211, Dec 12, 2024

My solution so far is to use TypeSignatureClass::Timestamp if we don't care whether it has a timezone, and fall back to TypeSignatureClass::Native() if we need to tell the difference.

Contributor

> My solution so far is to use TypeSignatureClass::Timestamp if we don't care whether it has a timezone, and fall back to TypeSignatureClass::Native() if we need to tell the difference.

Agreed. Handling Timestamps using TypeSignatureClass::Native is a good idea. I still think we should treat them as different types.
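
The agreed approach can be sketched as follows. This is only an illustration under stated assumptions: `LogicalKind` and `SignatureClass` are hypothetical simplifications (the real `Native` variant wraps a `LogicalTypeRef`, and real matching goes through coercion rather than plain equality).

```rust
// Hypothetical, simplified logical type: only what is needed to show the idea.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum LogicalKind {
    Int64,
    Timestamp { tz: Option<String> },
}

// Hypothetical stand-in for TypeSignatureClass with just two variants.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum SignatureClass {
    /// Broad class: any timestamp, with or without a timezone.
    Timestamp,
    /// Exact logical type: use this when the timezone matters.
    Native(LogicalKind),
}

fn matches_class(class: &SignatureClass, arg: &LogicalKind) -> bool {
    match class {
        SignatureClass::Timestamp => matches!(arg, LogicalKind::Timestamp { .. }),
        // Assumption: exact equality stands in for the real coercion check.
        SignatureClass::Native(expected) => expected == arg,
    }
}
```

A function that must distinguish the two would declare `SignatureClass::Native(LogicalKind::Timestamp { tz: None })`, which rejects a zoned timestamp, whereas `SignatureClass::Timestamp` accepts both.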

+Date,
+Time,
+Interval,
+Duration,
+Native(LogicalTypeRef),
+// TODO:
+// Numeric
+// Integer
+}
+
+impl Display for TypeSignatureClass {
+fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+write!(f, "TypeSignatureClass::{self:?}")
Member

Display looks more verbose than Debug. Typically it's the other way around.

The produced error message looks a bit long, but I don't know how to make it more readable. Thoughts?

Internal error: Expect TypeSignatureClass::Native\(LogicalType\(Native\(Int64\), Int64\)\) but received Float64

Contributor

I agree it would be great to have a special type signature class display

Maybe something like (any int) or Integer or (any timestamp) for Timestamp 🤔

+}
+}
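
One way the shorter display suggested in the review could look is sketched below. This is not what the PR implements: `SignatureClass` is a hypothetical stand-in, `Native` carries a plain type name instead of a `LogicalTypeRef`, and the "(any ...)" labels are taken from the reviewer's suggestion.

```rust
use std::fmt;

// Hypothetical stand-in for TypeSignatureClass; `Native` holds a type name for brevity.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum SignatureClass {
    Timestamp,
    Date,
    Time,
    Interval,
    Duration,
    Native(String),
}

impl fmt::Display for SignatureClass {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SignatureClass::Timestamp => write!(f, "(any timestamp)"),
            SignatureClass::Date => write!(f, "(any date)"),
            SignatureClass::Time => write!(f, "(any time)"),
            SignatureClass::Interval => write!(f, "(any interval)"),
            SignatureClass::Duration => write!(f, "(any duration)"),
            // Print only the type name, not the nested Debug representation.
            SignatureClass::Native(name) => write!(f, "{name}"),
        }
    }
}
```

With this sketch, formatting `"Expect {} but received Float64"` with `SignatureClass::Native("Int64".to_string())` yields a message without the nested `LogicalType(Native(...))` wrapper.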

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Hash)]
pub enum ArrayFunctionSignature {
/// Specialized Signature for ArrayAppend and similar functions
@@ -180,7 +209,7 @@ pub enum ArrayFunctionSignature {
MapArray,
}

-impl std::fmt::Display for ArrayFunctionSignature {
+impl Display for ArrayFunctionSignature {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
ArrayFunctionSignature::ArrayAndElement => {
@@ -255,7 +284,7 @@ impl TypeSignature {
}

/// Helper function to join types with specified delimiter.
-pub fn join_types<T: std::fmt::Display>(types: &[T], delimiter: &str) -> String {
+pub fn join_types<T: Display>(types: &[T], delimiter: &str) -> String {
types
.iter()
.map(|t| t.to_string())
@@ -290,7 +319,30 @@ impl TypeSignature {
.collect(),
TypeSignature::Coercible(types) => types
.iter()
-.map(|logical_type| get_data_types(logical_type.native()))
+.map(|logical_type| match logical_type {
+TypeSignatureClass::Native(l) => get_data_types(l.native()),
+TypeSignatureClass::Timestamp => {
+vec![
+DataType::Timestamp(TimeUnit::Nanosecond, None),
+DataType::Timestamp(
+TimeUnit::Nanosecond,
+Some(TIMEZONE_WILDCARD.into()),
+),
+]
+}
+TypeSignatureClass::Date => {
+vec![DataType::Date64]
+}
+TypeSignatureClass::Time => {
+vec![DataType::Time64(TimeUnit::Nanosecond)]
+}
+TypeSignatureClass::Interval => {
+vec![DataType::Interval(IntervalUnit::DayTime)]
+}
+TypeSignatureClass::Duration => {
+vec![DataType::Duration(TimeUnit::Nanosecond)]
+}
+})
.multi_cartesian_product()
.collect(),
TypeSignature::Variadic(types) => types
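
The `Coercible` arm above maps each class to a list of representative data types and then combines the lists with itertools' `multi_cartesian_product`. The sketch below shows that combination step with a hand-written helper so it has no dependencies; `cartesian_product` is a hypothetical name, and plain strings stand in for `DataType` values.

```rust
/// Builds every combination that picks one candidate per argument position.
/// The last position varies fastest, so results come out in lexicographic order.
fn cartesian_product<T: Clone>(candidates: &[Vec<T>]) -> Vec<Vec<T>> {
    // Start with a single empty combination and extend it position by position.
    let mut combos: Vec<Vec<T>> = vec![Vec::new()];
    for options in candidates {
        let mut next = Vec::with_capacity(combos.len() * options.len());
        for prefix in &combos {
            for option in options {
                let mut combo = prefix.clone();
                combo.push(option.clone());
                next.push(combo);
            }
        }
        combos = next;
    }
    combos
}
```

For a two-argument signature whose first class expands to two timestamp types and whose second expands to one date type, this produces two candidate argument lists.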
@@ -424,7 +476,10 @@ impl Signature {
}
}
/// Target coerce types in order
-pub fn coercible(target_types: Vec<LogicalTypeRef>, volatility: Volatility) -> Self {
+pub fn coercible(
+target_types: Vec<TypeSignatureClass>,
+volatility: Volatility,
+) -> Self {
Self {
type_signature: TypeSignature::Coercible(target_types),
volatility,
@@ -618,8 +673,10 @@ mod tests {
]
);

-let type_signature =
-TypeSignature::Coercible(vec![logical_string(), logical_int64()]);
+let type_signature = TypeSignature::Coercible(vec![
+TypeSignatureClass::Native(logical_string()),
+TypeSignatureClass::Native(logical_int64()),
+]);
let possible_types = type_signature.get_possible_types();
assert_eq!(
possible_types,
82 changes: 58 additions & 24 deletions datafusion/expr/src/type_coercion/functions.rs
@@ -22,14 +22,18 @@ use arrow::{
datatypes::{DataType, TimeUnit},
};
use datafusion_common::{
-exec_err, internal_datafusion_err, internal_err, plan_err,
+exec_err, internal_datafusion_err, internal_err, not_impl_err, plan_err,
types::{LogicalType, NativeType},
utils::{coerced_fixed_size_list_to_list, list_ndims},
Result,
};
use datafusion_expr_common::{
-signature::{ArrayFunctionSignature, FIXED_SIZE_LIST_WILDCARD, TIMEZONE_WILDCARD},
-type_coercion::binary::{comparison_coercion_numeric, string_coercion},
+signature::{
+ArrayFunctionSignature, TypeSignatureClass, FIXED_SIZE_LIST_WILDCARD,
+TIMEZONE_WILDCARD,
+},
+type_coercion::binary::comparison_coercion_numeric,
+type_coercion::binary::string_coercion,
};
use std::sync::Arc;

@@ -568,35 +572,65 @@ fn get_valid_types(
// Make sure the corresponding test is covered
// If this function becomes COMPLEX, create another new signature!
fn can_coerce_to(
-logical_type: &NativeType,
-target_type: &NativeType,
-) -> bool {
-if logical_type == target_type {
-return true;
-}
+current_type: &DataType,
+target_type_class: &TypeSignatureClass,
+) -> Result<DataType> {
+let logical_type: NativeType = current_type.into();

-if logical_type == &NativeType::Null {
-return true;
-}
+match target_type_class {
+TypeSignatureClass::Native(native_type) => {
+let target_type = native_type.native();
+if &logical_type == target_type {
+return target_type.default_cast_for(current_type);
+}

-if target_type.is_integer() && logical_type.is_integer() {
-return true;
-}
+if logical_type == NativeType::Null {
+return target_type.default_cast_for(current_type);
+}
+
+if target_type.is_integer() && logical_type.is_integer() {
+return target_type.default_cast_for(current_type);
+}

-false
+internal_err!(
+"Expect {} but received {}",
+target_type_class,
+current_type
+)
+}
+// Not consistent with Postgres and DuckDB but to avoid regression we implicit cast string to timestamp
+TypeSignatureClass::Timestamp
+if logical_type == NativeType::String =>
+{
+Ok(DataType::Timestamp(TimeUnit::Nanosecond, None))
+}
+TypeSignatureClass::Timestamp if logical_type.is_timestamp() => {
Contributor Author

A timestamp is matched regardless of its timezone.

+Ok(current_type.to_owned())
+}
+TypeSignatureClass::Date if logical_type.is_date() => {
+Ok(current_type.to_owned())
+}
+TypeSignatureClass::Time if logical_type.is_time() => {
+Ok(current_type.to_owned())
+}
+TypeSignatureClass::Interval if logical_type.is_interval() => {
+Ok(current_type.to_owned())
+}
+TypeSignatureClass::Duration if logical_type.is_duration() => {
+Ok(current_type.to_owned())
+}
+_ => {
+not_impl_err!("Got logical_type: {logical_type} with target_type_class: {target_type_class}")
+}
+}
}

let mut new_types = Vec::with_capacity(current_types.len());
-for (current_type, target_type) in
+for (current_type, target_type_class) in
current_types.iter().zip(target_types.iter())
{
-let logical_type: NativeType = current_type.into();
-let target_logical_type = target_type.native();
-if can_coerce_to(&logical_type, target_logical_type) {
-let target_type =
-target_logical_type.default_cast_for(current_type)?;
-new_types.push(target_type);
-}
+let target_type = can_coerce_to(current_type, target_type_class)?;
+new_types.push(target_type);
}

vec![new_types]
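
The key change in this hunk is that `can_coerce_to` now returns the coerced type or an error instead of a `bool`, so a mismatch surfaces as an error rather than being silently skipped. Below is a minimal sketch of that control flow under simplifying assumptions: `Ty`, `Class`, `coerce_one`, and `coerce_all` are hypothetical, errors are plain strings, and returning the target type stands in for `default_cast_for`.

```rust
// Hypothetical, simplified data types and signature classes (not the DataFusion API).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Null,
    Int32,
    Int64,
    Utf8,
    TimestampNs(Option<String>),
}

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Class {
    Timestamp,
    Native(Ty),
}

fn is_integer(t: &Ty) -> bool {
    matches!(t, Ty::Int32 | Ty::Int64)
}

/// Returns the type the argument is coerced to, or an error if it does not fit the class.
fn coerce_one(current: &Ty, class: &Class) -> Result<Ty, String> {
    match class {
        Class::Native(target) => {
            let fits = current == target
                || *current == Ty::Null
                || (is_integer(target) && is_integer(current));
            if fits {
                // Assumption: the target itself stands in for `default_cast_for`.
                Ok(target.clone())
            } else {
                Err(format!("Expect {class:?} but received {current:?}"))
            }
        }
        // As in the diff, a string is implicitly cast to a zone-less timestamp.
        Class::Timestamp if *current == Ty::Utf8 => Ok(Ty::TimestampNs(None)),
        // Any timestamp is kept as-is, whatever its timezone.
        Class::Timestamp if matches!(current, Ty::TimestampNs(_)) => Ok(current.clone()),
        _ => Err(format!("not implemented: {current:?} with {class:?}")),
    }
}

/// Coerces arguments positionally; the first failure aborts the whole signature.
fn coerce_all(current: &[Ty], classes: &[Class]) -> Result<Vec<Ty>, String> {
    current
        .iter()
        .zip(classes)
        .map(|(ty, class)| coerce_one(ty, class))
        .collect()
}
```

Collecting into `Result<Vec<_>, _>` plays the role of the `?` inside the loop above: the first argument that cannot be coerced turns the whole call into an error.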
1 change: 1 addition & 0 deletions datafusion/functions/Cargo.toml
@@ -75,6 +75,7 @@ datafusion-common = { workspace = true }
datafusion-doc = { workspace = true }
datafusion-execution = { workspace = true }
datafusion-expr = { workspace = true }
+datafusion-expr-common = { workspace = true }
datafusion-macros = { workspace = true }
hashbrown = { workspace = true, optional = true }
hex = { version = "0.4", optional = true }