Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specification of type system #1521

Closed
leiysky opened this issue Aug 19, 2021 · 11 comments
Closed

Specification of type system #1521

leiysky opened this issue Aug 19, 2021 · 11 comments
Labels
C-improvement Category: improvement

Comments

@leiysky
Copy link
Contributor

leiysky commented Aug 19, 2021

It's important to provide a well defined type system.

In ANSI SQL specification, there are many data types, while all of the databases have their own type systems that always have difference fron ANSI SQL.

The most complicated part is the implicit conversion between data types.

For example, there isn't a real BOOLEAN type in MySQL. They represents BOOLEAN with TINYINT(1), thus literal TRUE value is actually 1 with type TINYINT.

Besides, MySQL allows implicit conversion from string type to float, that is, if you write down a SQL query like SELECT SUM(char_column) FROM table, MySQL can still produce a bizzare result(it will implicitly convert char_column into a double_column, and the conversion rule is not well defined).

While PostgreSQL do have BOOLEAN type, and it doesn't allow non-reasonable implicit conversion.

We've discussed about this in #1492, and the detailed discussion about type system will be finished in this issue.

@leiysky leiysky added the C-improvement Category: improvement label Aug 19, 2021
@leiysky
Copy link
Contributor Author

leiysky commented Aug 19, 2021

cc @sundy-li

@sundy-li
Copy link
Member

sundy-li commented Aug 19, 2021

Thanks for the issue, I think in the alpha version, the types we must have are:

PrimitiveTypes:

  • UInt8
  • UInt16
  • UInt32
  • UInt64
  • Int8
  • Int16
  • Int32
  • Int64
  • Float32
  • Float64

UInt8/Int8 aka tinyint, UInt16/Int16 aka smallint, UInt32/Int32 aka int, UInt64/Int64 aka bigint, Float32 aka float, Float64 aka double.

Boolean

  • Boolean
    Some database system takes UInt8/tinyint(1) as Boolean, but In Datafuse I think it's ok to have a native boolean type, which stores one value as a bit. I think it can work as well as UInt8.

String, Binary

  • Utf8 <---> LargeUtf8 in arrow2
  • Binary <---> LargeBinary in arrow2

Binary can cover the case of Utf8, seems we can keep Binary type and allow users to add charset constraints to the type.

Complex types

  • List(Box) <---> LargeList in arrow2
  • Struct(Vec)

List aka Array, Struct aka tuple.

Date, DateTime Types

Date, DateTime types are logic types, we can map these types into primitive types.
In Arrow's date Specification, they are signed types. I don't thinks we must follow arrow's specification, we can have our own date/datetime types.

  • Date16 <--> UInt16

  • Date32 <--> UInt32

  • Date64 <--> UInt64

  • DateTime32 <--> UInt32

  • DateTime64 <--> UInt64

@leiysky
Copy link
Contributor Author

leiysky commented Aug 19, 2021

PrimitiveTypes:

Maybe we should involve DECIMAL type as well?

@jorgecarleitao
Copy link

Note that Utf8 and Binary have a maximum number of characters (over all items) of i32:MAX; this is not a per-item constraint; sum over all characters over all items < i32:MAX.

wrt to not following the arrow spec, the constraint there is that if you need FFI (e.g. UDFs written in python for ML and things like that), it will be more difficult to integrate those types. I am not sure that is a problem per se; just mentioning.

@sundy-li
Copy link
Member

sundy-li commented Aug 19, 2021

Maybe we should involve the DECIMAL type as well?

Yes, we can add it later, add Decimal(P, S) after the alpha version.

Note that Utf8 and Binary have a maximum number of characters (over all items) of i32:MAX; this is not a per-item
constraint; sum over all characters over all items < i32:MAX.

Thanks, in Datafuse, we take Utf8 and Binary as LargeUtf8 and LargeBinary in arrow2. Because using i32 to represent the offset is really not enough for large data. I even want to change to offset to be u64 instead of i64, but it will destroy the arrow spec.

the constraint there is that if you need FFI

That's true, still need some considerations and discussions about that.

@jorgecarleitao
Copy link

out of curiosity, why 65535? I though that that was the max of u16:

fn main() {
    println!("i32::MAX: {}", i32::MAX);
    println!("i16::MAX: {}", i16::MAX);
    println!("u16::MAX: {}", u16::MAX);
}

prints

i32::MAX: 2147483647
i16::MAX: 32767
u16::MAX: 65535

@sundy-li
Copy link
Member

sundy-li commented Aug 19, 2021

out of curiosity, why 65535

Oh, that's my bad.

@leiysky
Copy link
Contributor Author

leiysky commented Aug 20, 2021

Hi @sundy-li, it seems this specification basically follows the DataType defined here: https://github.com/datafuselabs/datafuse/blob/681efe24288e41adacd9e5ce978f361a6d37ab42/common/datavalues/src/types/data_type.rs#L23-L51

I have need for this in #1538 and #1537 , can I just treat it as a stable API and rely on it?

@sundy-li
Copy link
Member

Can I just treat it as a stable API and rely on it?

Yes, currently we treat it as our own type.

@Xuanwo
Copy link
Member

Xuanwo commented Jul 2, 2022

FYI @andylokandy

@Xuanwo
Copy link
Member

Xuanwo commented Sep 16, 2022

@leiysky, maybe we can close this issue now?

@Xuanwo Xuanwo closed this as completed Oct 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-improvement Category: improvement
Projects
None yet
Development

No branches or pull requests

4 participants