
Data types to support #26

Open
datapythonista opened this issue Aug 17, 2020 · 12 comments

@datapythonista
Member

datapythonista commented Aug 17, 2020

What data types should be part of the standard? For the array API, the types have been discussed here.

A good reference for data types for data frames is the Arrow data types documentation. The page probably contains many more types than the ones we want to support in the standard.

Topics to make decisions on:

  • Which data types should be supported by the standard?
  • Are implementations expected to provide extra data types? Should we have a list of optional types, or should types not in the standard be considered out of scope?
  • Missing data is discussed separately in Missing Data #9

These are IMO the main types (feel free to disagree):

  • boolean
  • int8 / uint8
  • int16 / uint16
  • int32 / uint32
  • int64 / uint64
  • float32
  • float64
  • string (I guess the main use case is variable-length strings, but should we consider fixed-length strings?)
  • categorical (would it make sense to have categorical8, categorical16, ... for different representations of the categories with uint8, uint16, ...?)
  • datetime64 (requires discussion; pandas uses nanoseconds since the epoch as its unit, which can represent years 1677 to 2262)

Some other types that could be considered:

  • decimal
  • python object
  • binary
  • date
  • time
  • timedelta
  • period
  • complex

And also types based on other types that could be considered:

  • date + timezone
  • numeric + unit
  • interval
  • struct
  • list
  • mapping
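
Purely to make the shape of the decision concrete, here is a minimal sketch of how a standard could enumerate the "main" logical types above, with parametrized metadata (datetime unit, categorical code width) kept separate from the type kind. All names here (`DtypeKind`, `Dtype`, the field names) are hypothetical illustrations, not a proposed API:

```python
from enum import Enum
from dataclasses import dataclass
from typing import Optional

# Hypothetical enumeration of the "main" logical types listed above.
class DtypeKind(Enum):
    BOOL = "boolean"
    INT8 = "int8"
    UINT8 = "uint8"
    INT16 = "int16"
    UINT16 = "uint16"
    INT32 = "int32"
    UINT32 = "uint32"
    INT64 = "int64"
    UINT64 = "uint64"
    FLOAT32 = "float32"
    FLOAT64 = "float64"
    STRING = "string"            # variable-length
    CATEGORICAL = "categorical"
    DATETIME = "datetime64"

@dataclass(frozen=True)
class Dtype:
    kind: DtypeKind
    # Parameters that only apply to some kinds, e.g. the datetime unit
    # ("s", "ms", "us", "ns") or the width of the categorical codes.
    datetime_unit: Optional[str] = None
    categorical_code_width: Optional[int] = None

# Examples:
ts_dtype = Dtype(DtypeKind.DATETIME, datetime_unit="ns")
cat_dtype = Dtype(DtypeKind.CATEGORICAL, categorical_code_width=8)
```

The "other" and derived types in the two lists above could then either extend the same enumeration as optional types, or be declared out of scope.
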
@TomAugspurger

TomAugspurger commented Aug 18, 2020 via email

@datapythonista
Member Author

Thanks Tom, those are great points.

What sort of coverage for these data types do other dataframe libraries have?

I assume Dask provides the same as pandas. For Vaex and Modin I couldn't find information in the documentation. @maartenbreddels @devin-petersohn can you let us know about the data types you implement please?

  1. For categorical, there's some debate about whether the categories (set of valid values) are part of the array or data type. In pandas it's part of the dtype. I believe that in PyArrow they're part of the array's data model.

I guess the API could be defined in a way that we don't need to be opinionated. But it's a very good point, worth discussing.
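
For reference, a small illustration of the two models with current pandas and pyarrow (behaviour in older versions may differ):

```python
import pandas as pd
import pyarrow as pa

# pandas: the categories are part of the dtype itself.
pd_dtype = pd.CategoricalDtype(categories=["a", "b", "c"], ordered=False)
print(pd_dtype.categories)      # Index(['a', 'b', 'c'], dtype='object')

# PyArrow: the dictionary *type* only records the index and value types ...
pa_type = pa.dictionary(pa.int8(), pa.string())
print(pa_type)                  # dictionary<values=string, indices=int8, ordered=0>

# ... while the actual categories (the "dictionary") live on the array.
arr = pa.array(["a", "b", "a"]).dictionary_encode()
print(arr.dictionary)           # the categories observed in the data
print(arr.indices)              # the integer codes
```
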

  2. datetimes sound hard, especially with timezones. But they also seem crucial to support. Do we think that the supported precision can be left as an implementation detail for libraries to decide on?

Assuming we use int64 representing time units since the POSIX epoch, I think consumers should be able to know what the unit is. I'd say that in the same way categoricals will have some metadata (the category names/values), datetimes can have the unit as metadata. I guess things will become more complex later, when operations are considered, but for the data interchange (we're trying to narrow the discussion for now, see #25), I think just having the list of supported units should be enough. Let me know if I'm missing something or if it doesn't make sense to you.
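
A minimal sketch of what that could look like on the consumer side, assuming the producer exposes a raw int64 buffer plus a unit string as metadata (the variable names here are made up):

```python
import numpy as np

# Hypothetical interchange: a column described as int64 values since the
# POSIX epoch, with the unit provided as metadata by the producer.
values = np.array([0, 1_600_000_000], dtype="int64")
unit = "s"  # metadata the consumer reads alongside the buffer

# The consumer interprets the raw integers using the declared unit.
timestamps = values.view(f"datetime64[{unit}]")
print(timestamps)  # ['1970-01-01T00:00:00' '2020-09-13T12:26:40']
```
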

And is the dictionary dtype in the last section a "mapping" type? You might want to rename that to avoid the clash with pyarrow's dictionary type (which is like pandas Categorical).

Good point, renamed.

@TomAugspurger

I assume Dask provides the same as pandas.

The only difference is that Dask adds the concept of a Categorical with "unknown" categories (in addition to Categoricals with known categories). Otherwise they're the same.
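
A small sketch of that difference, assuming a recent dask.dataframe where the `.cat` accessor exposes `known` / `as_known`:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": ["a", "b", "a", "c"]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Casting lazily yields a categorical whose categories are not yet known.
s = ddf["x"].astype("category")
print(s.cat.known)                    # False

# Scanning the data makes the categories known (like a pandas Categorical).
s_known = s.cat.as_known()
print(s_known.cat.known)              # True
print(sorted(s_known.cat.categories)) # ['a', 'b', 'c']
```
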

I guess things will become more complex later, when operations are considered

Right, that's the complicating factor for datetimes. But perhaps everyone uses ns-precision and we'll be OK.
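
For context on why the precision matters, a quick check of the representable ranges with current pandas and NumPy:

```python
import numpy as np
import pandas as pd

# With nanosecond precision stored in int64, pandas can only represent
# roughly the years 1677-2262 (the range mentioned above).
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807

# NumPy's datetime64 allows coarser units, trading precision for range.
print(np.datetime64("1000-01-01", "s"))  # representable with seconds
print(np.datetime64("9999-12-31", "D"))  # or with days
```
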

@jorisvandenbossche
Member

Another question taking one step back: which "aspects" of the data types do we want to standardize?

Is it only about "logical" data types (like "an implementation of the standard should have a timestamp type"), or do we also want to standardize the names (e.g. datetime64 vs timestamp), the implementation details (relating to the question about datetime units, categories of a categorical dtype, etc.), the API for how to create / compare dtypes in Python, ...?

@datapythonista
Member Author

Another question taking one step back: which "aspects" of the data types do we want to standardize?

Good question. In a first stage we care about interchanging data (#25). So, if you receive a "standard" data frame, you can consume its information. I guess we need at least partial name standardization. The consumer needs to know what's in each column of the data frame, so the names of the types need to be known and understood by the consumer. I guess they don't necessarily need to be the types that implementations expose to their users.

For more complex types like categoricals, I'd say that comparing dtypes is out of scope for this stage. So, consumers should compare them manually (e.g. comparing the categories) if they need to.
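
As a sketch of what "comparing them manually" could look like with pandas dtypes (the helper name is made up and not part of any standard):

```python
import pandas as pd

# Hypothetical helper a consumer could use to compare two categorical
# dtypes by hand, as suggested above.
def categoricals_equivalent(a: pd.CategoricalDtype, b: pd.CategoricalDtype) -> bool:
    return (
        a.ordered == b.ordered
        and list(a.categories) == list(b.categories)
    )

left = pd.CategoricalDtype(["a", "b"], ordered=True)
right = pd.CategoricalDtype(["a", "b"], ordered=True)
print(categoricals_equivalent(left, right))  # True
```
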

We're finishing documentation on the scope and use cases for this first stage. I think these documents will facilitate discussions, particularly about goals and scope, and make sure we're all on the same page.

Does this make sense?

@rgommers
Member

Is it only about "logical" data types (like "an implementation of the standard should have a timestamp type"), or do we also want to standardize the names (e.g. datetime64 vs timestamp), the implementation details (relating to the question about datetime units, categories of a categorical dtype, etc.), the API for how to create / compare dtypes in Python, ...?

https://github.com/data-apis/workgroup/issues/2 may help answer this. I'd say the goal is to standardize names, signatures and expected behaviour, but not implementation-related aspects. I think what you call "implementation details" here really is behaviour, so yes, either specify it or (e.g. for things that vary too much or are still in flux in libraries) clearly say it's out of scope.

We're finishing documentation on the scope and use cases for this first stage. I think these documents will facilitate discussions, particularly about goals and scope, and make sure we're all on the same page.

Yes I think that will be helpful. In particular also non-goals I suspect.

@datapythonista
Member Author

  1. For categorical, there's some debate about whether the categories (set
    of valid values) are part of the array or data type. In pandas it's part of
    the dtype. I believe that in PyArrow they're part of the array's data model.

@TomAugspurger do you remember where this conversation was happening? I think I read about this some time ago, but can't find it now. Thanks!

@TomAugspurger

TomAugspurger commented Aug 24, 2020 via email

@devin-petersohn
Member

@maartenbreddels @devin-petersohn can you let us know about the data types you implement please?

Modin handles types identically to how pandas does, including upcasts and type inference. We reuse the pandas and numpy types.

I do think that some object data type is important to support for data that is in the middle of being cleaned (e.g., strings in float data). This is one of the things that dataframes handle more elegantly than tables: a database would not allow that data in the first place.
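
For illustration, the kind of mid-cleaning data this refers to (shown with pandas, which Modin mirrors):

```python
import pandas as pd

# Mixed data that is still being cleaned: strings mixed into float data.
s = pd.Series([1.5, "n/a", 2.0, "unknown"])
print(s.dtype)  # object

# After cleaning, it can be converted to a proper numeric column.
cleaned = pd.to_numeric(s, errors="coerce")
print(cleaned.dtype)  # float64
```
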

@jorisvandenbossche
Member

I believe that in PyArrow they're part of the array's data model.

In Arrow, the categories are indeed no longer part of the type itself. This was changed in https://issues.apache.org/jira/browse/ARROW-3144 / apache/arrow#4316 (which contains some motivation)

@gatorsmile

PySpark follows ANSI SQL, although we have not added support for all of its data types yet.

Recently, we have been revisiting the type coercion rules and type casting. Below is the type casting spec from ANSI SQL:

[image: ANSI SQL type casting specification table]
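
To make the behavioural difference concrete, a small sketch of how ANSI casting surfaces in PySpark, assuming Spark 3.x where the `spark.sql.ansi.enabled` flag controls this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With ANSI mode off (the historical default), an invalid cast returns NULL.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT) AS x").show()      # x is NULL

# With ANSI mode on, the same cast fails at runtime, following the SQL spec.
spark.conf.set("spark.sql.ansi.enabled", "true")
# spark.sql("SELECT CAST('abc' AS INT) AS x").show()    # raises an error
```
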

@kkraus14
Collaborator

Tying in with #9, I would add that a null dtype, or something similar that allows us to avoid having actual data/allocations underneath the column, would at least be a nice optional dtype.

This possibly conflicts with some of the issues with typed nulls earlier in the thread, but having to special-case the handling of columns without memory underneath them in order to accomplish the same with existing dtypes sounds problematic.
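
For reference, Arrow already has something along these lines; a quick sketch with pyarrow (exact memory accounting may vary by version):

```python
import pyarrow as pa

# Arrow's null type: a NullArray only stores a length, so there is
# effectively no data buffer allocated underneath the column.
arr = pa.nulls(1_000_000)
print(arr.type)    # null
print(arr.nbytes)  # 0 -- no backing data allocation
```
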

@rgommers changed the title from "Data types" to "Data types to support" on Apr 29, 2021