
Data types to support #26

Open
datapythonista opened this issue Aug 17, 2020 · 12 comments

@datapythonista
Member

datapythonista commented Aug 17, 2020

What data types should be part of the standard? For the array API, the types have been discussed here.

A good reference for data types for data frames is the Arrow data types documentation. The page probably contains many more types than the ones we want to support in the standard.

Topics to make decisions on:

  • Which data types should be supported by the standard?
  • Are implementations expected to provide extra data types? Should we have a list of optional types, or should types not in the standard be considered out of scope?
  • Missing data is discussed separately in Missing Data #9

These are IMO the main types (feel free to disagree):

  • boolean
  • int8 / uint8
  • int16 / uint16
  • int32 / uint32
  • int64 / uint64
  • float32
  • float64
  • string (I guess the main use case is variable-length strings, but should we consider fixed-length strings?)
  • categorical (would it make sense to have categorical8, categorical16, ... for different representations of the categories with uint8, uint16, ...?)
  • datetime64 (requires discussion; pandas uses nanoseconds since the epoch as its unit, which can represent years 1677 to 2262)

Some other types that could be considered:

  • decimal
  • python object
  • binary
  • date
  • time
  • timedelta
  • period
  • complex

And also types based on other types that could be considered:

  • date + timezone
  • numeric + unit
  • interval
  • struct
  • list
  • mapping
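
Purely to make the shape of the decision concrete, here is a minimal sketch of how a standard could enumerate the "main" logical types above, with parametrized metadata (datetime unit, categorical code width) kept separate from the type kind. All names here (`DtypeKind`, `Dtype`, the field names) are hypothetical illustrations, not a proposed API:

```python
from enum import Enum
from dataclasses import dataclass
from typing import Optional

# Hypothetical enumeration of the "main" logical types listed above.
class DtypeKind(Enum):
    BOOL = "boolean"
    INT8 = "int8"
    UINT8 = "uint8"
    INT16 = "int16"
    UINT16 = "uint16"
    INT32 = "int32"
    UINT32 = "uint32"
    INT64 = "int64"
    UINT64 = "uint64"
    FLOAT32 = "float32"
    FLOAT64 = "float64"
    STRING = "string"            # variable-length
    CATEGORICAL = "categorical"
    DATETIME = "datetime64"

@dataclass(frozen=True)
class Dtype:
    kind: DtypeKind
    # Parameters that only apply to some kinds, e.g. the datetime unit
    # ("s", "ms", "us", "ns") or the width of the categorical codes.
    datetime_unit: Optional[str] = None
    categorical_code_width: Optional[int] = None

# Examples:
ts_dtype = Dtype(DtypeKind.DATETIME, datetime_unit="ns")
cat_dtype = Dtype(DtypeKind.CATEGORICAL, categorical_code_width=8)
```

The "other" and derived types in the two lists above could then either extend the same enumeration as optional types, or be declared out of scope.
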
@TomAugspurger

TomAugspurger commented Aug 18, 2020 via email

@datapythonista
Member Author

Thanks Tom, those are great points.

What sort of coverage for these data types do other dataframe libraries have?

I assume Dask provides the same as pandas. For Vaex and Modin I couldn't find information in the documentation. @maartenbreddels @devin-petersohn can you let us know about the data types you implement please?

  1. For categorical, there's some debate about whether the categories (set of valid values) are part of the array or data type. In pandas it's part of the dtype. I believe that in PyArrow they're part of the array's data model.

I guess the API could be defined in a way that we don't need to be opinionated. But it's a very good point, worth discussing.
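
For reference, a small illustration of the two models with current pandas and pyarrow (behaviour in older versions may differ):

```python
import pandas as pd
import pyarrow as pa

# pandas: the categories are part of the dtype itself.
pd_dtype = pd.CategoricalDtype(categories=["a", "b", "c"], ordered=False)
print(pd_dtype.categories)      # Index(['a', 'b', 'c'], dtype='object')

# PyArrow: the dictionary *type* only records the index and value types ...
pa_type = pa.dictionary(pa.int8(), pa.string())
print(pa_type)                  # dictionary<values=string, indices=int8, ordered=0>

# ... while the actual categories (the "dictionary") live on the array.
arr = pa.array(["a", "b", "a"]).dictionary_encode()
print(arr.dictionary)           # the categories observed in the data
print(arr.indices)              # the integer codes
```
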

  2. datetimes sound hard, especially with timezones. But they also seem crucial to support. Do we think that the supported precision can be left as an implementation detail for libraries to decide on?

Assuming we use int64 representing time units since the POSIX epoch, I think consumers should be able to know what the unit is. I'd say that in the same way categoricals will have some metadata (the category names/values), datetimes can have the unit as metadata. I guess things will become more complex later, when operations are considered, but for the data interchange (we're trying to narrow the discussion for now, see #25), I think just having the list of supported units should be enough. Let me know if I'm missing something or if it doesn't make sense to you.
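
A minimal sketch of what that could look like on the consumer side, assuming the producer exposes a raw int64 buffer plus a unit string as metadata (the variable names here are made up):

```python
import numpy as np

# Hypothetical interchange: a column described as int64 values since the
# POSIX epoch, with the unit provided as metadata by the producer.
values = np.array([0, 1_600_000_000], dtype="int64")
unit = "s"  # metadata the consumer reads alongside the buffer

# The consumer interprets the raw integers using the declared unit.
timestamps = values.view(f"datetime64[{unit}]")
print(timestamps)  # ['1970-01-01T00:00:00' '2020-09-13T12:26:40']
```
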

And is the dictionary dtype in the last section a "mapping" type? You might want to rename that to avoid the clash with pyarrow's dictionary type (which is like pandas Categorical).

Good point, renamed.

@TomAugspurger

I assume Dask provides the same as pandas.

The only difference is that Dask adds the concept of a Categorical with "unknown" categories (in addition to Categoricals with known categories). Otherwise they're the same.
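
A small sketch of that difference, assuming a recent dask.dataframe where the `.cat` accessor exposes `known` / `as_known`:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": ["a", "b", "a", "c"]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Casting lazily yields a categorical whose categories are not yet known.
s = ddf["x"].astype("category")
print(s.cat.known)                    # False

# Scanning the data makes the categories known (like a pandas Categorical).
s_known = s.cat.as_known()
print(s_known.cat.known)              # True
print(sorted(s_known.cat.categories)) # ['a', 'b', 'c']
```
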

I guess things will become more complex later, when operations are considered

Right, that's the complicating factor for datetimes. But perhaps everyone uses ns-precision and we'll be OK.
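
For context on why the precision matters, a quick check of the representable ranges with current pandas and NumPy:

```python
import numpy as np
import pandas as pd

# With nanosecond precision stored in int64, pandas can only represent
# roughly the years 1677-2262 (the range mentioned above).
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807

# NumPy's datetime64 allows coarser units, trading precision for range.
print(np.datetime64("1000-01-01", "s"))  # representable with seconds
print(np.datetime64("9999-12-31", "D"))  # or with days
```
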

@jorisvandenbossche
Member

Another question taking one step back: which "aspects" of the data types do we want to standardize?

Is it only about "logical" data types (like "an implementation of the standard should have a timestamp type"), or do we also want to standardize the names (e.g. datetime64 vs timestamp), the implementation details (relating to the question about datetime units, categories of a categorical dtype, etc.), the API for how to create / compare dtypes in Python, ...?

@datapythonista
Member Author

Another question taking one step back: which "aspects" of the data types do we want to standardize?

Good question. In a first stage we care about interchanging data (#25). So, if you receive a "standard" data frame, you can consume its information. I guess we need at least partial name standardization. The consumer needs to know what's in each column of the data frame, so the names of the types need to be known and understood by the consumer. I guess they don't necessarily need to be the types that implementations expose to their users.

For more complex types like categoricals, I'd say that comparing dtypes is out of scope for this stage. So, consumers should compare them manually (e.g. comparing the categories) if they need to.
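
As a sketch of what "comparing them manually" could look like with pandas dtypes (the helper name is made up and not part of any standard):

```python
import pandas as pd

# Hypothetical helper a consumer could use to compare two categorical
# dtypes by hand, as suggested above.
def categoricals_equivalent(a: pd.CategoricalDtype, b: pd.CategoricalDtype) -> bool:
    return (
        a.ordered == b.ordered
        and list(a.categories) == list(b.categories)
    )

left = pd.CategoricalDtype(["a", "b"], ordered=True)
right = pd.CategoricalDtype(["a", "b"], ordered=True)
print(categoricals_equivalent(left, right))  # True
```
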

We're finishing documentation on the scope and use cases for this first stage. I think these documents will facilitate discussions, particularly about goals and scope, and make sure we're all on the same page.

Does this make sense?

@rgommers
Member

Is it only about "logical" data types (like "an implementation of the standard should have a timestamp type"), or do we also want to standardize the names (e.g. datetime64 vs timestamp), the implementation details (relating to the question about datetime units, categories of a categorical dtype, etc.), the API for how to create / compare dtypes in Python, ...?

https://github.com/data-apis/workgroup/issues/2 may help answer this. I'd say the goal is to standardize names, signatures and expected behaviour, but not implementation-related aspects. I think what you call "implementation details" here really is behaviour, so yes, either specify it or (e.g. for things that vary too much or are still in flux in libraries) clearly say it's out of scope.

We're finishing documentation on the scope and use cases for this first stage. I think these documents will facilitate discussions, particularly about goals and scope, and make sure we're all on the same page.

Yes I think that will be helpful. In particular also non-goals I suspect.

@datapythonista
Member Author

  1. For categorical, there's some debate about whether the categories (set
    of valid values) are part of the array or data type. In pandas it's part of
    the dtype. I believe that in PyArrow they're part of the array's data model.

@TomAugspurger do you remember where this conversation was happening? I think I read about this some time ago, but can't find it now. Thanks!

@TomAugspurger

TomAugspurger commented Aug 24, 2020 via email

@devin-petersohn
Member

@maartenbreddels @devin-petersohn can you let us know about the data types you implement please?

Modin handles types identically to how pandas does, including upcasts and type inference. We reuse the pandas and numpy types.

I do think that some object data type is important to support for data that is in the middle of being cleaned (e.g., strings in float data). This is one of the things that dataframes handle more elegantly than tables: a database would not allow that data in the first place.
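
For illustration, the kind of mid-cleaning data this refers to (shown with pandas, which Modin mirrors):

```python
import pandas as pd

# Mixed data that is still being cleaned: strings mixed into float data.
s = pd.Series([1.5, "n/a", 2.0, "unknown"])
print(s.dtype)  # object

# After cleaning, it can be converted to a proper numeric column.
cleaned = pd.to_numeric(s, errors="coerce")
print(cleaned.dtype)  # float64
```
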

@jorisvandenbossche
Member

I believe that in PyArrow they're part of the array's data model.

In Arrow, the categories are indeed no longer part of the type itself. This was changed in https://issues.apache.org/jira/browse/ARROW-3144 / apache/arrow#4316 (which contains some motivation)

@gatorsmile

PySpark follows ANSI SQL, although we have not added support for all of its data types yet.

Recently, we have been revisiting the type coercion rules and type casting. Below is the type casting spec from ANSI SQL:

[image: ANSI SQL type casting specification table]
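
To make the behavioural difference concrete, a small sketch of how ANSI casting surfaces in PySpark, assuming Spark 3.x where the `spark.sql.ansi.enabled` flag controls this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With ANSI mode off (the historical default), an invalid cast returns NULL.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT) AS x").show()      # x is NULL

# With ANSI mode on, the same cast fails at runtime, following the SQL spec.
spark.conf.set("spark.sql.ansi.enabled", "true")
# spark.sql("SELECT CAST('abc' AS INT) AS x").show()    # raises an error
```
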

@kkraus14
Collaborator

Tying in with #9, I would add that a null dtype, or something similar that allows us to avoid having actual data/allocations underneath the column, would at least be a nice optional dtype.

This possibly conflicts with some of the issues with typed nulls earlier in the thread, but having to special-case the handling of columns without memory underneath them in order to accomplish the same with existing dtypes sounds problematic.
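
For reference, Arrow already has something along these lines; a quick sketch with pyarrow (exact memory accounting may vary by version):

```python
import pyarrow as pa

# Arrow's null type: a NullArray only stores a length, so there is
# effectively no data buffer allocated underneath the column.
arr = pa.nulls(1_000_000)
print(arr.type)    # null
print(arr.nbytes)  # 0 -- no backing data allocation
```
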

@rgommers changed the title from "Data types" to "Data types to support" on Apr 29, 2021