Data types to support #26
Comments
What sort of coverage for these data types do other dataframe libraries have?
A couple notes on specific dtypes:
1. For categorical, there's some debate about whether the categories (the set of valid values) are part of the array or the data type. In pandas it's part of the dtype. I believe that in PyArrow they're part of the array's data model.
2. Datetimes sound hard, especially with timezones. But they also seem crucial to support. Do we think that the supported precision can be left as an implementation detail for libraries to decide on?

And is the `dictionary` dtype in the last section a "mapping" type? You might want to rename that to avoid the clash with PyArrow's dictionary type (which is like pandas Categorical).
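A minimal pandas illustration of point 1, i.e. the categories living on the dtype (assuming only that pandas is installed):

```python
import pandas as pd

# In pandas the set of valid values is carried by the dtype itself
dtype = pd.CategoricalDtype(categories=["low", "mid", "high"], ordered=True)
s = pd.Series(["low", "high"], dtype=dtype)

print(s.dtype.categories)  # Index(['low', 'mid', 'high'], dtype='object')
print(s.dtype.ordered)     # True
```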
On Mon, Aug 17, 2020 at 6:32 PM, Marc Garcia wrote:
What data types should be part of the standard? For the array API, the types have been discussed here <data-apis/array-api#15>.
A good reference for data types for data frames is the Arrow data types documentation <https://arrow.apache.org/docs/cpp/api/datatype.html>. The page probably contains many more types than the ones we want to support in the standard.
Topics to make decisions on:
- Which data types should be supported by the standard?
- Are implementations expected to provide extra data types? Should we have a list of optional types, or consider types not part of the standard out of scope?
- Missing data is discussed separately in #9 <#9>
These are IMO the main types (feel free to disagree):
- boolean
- int8 / uint8
- int16 / uint16
- int32 / uint32
- int64 / uint64
- float32
- float64
- string
- categorical
- datetime64 (requires discussion; pandas uses nanoseconds since the epoch as the unit, which can represent years 1677 to 2262 <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timestamp-limitations>)
Some other types that could be considered:
- decimal
- python object
- binary
- date
- time
- timedelta
- period
- complex
And also types based on other types that could be considered:
- date + timezone
- numeric + unit
- interval
- struct
- list
- dictionary
Thanks Tom, those are great points.
I assume Dask provides the same as pandas. For Vaex and Modin I couldn't find information in the documentation. @maartenbreddels @devin-petersohn can you let us know about the data types you implement please?
I guess the API could be defined in a way that we don't need to be opinionated here. But it's a very good point, worth discussing.
Assuming we use int64 values representing time units since the POSIX epoch, I think consumers should be able to know what the unit is. I'd say that in the same way categories will have some metadata (the category names/values), datetimes can have the unit as metadata. I guess things will become more complex later, when operations are considered, but for the data interchange (we're trying to narrow the discussion for now, see #25), I think just having the list of supported units should be enough. Let me know if I'm missing something or if it doesn't make sense to you.
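As a rough sketch of what such metadata could look like on the interchange side (the `ColumnDtype` name and its fields are invented here for illustration, not part of any proposal):

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class ColumnDtype:
    """Hypothetical per-column type descriptor a consumer could inspect."""
    kind: str                                   # logical type, e.g. "int64", "categorical", "datetime"
    unit: Optional[str] = None                  # time unit for datetimes: "s", "ms", "us" or "ns"
    categories: Optional[Sequence[str]] = None  # category names/values for categoricals

# A datetime column stored as int64 counts since the POSIX epoch, unit in metadata
dt_dtype = ColumnDtype(kind="datetime", unit="ns")

# A categorical column whose metadata carries the valid category values
cat_dtype = ColumnDtype(kind="categorical", categories=("low", "mid", "high"))
```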
Good point, renamed.
The only difference is that Dask adds the concept of a Categorical with "unknown" categories (in addition to Categoricals with known categories). Otherwise they're the same.
Right, that's the complicating factor for datetimes. But perhaps everyone uses ns-precision and we'll be OK.
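For reference, a quick check of what ns precision buys (assuming NumPy and pandas; the printed bounds match the pandas limitations page linked above):

```python
import numpy as np
import pandas as pd

# int64 nanoseconds since the epoch span roughly +/- 292 years around 1970
span_years = np.iinfo(np.int64).max / (1e9 * 60 * 60 * 24 * 365.25)
print(round(span_years))  # 292

# pandas exposes the resulting bounds directly
print(pd.Timestamp.min)   # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)   # 2262-04-11 23:47:16.854775807
```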
Another question, taking one step back: which "aspects" of the data types do we want to standardize? Is it only about "logical" data types (like "an implementation of the standard should have a timestamp type"), or do we also want to standardize the names (e.g. datetime64 vs timestamp), the implementation details (relating to the question about datetime units, categories of a categorical dtype, etc.), the API for how to create and compare dtypes in Python, ...?
Good question. In the first stage we care about interchanging data (#25): you receive a "standard" data frame and you can consume its information. I guess we need at least partial name standardization. The consumer needs to know what's in each column of the data frame, so the names of types need to be known and understood by the consumer. They don't necessarily need to be the types that implementations expose to their users. For more complex types like categoricals, I'd say that comparing dtypes is out of scope for this stage, so consumers should compare them manually (e.g. by comparing the categories) if they need to. We're finishing documentation on the scope and use cases for this first stage. I think these documents will facilitate discussions, particularly about goals and scope, and make sure we're all on the same page. Does this make sense?
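To illustrate what "comparing manually" could look like with today's pandas dtypes (just a sketch, not a proposed API):

```python
import pandas as pd

a = pd.CategoricalDtype(categories=["a", "b", "c"], ordered=False)
b = pd.CategoricalDtype(categories=["a", "b"], ordered=False)

# The consumer inspects the category metadata and compares it itself
print(list(a.categories))                      # ['a', 'b', 'c']
print(set(b.categories) <= set(a.categories))  # True: b's categories are a subset of a's
```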
https://github.com/data-apis/workgroup/issues/2 may help answer this. I'd say the goal is to standardize names, signatures and expected behaviour, but not implementation-related aspects. I think what you call "implementation details" here really is behaviour, so yes, either specify it or (e.g. for things that vary too much or are still in flux in libraries) clearly say it's out of scope.
Yes, I think that will be helpful. In particular also the non-goals, I suspect.
@TomAugspurger do you remember where this conversation was happening? I think I read about this some time ago, but can't find it now. Thanks!
I don't recall.
Modin handles types identically to how pandas does, including upcasts and type inference. We reuse the pandas and NumPy types. I do think that some …
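For readers less familiar with the pandas behaviour being reused here, a small example of the kind of upcasting mentioned (plain pandas, no Modin required):

```python
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)                     # int64

# Introducing missing values upcasts the integer column to float64
print(s.reindex([0, 1, 5]).dtype)  # float64
```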
In Arrow, the categories are indeed no longer part of the type itself. This was changed in https://issues.apache.org/jira/browse/ARROW-3144 / apache/arrow#4316 (which contains some motivation).
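To make that concrete, a small PyArrow example (the type records only the index and value types; the categories live on the array):

```python
import pyarrow as pa

arr = pa.array(["low", "high", "low"]).dictionary_encode()
print(arr.type)        # dictionary<values=string, indices=int32, ordered=0>
print(arr.dictionary)  # the category values, stored on the array rather than the type
```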
PySpark follows ANSI SQL, although we have not supported all the data types yet. Recently, we have been revisiting the type coercion rules and type casting. Below is the type casting spec from ANSI SQL:
Looping in with #9, I would add that [...]. This possibly conflicts with some of the issues of typed nulls earlier in the thread, but having to special-case the handling of columns without memory underneath them to accomplish the same with existing dtypes sounds problematic.
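As a point of comparison for columns with no memory underneath them, Arrow already ships a null type whose arrays carry no value buffers (an adjacent example only, not necessarily what the comment above had in mind):

```python
import pyarrow as pa

arr = pa.nulls(3)   # a length-3 array of Arrow's "null" type
print(arr.type)     # null
print(arr.nbytes)   # 0: no value buffers back this column
```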