Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify dtype type in ColumnBuffers elements #87

Merged
merged 3 commits into from
Oct 12, 2022

Conversation

flexatone
Copy link
Contributor

@flexatone flexatone commented Sep 24, 2022

Thanks for this team's efforts in implementing the dataframe interchange protocol.

While experimenting on an implementation for StaticFrame (static-frame/static-frame#621), I found one aspect of the ABC and associated types unclear.

The ColumnBuffers TypedDict defines its elements as pairs of Buffer and a dtype, but only types the dtype as Any. I assume that the dtype should be provided as the same tuple returned from Column.dtype. This PR makes this explicit by defining a Dtype type alias and using it for the return value of Column.dtype as well as the elements in ColumnBuffers.

If my assumption is incorrect, it would be helpful to specify the type of the dtype values returned from ColumnBuffers.

Copy link
Member

@honno honno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think this would be useful.

If my assumption is incorrect

Just as a sanity check I checked and pandas/vaex/cuDF/modin all return the type described in Column.dtype for the get_buffers() values. A. TODO I realize for dataframe-interchange-tests is generalise test_dtype and use it in test_get_buffers.

protocol/dataframe_protocol.py Outdated Show resolved Hide resolved
Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @flexatone, this looks correct to me and is a useful improvement. And thanks @honno for the review.

Also nice to learn about StaticFrame!

@jorisvandenbossche
Copy link
Member

If my assumption is incorrect

Just as a sanity check I checked and pandas/vaex/cuDF/modin all return the type described in Column.dtype for the get_buffers() values. A. TODO I realize for dataframe-interchange-tests is generalise test_dtype and use it in test_get_buffers.

Note that this is actually not correct, depending on how you interpret it. Yes, the buffers' dtype returns a similar type of DType tuple, but it should not necessarily return the same dtype tuple as its Column.dtype does, as the buffer can have a different dtype than the column.

It seems that we all interpreted this wrongly and all implementations got this wrong (or the text about "the data buffer's associated dtype" is wrong), see apache/arrow#37598, pandas-dev/pandas#54781, pola-rs/polars#10787 (and the same for StaticFrame mentioned above, from a quick look).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants