[Python] Conventions around PyCapsule Interface and choosing Array/Stream export #40648
For reference, the PR implementing the

My take on this is that as long as the object has an unambiguous interpretation as a contiguous array (or might have one, since it might take a loop over something that is not already Arrow data to figure this out), I think it's fine for

For something like an

There are other assumptions that can't be captured by the mere existence of either of those methods, like exactly how expensive it will be to call any one of them. In shapely/shapely#1953 both are fairly expensive because the data are not Arrow yet. For a database driver, it might be expensive to consume the stream because the data haven't arrived over the network yet. The Python buffer protocol has a

For consuming in nanoarrow, the current approach is to use
Being able to infer the input structure also significantly helps static typing. For example, I have type hints that I'm writing for geoarrow-rust that include:

```python
@overload
def centroid(input: ArrowArrayExportable) -> PointArray: ...
@overload
def centroid(input: ArrowStreamExportable) -> ChunkedPointArray: ...
def centroid(
    input: ArrowArrayExportable | ArrowStreamExportable,
) -> PointArray | ChunkedPointArray: ...
```

I'm not sure which overload a type checker would pick if the input object had both dunder methods. I suppose it would always return the union. But being able to use structural types in this way is quite useful for static type checking and IDE autocompletion, which are really sore spots right now with pyarrow.
Does
So your argument is that
```python
import pyarrow as pa

type(pa.array(["a" * 2 ** 20 for _ in range(2 ** 10)]))
#> pyarrow.lib.StringArray
type(pa.array(["a" * 2 ** 20 for _ in range(2 ** 11)]))
#> pyarrow.lib.ChunkedArray
```

This is also true of the
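If I'm reading this example right, the chunking in the second call falls out of `StringArray`'s 32-bit offsets: a single array can address at most 2**31 - 1 bytes of character data, and only the larger input crosses that line. A quick check of the arithmetic (the exact threshold is an assumption about pyarrow's internals):

```python
# StringArray stores character data behind int32 offsets, so one contiguous
# array tops out at 2**31 - 1 bytes (assumed auto-chunking threshold).
OFFSET_LIMIT = 2 ** 31 - 1

total_small = (2 ** 20) * (2 ** 10)  # 1 GiB of "a"s: fits one StringArray
total_large = (2 ** 20) * (2 ** 11)  # 2 GiB of "a"s: exceeds int32 offsets
```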
Would the typing hints be significantly different for a
Maybe *could* is more like it. pyarrow has an
Describe the usage question you have. Please include as many useful details as possible.
👋 I've been excited about the PyCapsule interface, and have been implementing it in my geoarrow-rust project. Every function call accepts any Arrow PyCapsule interface object, no matter its producer. It's really amazing!
Fundamentally, my question is whether the existence of methods on an object should allow for an inference of its storage type. That is, should it be possible to observe whether a producer object is chunked or not based on whether it exports `__arrow_c_array__` or `__arrow_c_stream__`? I had been expecting yes, as pyarrow implements only the former on `Array` and `RecordBatch` and only the latter on `ChunkedArray` and `Table` (to my knowledge). But this question came up here, where nanoarrow implements both `__arrow_c_array__` and `__arrow_c_stream__`.

I'd argue that it's simpler to define only a single type of export method on a class and allow the consumer to convert to a different representation if they need to. This communicates more information about how the existing data is already stored in memory. But in general I think it's really useful if the community is able to agree on a convention here, which will inform whether consumers can expect this invariant to hold or not.
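The "single export method" convention argued for here can be sketched as a producer that deliberately exposes only the stream dunder. `ChunkedWrapper` is hypothetical; a real producer would return a PyCapsule wrapping a C `ArrowArrayStream` rather than a Python iterator:

```python
# Sketch of the single-export convention: a chunked container exposes only
# __arrow_c_stream__, so its method set truthfully reflects its memory layout.

class ChunkedWrapper:
    def __init__(self, chunks):
        self._chunks = chunks

    def __arrow_c_stream__(self, requested_schema=None):
        # Placeholder: real code would return one PyCapsule wrapping an
        # ArrowArrayStream that yields each chunk in order.
        return iter(self._chunks)

    # Deliberately no __arrow_c_array__: a consumer that needs one
    # contiguous array must concatenate the stream itself, rather than
    # the producer implying the data is already contiguous.
```

Under this convention, the structural-typing overloads shown earlier in the thread resolve unambiguously, because no object ever matches both protocols.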
Component(s)
Python