Support for Arrow PyCapsule Interface #2630
Comments
Yes, I think this is something we'd be happy to support.
I agree that should probably be a `RecordBatchReader`.
Yeah, I don't think that would make sense.
Yeah, that would make a lot of sense.
I don't remember if we return a …
If I'm reading this correctly (see `python/python/lance/dataset.py`, lines 2345 to 2346 at `fa089be`):
Since `LanceSchema` has pyarrow interop anyway (lines 47 to 61 at `fa089be`), it might as well expose/ingest C schemas too. You could easily reuse the pyarrow dunders if you don't want to manage the Rust FFI yourselves.
👋 The Arrow project recently created the Arrow PyCapsule Interface, a new protocol for sharing Arrow data in Python. Among its goals is allowing Arrow data interchange without requiring the use of pyarrow, but I'm also excited about the prospect of an ecosystem that can share data only by the presence of dunder methods, where producer and consumer don't have to have prior knowledge of each other.
I'm trying to promote usage of this protocol throughout the Python Arrow ecosystem.
On the write side, through `write_dataset`, it looks like `coerce_reader` does not yet check for `__arrow_c_stream__`. It would be awesome if `coerce_reader` could check for `__arrow_c_stream__` and just call `pyarrow.RecordBatchReader.from_stream`. In the longer term, you could potentially remove the pyarrow dependency altogether, though I understand if that's not a priority.

On the read side, would you consider changing the return type of `to_batches` to something like a `pyarrow.RecordBatchReader`? This would potentially not even be a backwards-incompatible change, because the `RecordBatchReader` still acts as an iterator of `RecordBatch`, but it also has the benefit of holding the Arrow iterator at the C level, so it can be passed to other compiled code without needing to iterate the Python loop.

Maybe there are some classes that make sense to have `__arrow_c_stream__` defined on them directly? Maybe `LanceFragment`? It might not make sense if there are still required parameters to materialize an Arrow stream, like a column projection or an expression.

Edit: on top, it would also be awesome to integrate the pycapsule interface with `LanceSchema`.