Support for Arrow PyCapsule Interface #2630

Open
kylebarron opened this issue Jul 22, 2024 · 4 comments

@kylebarron

kylebarron commented Jul 22, 2024

👋 The Arrow project recently created the Arrow PyCapsule Interface, a new protocol for sharing Arrow data in Python. Among its goals is allowing Arrow data interchange without requiring pyarrow, but I'm also excited about the prospect of an ecosystem that can share data purely through the presence of dunder methods, where producer and consumer need no prior knowledge of each other.

I'm trying to promote usage of this protocol throughout the Python Arrow ecosystem.

On the write side, through write_dataset, it looks like coerce_reader does not yet check for __arrow_c_stream__. It would be awesome if it did, and just called pyarrow.RecordBatchReader.from_stream. In the longer term you could potentially remove the pyarrow dependency altogether, though I understand if that's not a priority.
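
A minimal sketch of what that check could look like (coerce_reader's real signature and existing fallback paths are assumed, not shown; RecordBatchReader.from_stream requires a recent pyarrow):

import pyarrow as pa

def coerce_reader(data) -> pa.RecordBatchReader:
    # If the producer implements the PyCapsule stream protocol,
    # pyarrow can build a reader from it directly.
    if hasattr(data, "__arrow_c_stream__"):
        return pa.RecordBatchReader.from_stream(data)
    ...  # existing coercion paths (pa.Table, iterators of batches, etc.)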

On the read side, would you consider changing the return type of to_batches to something like a pyarrow.RecordBatchReader? This might not even be a backwards-incompatible change, because a RecordBatchReader still acts as an iterator of RecordBatch, but it also has the benefit of holding the Arrow stream at the C level, so it can be passed to other compiled code without iterating a Python loop.
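
To illustrate the compatibility point, here's a self-contained sketch: a pa.RecordBatchReader can stand in anywhere callers currently iterate batches.

import pyarrow as pa

def consume(batches):
    # Existing callers that just iterate keep working if to_batches()
    # returns a pa.RecordBatchReader instead of a plain generator.
    for batch in batches:
        print(batch.num_rows)

schema = pa.schema([("x", pa.int64())])
reader = pa.RecordBatchReader.from_batches(
    schema, [pa.RecordBatch.from_pydict({"x": [1, 2, 3]}, schema=schema)]
)
consume(reader)  # a RecordBatchReader is itself an Iterator[RecordBatch]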

Maybe there are some classes on which it makes sense to define __arrow_c_stream__ directly? Maybe LanceFragment? It might not make sense if there are still required parameters to materialize an Arrow stream, like a column projection or an expression.

Edit: on top of that, it would also be awesome to integrate the PyCapsule interface with LanceSchema.

@wjones127
Contributor

Yes, I think this is something we'd be happy to support.

On the read side, would you consider changing the return type of to_batches to something like a pyarrow.RecordBatchReader? This might not even be a backwards-incompatible change, because a RecordBatchReader still acts as an iterator of RecordBatch,

I agree that should probably be a RecordBatchReader.

Maybe there are some classes on which it makes sense to define __arrow_c_stream__ directly? Maybe LanceFragment? It might not make sense if there are still required parameters to materialize an Arrow stream, like a column projection or an expression.

Yeah, I don't think that would make sense. A LanceFragment doesn't represent in-memory data, just something on disk that can be scanned. I think it should instead just have a to_batches() method, which I believe it does.

Edit: on top of that, it would also be awesome to integrate the PyCapsule interface with LanceSchema.

Yeah that would make a lot of sense.

@westonpace
Contributor

On the read side, would you consider changing the return type of to_batches to something like a pyarrow.RecordBatchReader? This might not even be a backwards-incompatible change, because a RecordBatchReader still acts as an iterator of RecordBatch, but it also has the benefit of holding the Arrow stream at the C level, so it can be passed to other compiled code without iterating a Python loop.

I don't remember if we return a RecordBatchReader here already or not. However, if we don't, I agree we should be returning something that supports __arrow_c_stream__. Other than the inputs/outputs, which I think you have covered (also merge_insert, and wherever else we accept/consume a RecordBatchReader), I'm not sure we have much else that maps to __arrow_c_stream__.

@kylebarron
Author

I don't remember if we return a RecordBatchReader here already or not. However, if we don't, I agree we should be returning something that supports __arrow_c_stream__.

If I'm reading this correctly, to_batches currently returns a Python iterator:

def to_batches(self) -> Iterator[RecordBatch]:
    yield from self.to_reader()
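
Assuming to_reader() already returns a pa.RecordBatchReader, the change could be as small as this sketch:

def to_batches(self) -> pa.RecordBatchReader:
    # Returning the reader directly keeps plain iteration working,
    # while also exposing __arrow_c_stream__ for zero-copy handoff.
    return self.to_reader()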

@kylebarron
Author

LanceSchema already has pyarrow interop:

/// Convert the schema to a PyArrow schema.
pub fn to_pyarrow(&self) -> PyArrowType<ArrowSchema> {
    PyArrowType(ArrowSchema::from(&self.0))
}

/// Create a Lance schema from a PyArrow schema.
///
/// This will assign field ids in depth-first order. Be aware this may not
/// match the correct schema for a particular table.
#[staticmethod]
pub fn from_pyarrow(schema: PyArrowType<ArrowSchema>) -> PyResult<Self> {
    let schema = Schema::try_from(&schema.0)
        .map_err(|err| PyValueError::new_err(format!("Failed to convert schema: {}", err)))?;
    Ok(Self(schema))
}

So it might as well expose/ingest C schemas too. You could easily reuse the pyarrow dunders if you don't want to manage the Rust FFI yourselves.
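
For example, a Python-level shim (hypothetical, assuming the to_pyarrow binding shown above) could delegate to pyarrow's existing implementation:

class LanceSchema:
    def __arrow_c_schema__(self):
        # Delegate to the equivalent pyarrow.Schema, which already
        # implements the PyCapsule schema protocol.
        return self.to_pyarrow().__arrow_c_schema__()

Ingest could go the other way via pa.schema(obj), which accepts any object exposing __arrow_c_schema__ in recent pyarrow versions.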
