Support for Arrow PyCapsule Interface #2630

Open
kylebarron opened this issue Jul 22, 2024 · 4 comments

@kylebarron

kylebarron commented Jul 22, 2024

👋 The Arrow project recently created the Arrow PyCapsule Interface, a new protocol for sharing Arrow data in Python. Among its goals is allowing Arrow data interchange without requiring pyarrow, but I'm also excited about the prospect of an ecosystem that can share data purely through the presence of dunder methods, where producer and consumer need no prior knowledge of each other.

I'm trying to promote usage of this protocol throughout the Python Arrow ecosystem.

On the write side, through write_dataset, it looks like coerce_reader does not yet check for __arrow_c_stream__. It would be awesome if it did, and just called pyarrow.RecordBatchReader.from_stream. In the longer term you could potentially remove the pyarrow dependency altogether, though I understand if that's not a priority.
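
A minimal sketch of what that check could look like (coerce_reader's real signature and existing fallback paths are assumed, not shown; RecordBatchReader.from_stream requires a recent pyarrow):

import pyarrow as pa

def coerce_reader(data) -> pa.RecordBatchReader:
    # If the producer implements the PyCapsule stream protocol,
    # pyarrow can build a reader from it directly.
    if hasattr(data, "__arrow_c_stream__"):
        return pa.RecordBatchReader.from_stream(data)
    ...  # existing coercion paths (pa.Table, iterators of batches, etc.)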

On the read side, would you consider changing the return type of to_batches to something like a pyarrow.RecordBatchReader? This might not even be a backwards-incompatible change, because a RecordBatchReader still acts as an iterator of RecordBatch, but it also has the benefit of holding the Arrow stream at the C level, so it can be passed to other compiled code without iterating a Python loop.
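
To illustrate the compatibility point, here's a self-contained sketch: a pa.RecordBatchReader can stand in anywhere callers currently iterate batches.

import pyarrow as pa

def consume(batches):
    # Existing callers that just iterate keep working if to_batches()
    # returns a pa.RecordBatchReader instead of a plain generator.
    for batch in batches:
        print(batch.num_rows)

schema = pa.schema([("x", pa.int64())])
reader = pa.RecordBatchReader.from_batches(
    schema, [pa.RecordBatch.from_pydict({"x": [1, 2, 3]}, schema=schema)]
)
consume(reader)  # a RecordBatchReader is itself an Iterator[RecordBatch]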

Maybe there are some classes on which it makes sense to define __arrow_c_stream__ directly? Maybe LanceFragment? It might not make sense if there are still required parameters to materialize an Arrow stream, like a column projection or an expression.

Edit: on top of that, it would also be awesome to integrate the PyCapsule interface with LanceSchema.

@wjones127
Contributor

Yes, I think this is something we'd be happy to support.

On the read side, would you consider changing the return type of to_batches to something like a pyarrow.RecordBatchReader? This might not even be a backwards-incompatible change, because a RecordBatchReader still acts as an iterator of RecordBatch,

I agree that should probably be a RecordBatchReader.

Maybe there are some classes on which it makes sense to define __arrow_c_stream__ directly? Maybe LanceFragment? It might not make sense if there are still required parameters to materialize an Arrow stream, like a column projection or an expression.

Yeah, I don't think that would make sense. A LanceFragment doesn't represent in-memory data, just something on disk that can be scanned. I think it should instead just have a to_batches() method, which I believe it does.

Edit: on top of that, it would also be awesome to integrate the PyCapsule interface with LanceSchema.

Yeah that would make a lot of sense.

@westonpace
Contributor

On the read side, would you consider changing the return type of to_batches to something like a pyarrow.RecordBatchReader? This might not even be a backwards-incompatible change, because a RecordBatchReader still acts as an iterator of RecordBatch, but it also has the benefit of holding the Arrow stream at the C level, so it can be passed to other compiled code without iterating a Python loop.

I don't remember if we return a RecordBatchReader here already or not. However, if we don't, I agree we should be returning something that supports __arrow_c_stream__. Other than the inputs/outputs, which I think you have covered (also merge_insert, and wherever else we accept/consume a RecordBatchReader), I'm not sure we have much else that maps to __arrow_c_stream__.

@kylebarron
Author

I don't remember if we return a RecordBatchReader here already or not. However, if we don't, I agree we should be returning something that supports __arrow_c_stream__.

If I'm reading this correctly, to_batches currently returns a Python iterator:

def to_batches(self) -> Iterator[RecordBatch]:
    yield from self.to_reader()
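
Assuming to_reader() already returns a pa.RecordBatchReader, the change could be as small as this sketch:

def to_batches(self) -> pa.RecordBatchReader:
    # Returning the reader directly keeps plain iteration working,
    # while also exposing __arrow_c_stream__ for zero-copy handoff.
    return self.to_reader()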

@kylebarron
Author

LanceSchema already has pyarrow interop:

/// Convert the schema to a PyArrow schema.
pub fn to_pyarrow(&self) -> PyArrowType<ArrowSchema> {
    PyArrowType(ArrowSchema::from(&self.0))
}

/// Create a Lance schema from a PyArrow schema.
///
/// This will assign field ids in depth-first order. Be aware this may not
/// match the correct schema for a particular table.
#[staticmethod]
pub fn from_pyarrow(schema: PyArrowType<ArrowSchema>) -> PyResult<Self> {
    let schema = Schema::try_from(&schema.0)
        .map_err(|err| PyValueError::new_err(format!("Failed to convert schema: {}", err)))?;
    Ok(Self(schema))
}

So it might as well expose/ingest C schemas too. You could easily reuse the pyarrow dunders if you don't want to manage the Rust FFI yourselves.
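
For example, a Python-level shim (hypothetical, assuming the to_pyarrow binding shown above) could delegate to pyarrow's existing implementation:

class LanceSchema:
    def __arrow_c_schema__(self):
        # Delegate to the equivalent pyarrow.Schema, which already
        # implements the PyCapsule schema protocol.
        return self.to_pyarrow().__arrow_c_schema__()

Ingest could go the other way via pa.schema(obj), which accepts any object exposing __arrow_c_schema__ in recent pyarrow versions.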
