
Read Arrow C Stream from Arrow PyCapsule Interface #501

Closed
wants to merge 6 commits

Conversation


@kylebarron commented Jul 23, 2024

Closes #498.

This essentially just vendors these lines of code (backported to pyo3 0.20).
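
For reference, the vendored logic boils down to roughly the following (a sketch rather than the exact code in this PR; it assumes arrow-rs's ffi_stream module and the pyo3 0.20 GIL-ref API, and the function name is made up):

    use arrow::ffi_stream::{ArrowArrayStreamReader, FFI_ArrowArrayStream};
    use pyo3::exceptions::PyValueError;
    use pyo3::prelude::*;
    use pyo3::types::PyCapsule;

    /// Import an Arrow C stream from any object implementing __arrow_c_stream__.
    fn import_arrow_c_stream(obj: &PyAny) -> PyResult<ArrowArrayStreamReader> {
        // Producers implementing the Arrow PyCapsule Interface expose this dunder.
        let capsule: &PyCapsule = obj.call_method0("__arrow_c_stream__")?.downcast()?;

        // Per the spec, the capsule must be named "arrow_array_stream".
        let name = capsule.name()?.and_then(|name| name.to_str().ok());
        if name != Some("arrow_array_stream") {
            return Err(PyValueError::new_err("expected an arrow_array_stream capsule"));
        }

        // Move the C stream out of the capsule, leaving an empty (released) stream
        // behind so the capsule's destructor doesn't release it a second time.
        let stream_ptr = capsule.pointer() as *mut FFI_ArrowArrayStream;
        let stream = unsafe { std::ptr::replace(stream_ptr, FFI_ArrowArrayStream::empty()) };

        ArrowArrayStreamReader::try_new(stream)
            .map_err(|err| PyValueError::new_err(err.to_string()))
    }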

@kylebarron changed the title from "Read from Arrow C Data interface" to "Read Arrow C Stream from Arrow PyCapsule Interface" on Jul 23, 2024
@jonmmease
Collaborator

Thanks @kylebarron, could you add a quick comment somewhere in the code for how we could replace the vendored code after updating arrow and pulling in pyo3-arrow?

Comment on lines 273 to 276
// pub fn from_arrow_c_stream(table: pyo3_arrow::PyTable) -> PyResult<Self> {
//     let (batches, schema) = table.into_inner();
//     Ok(VegaFusionTable::try_new(schema, batches)?)
// }
Author

Only a couple lines of code with pyo3_arrow!

Author

Aside: I never know whether there's consistency in the ecosystem between an ordering of (schema, batches) or (batches, schema) 🫠

@kylebarron
Author

@jonmmease are you able to see why the CI checks failed? It just says "job failed" for me

@jonmmease
Collaborator

I haven't had a chance to dig into it, but it's fine to leave things here. I'll finish it up for the next VegaFusion release

@jonmmease
Collaborator

Quick update. I'm slogging through updating DataFusion and arrow-rs over in #504. Once that lands we can rebase this and drop the vendored code.

@kylebarron
Author

Awesome. And I just published pyo3-arrow 0.2 last night

@jonmmease
Collaborator

jonmmease commented Aug 15, 2024

@kylebarron I made some updates to this PR after updating arrow to version 52, and updating pyo3-arrow to 0.2.0. If you have a chance to see if what I have here makes sense, that would be great.

I tested this using Polars, and it seems to be working in general, but I immediately ran into an issue where string columns from Polars fail to convert with:

C Data interface error: The datatype ""vu"" is still not supported in Rust implementation

Looks like arrow-rs doesn't support the new Utf8View type in https://github.com/apache/arrow-rs/blob/ca5fe7df5ca0bdcfd6f732cc3f10da511b753c5f/arrow/src/datatypes/ffi.rs#L30. Does that match your understanding?

Seems that work may be needed in arrow-rs, but I don't think I can route Polars through this path until this support is added.


Update: Oh, this was merged two weeks ago in apache/arrow-rs#6171, but arrow-rs 52.2.0 was released 3 weeks ago. So we'll just need to wait a bit, I guess. We can still merge this as a fallback behind the dataframe protocol and reorder later.

Comment on lines +276 to +277
pub fn from_arrow_c_stream(table: &Bound<PyAny>) -> PyResult<Self> {
    let pytable = pyo3_arrow::PyTable::from_py_object_bound(table.as_borrowed())?;
Author

If you keep this method restricted to supporting only PyTable: PyTable implements FromPyObject, so you can make it the parameter type directly.

Suggested change:
    - pub fn from_arrow_c_stream(table: &Bound<PyAny>) -> PyResult<Self> {
    -     let pytable = pyo3_arrow::PyTable::from_py_object_bound(table.as_borrowed())?;
    + pub fn from_arrow_c_stream(table: pyo3_arrow::PyTable) -> PyResult<Self> {

pub fn from_arrow_c_stream(table: &Bound<PyAny>) -> PyResult<Self> {
    let pytable = pyo3_arrow::PyTable::from_py_object_bound(table.as_borrowed())?;
    let (batches, schema) = pytable.into_inner();
    Ok(VegaFusionTable::try_new(schema, batches)?)
Author

I never know whether batches, schema or schema, batches is a better ordering 🥲

@@ -271,6 +272,13 @@ impl VegaFusionTable {
}
}

#[cfg(feature = "pyarrow")]
pub fn from_arrow_c_stream(table: &Bound<PyAny>) -> PyResult<Self> {
Author

Is this a public API, or an internal API called by your Python code? If it's a public API, it might be preferable to overload the existing from_pyarrow method.

You can check for the PyCapsule interface and fall back to pyarrow in a single method; that's the approach I'd recommend here.

You can do this by first trying to extract a PyTable, which will work for any recent pyarrow Table (or pyarrow RecordBatchReader), and then falling back to your existing from_pyarrow code:

    pub fn from_arrow(table: &Bound<PyAny>) -> PyResult<Self> {
        // Any producer of the Arrow PyCapsule Interface (recent pyarrow, Polars, etc.)
        // extracts directly as a PyTable.
        if let Ok(table) = table.extract::<pyo3_arrow::PyTable>() {
            let (batches, schema) = table.into_inner();
            Ok(VegaFusionTable::try_new(schema, batches)?)
        } else {
            // Fall back to the existing pyarrow-based conversion path.
            Self::from_pyarrow(table)
        }
    }

@kylebarron
Author

C Data interface error: The datatype ""vu"" is still not supported in Rust implementation

Ah yes, of course we'd hit this.

Correct, arrow-rs won't support the new StringView type until arrow 53 (arrow-rs didn't merge any breaking changes in 52.1 or 52.2). I believe everything is set for this support in arrow 53, though (due in September).

The PyCapsule Interface provides a mechanism to handle this: schema requests. They're a way for consumers to ask producers to cast their data export to a specific data type. I'd like to do a follow-up PR to Polars to implement this, but I haven't yet: pola-rs/polars#17676 (comment). So Polars will currently ignore schema requests.

So the two routes are either:

  • Wait for arrow 53
  • Implement schema requests in Polars; wait for a new Polars release; implement schema-request logic on the consumer side here, to ask for plain string types instead of string views (a rough sketch of that consumer-side request is below)
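
For the second route, the consumer-side request could look roughly like this (a sketch, not code from this PR; it assumes arrow-rs's FFI types and pyo3's Bound API, and the function name is made up for illustration):

    use std::ffi::CString;

    use arrow::datatypes::Schema;
    use arrow::ffi::FFI_ArrowSchema;
    use pyo3::exceptions::PyValueError;
    use pyo3::prelude::*;
    use pyo3::types::PyCapsule;

    /// Ask a producer to export its stream cast to `desired` (e.g. Utf8 instead of Utf8View).
    fn request_stream_with_schema<'py>(
        py: Python<'py>,
        producer: &Bound<'py, PyAny>,
        desired: &Schema,
    ) -> PyResult<Bound<'py, PyAny>> {
        // Export the schema we'd like to receive as an "arrow_schema" capsule,
        // as the PyCapsule Interface specifies for requested_schema.
        let ffi_schema = FFI_ArrowSchema::try_from(desired)
            .map_err(|err| PyValueError::new_err(err.to_string()))?;
        let name = CString::new("arrow_schema").unwrap();
        let capsule = PyCapsule::new_bound(py, ffi_schema, Some(name))?;

        // Producers may honor or ignore the request, so the consumer still has
        // to inspect the schema of the stream it gets back.
        producer.call_method1("__arrow_c_stream__", (capsule,))
    }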

@jonmmease
Collaborator

Thanks for the context @kylebarron, I'm fine to wait for Arrow 53. But this will also require waiting for a DataFusion release based on Arrow 53, so I guess that could end up being a month later.

Separately, does the Arrow PyCapsule interface provide a way to query for whether extracting the arrow data is a cheap operation? Ibis is one scenario I have in mind. I believe Ibis supports the PyCapsule interface, but this must require running the SQL query and pulling down every possible column, which can be very expensive. For Ibis, it's much more efficient to request only the specific columns that we actually need. For a library that already has the Arrow representation materialized in memory, I would expect it to be cheaper to return the full representation rather than build a new representation that has only the needed columns.

It would be nice if there were a way to say, please return these columns, but it's fine to return more if that's cheaper.

@kylebarron
Author

does the Arrow PyCapsule interface provide a way to query for whether extracting the arrow data is a cheap operation?

No it doesn't, but I think this largely comes down to user semantics. In the case of Ibis, you might have an Ibis Table representing a full SQL table. Passing that as arrow_func(ibis_table) will materialize the entire table, but the user can consciously call arrow_func(ibis_table.select(column_names)), which limits the amount of data collected and passed to arrow_func.

In my mind, the interface is mostly about efficiently communicating between libraries that already have an Arrow representation.

@kylebarron
Author

this will also require waiting for a DataFusion release based on Arrow 53, so I guess that could end up being a month later.

Right. I would like to implement schema requests in polars regardless, but it's a lower priority than getting it implemented to start with.

@kylebarron
Author

An update here: lamentably, arrow 53 supports StringView/BinaryView but not through the C FFI; see apache/arrow-rs#6368. So if you're restricted to using a released version of arrow (i.e. not a git tag), we'll have to wait until arrow 53.1.

@jonmmease
Collaborator

I think we are restricted to using released crates: we publish the Rust crates to crates.io so that conda-forge can rebuild the sdist from source, and I don't think that will work with a git-tagged dependency.

@kylebarron
Author

If you publish to crates.io, then yes, you can only use released versions. I might be able to get around this in arro3 specifically by vendoring the relevant parts of the FFI fixes

@kylebarron
Author

kylebarron commented Oct 6, 2024

Arrow 53.1 was released, which appears to fully support string view and binary view: kylebarron/arro3#219. I'd love to get this merged! Let me know if you need any help!

@jonmmease
Collaborator

Awesome, I'm working on updating to DataFusion 42 and Arrow 53.1 over in #513. I've started a v2 branch and I'm planning to retarget this PR to v2 after merging that one.

For VegaFusion 2, I want to make pandas and pyarrow optional and this PR is a crucial part of that plan.

@jonmmease
Collaborator

I started trying to port this to my v2 branch that's updated to arrow 53.1, but ran into a roadblock. The pyarrow feature in the arrow crate requires PyO3 0.22, which is what I've updated VegaFusion to use in v2. But it looks like pyo3-arrow requires PyO3 0.21.

Are there any blockers for you in updating to PyO3 0.22? I may be able to drop the arrow/pyarrow feature flag eventually, but there's more to figure out here.

In particular, some of VegaFusion's logic currently returns Arrow data from Rust to Python as pyarrow, and we'll need to think about what to do here in order for pyarrow to be an optional dependency. I suppose we could return the PyCapsule object from Rust and then create the pyarrow or Polars object from that on the Python side.

@kylebarron
Author

kylebarron commented Oct 7, 2024

Are there any blockers for you in updating to PyO3 0.22?

Yes. I need the numpy crate to support 0.22, which should happen any day now: PyO3/rust-numpy#442, PyO3/rust-numpy#435.

In particular, some of VegaFusion's logic currently returns Arrow data from Rust to Python as pyarrow, and we'll need to think about what to do here in order for pyarrow to be an optional dependency.

All of the relevant pyo3_arrow objects have a to_pyarrow method. Currently, this works in terms of the PyCapsule Interface, so it has a pretty high pyarrow minimum version (pyarrow v14+ for Table), but in theory we could implement the older C interface approach with pyarrow for older versions.
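
For illustration, that return path could look roughly like this (a sketch; to_pyarrow is the method mentioned above, while PyTable::try_new is an assumed constructor whose exact signature may differ, so check the pyo3-arrow docs):

    use arrow::datatypes::SchemaRef;
    use arrow::record_batch::RecordBatch;
    use pyo3::exceptions::PyValueError;
    use pyo3::prelude::*;

    /// Hand a set of RecordBatches back to Python as a pyarrow.Table.
    fn batches_to_pyarrow(
        py: Python<'_>,
        batches: Vec<RecordBatch>,
        schema: SchemaRef,
    ) -> PyResult<PyObject> {
        // Assumed constructor; pyo3-arrow's actual API may differ slightly.
        let table = pyo3_arrow::PyTable::try_new(batches, schema)
            .map_err(|err| PyValueError::new_err(err.to_string()))?;
        // to_pyarrow goes through the PyCapsule Interface, hence the pyarrow v14+
        // minimum mentioned above.
        table.to_pyarrow(py)
    }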

@jonmmease
Collaborator

All of the relevant pyo3_arrow objects have a to_pyarrow method. Currently, this works in terms of the PyCapsule Interface, so it has a pretty high pyarrow minimum version (pyarrow v14+ for Table), but in theory we could implement the older C interface approach with pyarrow for older versions.

Ok, so I could remove the arrow/pyarrow feature flag and use these methods for the conversion back to pyarrow? I'm ok with the higher pyarrow minimum version since this will be part of a major version update in VegaFusion, and will hopefully be optional.

@kylebarron
Author

Ok, so I could remove the arrow/pyarrow feature flag and use these methods for the conversion back to pyarrow?

Yep!

@kylebarron
Author

Looks like it's scheduled for Wednesday: PyO3/rust-numpy#453

@kylebarron
Author

I published pyo3-arrow 0.5 using pyo3 and numpy 0.22: https://crates.io/crates/pyo3-arrow

@jonmmease
Collaborator

Starting to work on updating to pyo3-arrow 0.5, and it looks like the minimum Python version is now 3.11 (at least I get compilation errors with abi3-py310 and below). Is that intentional? I guess this is technically exactly in line with spec-0000, but Altair is generally a bit more generous here, following closer to the Python EOL schedule, which would still require Python 3.9 support.

Here are the compilation errors with abi3-py310

   Compiling pyo3-arrow v0.5.0
error[E0432]: unresolved import `pyo3::buffer`
   --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/buffer.rs:16:11
    |
16  | use pyo3::buffer::{ElementType, PyBuffer};
    |           ^^^^^^ could not find `buffer` in `pyo3`
    |
note: found an item that was configured out
   --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.22.4/src/lib.rs:442:9
    |
442 | pub mod buffer;
    |         ^^^^^^

error[E0412]: cannot find type `Py_buffer` in module `ffi`
  --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/buffer.rs:49:25
   |
49 |         view: *mut ffi::Py_buffer,
   |                         ^^^^^^^^^ not found in `ffi`

error[E0425]: cannot find function, tuple struct or tuple variant `PyBuffer_FillInfo` in module `ffi`
  --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/buffer.rs:54:28
   |
54 |             let ret = ffi::PyBuffer_FillInfo(
   |                            ^^^^^^^^^^^^^^^^^ not found in `ffi`

error[E0412]: cannot find type `Py_buffer` in module `ffi`
  --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/buffer.rs:71:75
   |
71 |     unsafe fn __releasebuffer__(mut slf: PyRefMut<Self>, _view: *mut ffi::Py_buffer) {
   |                                                                           ^^^^^^^^^ not found in `ffi`

error[E0412]: cannot find type `Py_buffer` in module `pyo3::ffi`
  --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/buffer.rs:34:1
   |
34 | #[pymethods]
   | ^^^^^^^^^^^^ not found in `pyo3::ffi`
   |
   = note: this error originates in the attribute macro `pymethods` (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0425]: cannot find function `getbufferproc` in module `pyo3::impl_::trampoline`
  --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/buffer.rs:34:1
   |
34 | #[pymethods]
   | ^^^^^^^^^^^^ not found in `pyo3::impl_::trampoline`
   |
   = note: this error originates in the attribute macro `pymethods` (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0412]: cannot find type `getbufferproc` in module `pyo3::ffi`
  --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/buffer.rs:34:1
   |
34 | #[pymethods]
   | ^^^^^^^^^^^^ not found in `pyo3::ffi`
   |
   = note: this error originates in the attribute macro `pymethods` (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0425]: cannot find function `releasebufferproc` in module `pyo3::impl_::trampoline`
   --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/buffer.rs:34:1
    |
34  | #[pymethods]
    | ^^^^^^^^^^^^ not found in `pyo3::impl_::trampoline`
    |
note: found an item that was configured out
   --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.22.4/src/impl_/trampoline.rs:132:15
    |
132 | pub unsafe fn releasebufferproc(
    |               ^^^^^^^^^^^^^^^^^
    = note: this error originates in the attribute macro `pymethods` (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0412]: cannot find type `releasebufferproc` in module `pyo3::ffi`
  --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/buffer.rs:34:1
   |
34 | #[pymethods]
   | ^^^^^^^^^^^^ not found in `pyo3::ffi`
   |
   = note: this error originates in the attribute macro `pymethods` (in Nightly builds, run with -Z macro-backtrace for more info)


@kylebarron
Author

The minimum Python version does not need to be 3.11. You have three options:

  • Use abi3-py311
  • Remove any abi3* flag
  • Take off the buffer_protocol feature flag with no-default-features (see the snippet below)

pyo3-arrow now supports buffer-protocol input (like numpy arrays) with zero copy, as long as they are C contiguous. But the buffer protocol only became part of Python's ABI-stable subset as of Python 3.11.

Because of this, arro3 builds per-Python version wheels now. https://pypi.org/project/arro3-core/#files
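
For the third option, the dependency declaration would look something like this (a sketch; pin whatever version you're actually using):

    [dependencies]
    # Disable pyo3-arrow's default features (including buffer_protocol) so the
    # crate keeps compiling with abi3-py39 / abi3-py310.
    pyo3-arrow = { version = "0.5", default-features = false }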

@jonmmease
Collaborator

Thanks for the context @kylebarron! When I disable the default features, I'm seeing this build error (regardless of Python version).

   Compiling pyo3-arrow v0.5.0
error[E0432]: unresolved import `crate::PyArrowBuffer`
  --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/array.rs:31:13
   |
31 | use crate::{PyArrowBuffer, PyDataType, PyField};
   |             ^^^^^^^^^^^^^ no `PyArrowBuffer` in the root
   |
note: found an item that was configured out
  --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/lib.rs:25:17
   |
25 | pub use buffer::PyArrowBuffer;
   |                 ^^^^^^^^^^^^^
   = note: the item is gated behind the `buffer_protocol` feature

For more information about this error, try `rustc --explain E0432`.
error: could not compile `pyo3-arrow` (lib) due to 1 previous error

It looks like the import here might need to be conditioned on the buffer_protocol feature. Does that make sense?

https://github.com/kylebarron/arro3/blob/523585bc95fd2a0ee010c5f47a798e808e8f388d/pyo3-arrow/src/array.rs#L31
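
Presumably something like this in pyo3-arrow's src/array.rs would do it (a sketch of the fix, not the actual patch):

    // Only pull in PyArrowBuffer when the buffer_protocol feature is enabled,
    // so that --no-default-features still compiles.
    #[cfg(feature = "buffer_protocol")]
    use crate::PyArrowBuffer;
    use crate::{PyDataType, PyField};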

@kylebarron
Author

Ah, I need to add a CI check for no default features :/

@kylebarron
Author

I published pyo3-arrow 0.5.1 with a CI check to ensure --no-default-features compiles

@jonmmease
Collaborator

Merged over in #517. Thanks for all of the help on this @kylebarron, the combo of arro3 and narwhals + Arrow PyCapsule support is really convenient.

@jonmmease closed this Oct 15, 2024
Successfully merging this pull request may close these issues.

Support ingesting objects that support the Arrow PyCapsule API