[Arrow] Support producing an "arrow_array_stream" PyCapsule #13418

Tishj · 2024-08-14T14:12:12Z

This PR implements #10716

Through DuckDBPyRelation.__arrow_c_stream__ we can now produce an arrow_array_stream PyCapsule.

Some things to note:
The ArrowArrayStream contains a QueryResult. If this is a StreamQueryResult and a new query is executed before the stream has been fully exhausted, the result is invalidated and no further chunks can be fetched from the stream.

This currently produces a materialized result, meaning it is independent of the connection and is not affected by running other queries. That does mean it doesn't support larger-than-memory result sets.


Mytherin · 2024-08-15T16:20:03Z

Thanks!

Merge pull request duckdb/duckdb#13433 from lnkuiper/jemalloc_32bit
Merge pull request duckdb/duckdb#13418 from Tishj/produce_arrow_pycapsule

jorisvandenbossche · 2024-09-06T07:50:25Z

tools/pythonpkg/src/pyrelation/initialize.cpp

+
+			https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html
+		)";
+	m.def("__arrow_c_stream__", &DuckDBPyRelation::ToArrowCapsule, capsule_docs);


Not entirely familiar with the syntax here, but my reading is that this method has no keyword arguments?

It would be good to add the requested_schema keyword, even if you simply ignore it for now (which is fine, because the spec states that the handling of this keyword is "best effort" anyway). Without that keyword, consumers that pass it will get errors (pyarrow.table(..) always passes it, even if it is None).

Indeed, this does fail for pyarrow.table:

import duckdb
import pyarrow as pa
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3, 4], "b": ["a", "b", "c", "d"]})
con = duckdb.connect()
sql = "SELECT * from df"
query = con.query(sql)
test = pa.table(query)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File /Users/kyle/tmp/duckdb/tmp3.py:1
----> 1 test = pa.table(query)

File ~/tmp/duckdb/.venv/lib/python3.11/site-packages/pyarrow/table.pxi:6009, in pyarrow.lib.table()

TypeError: __arrow_c_stream__(): incompatible function arguments. The following argument types are supported:
    1. (self: duckdb.duckdb.DuckDBPyRelation) -> object

Invoked with: ┌───────┬─────────┐
│   a   │    b    │
│ int64 │ varchar │
├───────┼─────────┤
│     1 │ a       │
│     2 │ b       │
│     3 │ c       │
│     4 │ d       │
└───────┴─────────┘
, None

I believe https://github.com/duckdb/duckdb/pull/13802/files should fix this
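The failure mode can be reproduced without DuckDB or pyarrow. The following is a minimal pure-Python sketch, not the actual binding code: the class names `ProducerWithoutKeyword` and `ProducerWithKeyword` and the `consume` helper are hypothetical stand-ins, and `consume` only mimics a consumer that always passes the schema argument (the traceback shows `None` being passed).

```python
class ProducerWithoutKeyword:
    """Mimics the method as merged: __arrow_c_stream__ accepts only self."""

    def __arrow_c_stream__(self):
        # Placeholder value; a real producer returns an "arrow_array_stream" PyCapsule.
        return "capsule"


class ProducerWithKeyword:
    """Accepts requested_schema and ignores it, which the spec permits (best effort)."""

    def __arrow_c_stream__(self, requested_schema=None):
        return "capsule"


def consume(producer):
    # Hypothetical consumer: always passes the schema argument, even when it is None.
    return producer.__arrow_c_stream__(None)


try:
    consume(ProducerWithoutKeyword())
    failed = False
except TypeError:
    # The extra argument is rejected because the method signature has no slot for it.
    failed = True

result = consume(ProducerWithKeyword())
```

Here `failed` ends up `True` for the first producer, while the second producer returns its placeholder normally, which is why accepting and ignoring the keyword is enough to satisfy such consumers.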

kylebarron · 2024-09-06T12:49:16Z

tools/pythonpkg/duckdb-stubs/__init__.pyi

@@ -412,6 +412,7 @@ class DuckDBPyRelation:
    def list(self, column: str, groups: str = ..., window_spec: str = ..., projected_columns: str = ...) -> DuckDBPyRelation: ...

    def arrow(self, batch_size: int = ...) -> pyarrow.lib.Table: ...
+    def __arrow_c_stream__(self) -> object: ...


Along with @jorisvandenbossche's comment, this should be updated to match the spec: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#protocol-typehints

def __arrow_c_stream__(self, requested_schema: object | None = None) -> object: ...
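For context, a signature like the one above can be expressed as a `typing.Protocol`, so that consumers can type-annotate against it. This is a sketch under assumptions: the protocol name `ArrowStreamExportable` follows my recollection of the spec's typehints section, and `FakeRelation` and `NotExportable` are hypothetical classes used only for illustration.

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class ArrowStreamExportable(Protocol):
    # Signature suggested in the review comment above.
    def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object: ...


class FakeRelation:
    """Hypothetical stand-in for a relation object; not DuckDB's class."""

    def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object:
        # A real implementation would return a PyCapsule; a bare object is a placeholder.
        return object()


class NotExportable:
    """Has no __arrow_c_stream__ method at all."""


is_exportable = isinstance(FakeRelation(), ArrowStreamExportable)
is_not_exportable = isinstance(NotExportable(), ArrowStreamExportable)
```

Note that a `runtime_checkable` `isinstance` check only verifies that the method exists, not that its signature accepts `requested_schema`; a static type checker working from the stub file is what would catch the signature mismatch.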

Tishj added 2 commits August 14, 2024 16:05

produce a record batch reader as a PyCapsule

ba64639

add stubs for arrow_capsule

7c10fcb

duckdb-draftbot marked this pull request as draft August 14, 2024 14:44

respect the interface, use the right name for the pycapsule producing method

2a28984

Tishj marked this pull request as ready for review August 14, 2024 18:56

check for the existence of a result before importing pyarrow

f514043

duckdb-draftbot marked this pull request as draft August 15, 2024 09:26

Tishj marked this pull request as ready for review August 15, 2024 12:14

Mytherin merged commit c6ab646 into duckdb:main Aug 15, 2024
17 checks passed

jorisvandenbossche reviewed Sep 6, 2024


kylebarron reviewed Sep 6, 2024


kylebarron mentioned this pull request Sep 6, 2024

InvalidInputException: Attempting to execute an unsuccessful or closed pending query result #13793

Closed

2 tasks

kylebarron mentioned this pull request Oct 2, 2024

[Python] Promote usage of the Arrow PyCapsule Protocol (for the C Data Inteface) apache/arrow#39195

Open

8 tasks
