[Arrow] Support producing an "arrow_array_stream" PyCapsule #13418
Thanks!
Merge pull request duckdb/duckdb#13433 from lnkuiper/jemalloc_32bit
Merge pull request duckdb/duckdb#13418 from Tishj/produce_arrow_pycapsule
https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html
)";
m.def("__arrow_c_stream__", &DuckDBPyRelation::ToArrowCapsule, capsule_docs);
Not entirely familiar with the syntax here, but my reading is that this method has no keyword arguments?
It would be good to add the requested_schema keyword, even if you simply ignore it for now (which is fine, because the spec states that handling of the keyword is "best effort" anyway). Without that keyword, consumers that pass it will get errors (pyarrow.table(..) always passes it, even when it is None).
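To illustrate the failure mode described above, here is a minimal pure-Python sketch. The two producer classes and the `consume` helper are hypothetical stand-ins (not DuckDB or pyarrow code), and the returned string is only a placeholder for a real PyCapsule. It assumes the consumer always forwards `requested_schema`, even when it is None:

```python
class ProducerWithoutKeyword:
    """Hypothetical producer whose method takes no requested_schema."""

    def __arrow_c_stream__(self):
        return "capsule"  # placeholder for a real arrow_array_stream PyCapsule


class ProducerWithKeyword:
    """Hypothetical producer that accepts requested_schema and ignores it."""

    def __arrow_c_stream__(self, requested_schema=None):
        # Ignoring requested_schema is acceptable: handling is "best effort".
        return "capsule"  # placeholder for a real arrow_array_stream PyCapsule


def consume(producer, requested_schema=None):
    # Stand-in for a consumer that always passes requested_schema along.
    return producer.__arrow_c_stream__(requested_schema)


try:
    consume(ProducerWithoutKeyword())
    failed = False
except TypeError:
    # The extra argument is rejected because the method only accepts self.
    failed = True

result = consume(ProducerWithKeyword())
```

Accepting the argument and ignoring it is enough to keep such consumers working.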
Indeed, this does fail for pyarrow.table:
import duckdb
import pyarrow as pa
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3, 4], "b": ["a", "b", "c", "d"]})
con = duckdb.connect()
sql = "SELECT * from df"
query = con.query(sql)
test = pa.table(query)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File /Users/kyle/tmp/duckdb/tmp3.py:1
----> 1 test = pa.table(query)
File ~/tmp/duckdb/.venv/lib/python3.11/site-packages/pyarrow/table.pxi:6009, in pyarrow.lib.table()
TypeError: __arrow_c_stream__(): incompatible function arguments. The following argument types are supported:
1. (self: duckdb.duckdb.DuckDBPyRelation) -> object
Invoked with: ┌───────┬─────────┐
│ a │ b │
│ int64 │ varchar │
├───────┼─────────┤
│ 1 │ a │
│ 2 │ b │
│ 3 │ c │
│ 4 │ d │
└───────┴─────────┘
, None
I believe https://github.com/duckdb/duckdb/pull/13802/files should fix this
@@ -412,6 +412,7 @@ class DuckDBPyRelation:
    def list(self, column: str, groups: str = ..., window_spec: str = ..., projected_columns: str = ...) -> DuckDBPyRelation: ...

    def arrow(self, batch_size: int = ...) -> pyarrow.lib.Table: ...
    def __arrow_c_stream__(self) -> object: ...
Along with @jorisvandenbossche's comment, this should be updated to match the spec: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#protocol-typehints
def __arrow_c_stream__(self, requested_schema: object | None = None) -> object: ...
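As a rough way to check whether a pure-Python stub or class matches that signature, one could inspect the method's parameters. The helper `accepts_requested_schema` and the two stub classes below are hypothetical names for illustration only. Note that `inspect.signature` may not work on methods defined in compiled extension modules, so this is a sketch for Python-level definitions:

```python
import inspect


def accepts_requested_schema(obj) -> bool:
    """Return True if obj's __arrow_c_stream__ takes requested_schema=None."""
    method = getattr(type(obj), "__arrow_c_stream__", None)
    if method is None:
        return False
    param = inspect.signature(method).parameters.get("requested_schema")
    # The parameter must exist and default to None so callers may omit it.
    return param is not None and param.default is None


class OldStub:
    # Signature as in the diff under review: no requested_schema parameter.
    def __arrow_c_stream__(self) -> object: ...


class SpecStub:
    # Signature suggested in the review comment.
    def __arrow_c_stream__(self, requested_schema: "object | None" = None) -> object: ...


old_ok = accepts_requested_schema(OldStub())
spec_ok = accepts_requested_schema(SpecStub())
```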
This PR implements #10716
Through DuckDBPyRelation.__arrow_c_stream__ we can now produce an arrow_array_stream PyCapsule.
Some things to note:
The ArrowArrayStream contains a QueryResult. If this is a StreamQueryResult and a new query is executed before the full stream has been exhausted, the result will be invalidated and chunks can no longer be fetched from the stream.
This currently produces a materialized result, meaning it is independent of the connection and will not be affected by running other queries, but that also means it doesn't support larger-than-memory result sets.