Expose a python method to use RecordBatchReader instead of PyArrow Dataset #1814
Comments
@aersam I don't see in the PyArrow docs how we can read record batches from storage with a RecordBatchReader. I only see a path from
Well, I would also recommend implementing it in Rust: https://arrow.apache.org/rust/arrow/record_batch/trait.RecordBatchReader.html But the thing is that a RecordBatchReader can be constructed from anything, from either Rust or PyArrow. It's a very generic abstraction: just a Schema and an Iterator over RecordBatches.
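As a rough illustration of that abstraction, here is a minimal PyArrow sketch (with made-up data) showing that a RecordBatchReader is nothing more than a schema plus an iterator of record batches:

```python
import pyarrow as pa

# Minimal sketch: a RecordBatchReader is just a schema plus an iterator
# of RecordBatches. The data here is made up purely for illustration.
schema = pa.schema([("id", pa.int64()), ("value", pa.string())])

def batch_iter():
    # The batches could just as well come from a Rust scanner
    # streaming them across the FFI boundary.
    for i in range(3):
        yield pa.record_batch([pa.array([i]), pa.array([f"row-{i}"])], schema=schema)

reader = pa.RecordBatchReader.from_batches(schema, batch_iter())
for batch in reader:
    print(batch.num_rows)  # one batch at a time, never the whole table
```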
Hi @aersam. You are right that PyArrow datasets right now will be a dead end as we move to support deletion vectors, column mapping, and other new features. I've been meaning to define a new protocol that will allow us to expose something like a PyArrow Dataset, but that we can create a custom implementation of in Rust. This is tracked in apache/arrow#37504.

In the near term, though, it does seem like it might be appropriate to expose a method like:

```python
def scan(
    self,
    columns: Optional[List[str]] = None,
    filter: Optional[???] = None,
) -> pa.RecordBatchReader:
    ...
```

And implement that with a Rust-based scanner that supports newer table features.
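If such a method were added, consuming it might look roughly like the sketch below. DeltaTable is the existing deltalake class, but scan() itself is only the proposal above, and the type of its filter argument is still an open question:

```python
from deltalake import DeltaTable

dt = DeltaTable("path/to/table")

# Hypothetical: scan() does not exist yet; it is the method proposed above.
reader = dt.scan(columns=["id", "value"])  # -> pa.RecordBatchReader

for batch in reader:
    # Batches stream out of the Rust-based scanner one at a time,
    # so the whole table never has to be materialized in memory.
    ...
```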
@ion-elgreco That is implemented in C++ and is something we don't have control over. But yes, it would have many similarities to that.
Description
It's impossible to use a PyArrow Dataset to represent Column Mapping (apache/arrow#36593), and Deletion Vectors cannot be represented in a Dataset either. PyArrow Tables are more flexible, but they are fully loaded into RAM. I think the correct abstraction would be a to_recordbatchreader() method on the Delta Table which takes a (partition) filter parameter.
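For context, a short sketch of the two existing read paths being contrasted here (both methods are part of the current deltalake API; the partition column and value are only illustrative):

```python
from deltalake import DeltaTable

dt = DeltaTable("path/to/table")

# PyArrow Dataset: lazy, but cannot represent column mapping or
# deletion vectors, so it becomes a dead end for newer table features.
dataset = dt.to_pyarrow_dataset()

# PyArrow Table: more flexible, but the whole table is loaded into RAM.
table = dt.to_pyarrow_table(partitions=[("date", "=", "2023-01-01")])
```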
Use Case
Future support for deletion vectors and column mapping