Expose a python method to use RecordBatchReader instead of PyArrow Dataset #1814

Open
aersam opened this issue Nov 7, 2023 · 5 comments
Labels
binding/python Issues for the Python package enhancement New feature or request

Comments

@aersam
Contributor

aersam commented Nov 7, 2023

Description

It's impossible to use a PyArrow Dataset to represent Column Mapping (apache/arrow#36593), and Deletion Vectors cannot be represented in a Dataset either. PyArrow Tables are more flexible, but they are fully loaded into RAM. I think the correct abstraction would be a to_recordbatchreader() method on Delta Tables that takes a (partition) filter parameter.

Use Case

Future support for deletion vectors and column mapping

@aersam aersam added the enhancement New feature or request label Nov 7, 2023
@ion-elgreco
Collaborator

@aersam I don't see in the PyArrow docs how we can read record batches from storage with a RecordBatchReader. I only see a path that goes through dataset.to_batches() and then builds the reader from those batches.

@aersam
Contributor Author

aersam commented Nov 8, 2023

Well I would also recommend implementing it in Rust: https://arrow.apache.org/rust/arrow/record_batch/trait.RecordBatchReader.html

But the thing is that a RecordBatchReader can be constructed from anything, from either Rust or PyArrow. It's a very generic abstraction: it's only a Schema and an Iterator over RecordBatches.

@wjones127
Collaborator

Hi @aersam. You are right that PyArrow datasets right now will be a dead end as we move to support deletion vectors, column mapping, and other new features. I've been meaning to define a new protocol that will let us expose something like a PyArrow Dataset, but with a custom implementation that we can write in Rust. This is tracked in apache/arrow#37504.

In the near term though, it does seem like it might be appropriate to expose a method like:

def scan(
    self,
    columns: Optional[List[str]] = None,
    filter: Optional[???] = None,
) -> pa.RecordBatchReader:
    ...

And implement that with a Rust-based scanner that supports newer table features.

@ion-elgreco
Collaborator

@wjones127
Collaborator

@ion-elgreco That is implemented in C++ and something we don't have control over. But yes, it would have many similarities to that.

@ion-elgreco ion-elgreco added the binding/python Issues for the Python package label Nov 22, 2023
@ion-elgreco ion-elgreco added this to the python v0.20 milestone Nov 26, 2023
@rtyler rtyler removed this from the python v0.20 milestone Oct 8, 2024