Expose a python method to use RecordBatchReader instead of PyArrow Dataset #1814

aersam · 2023-11-07T08:56:16Z

Description

It's impossible to represent Column Mapping with a PyArrow Dataset (apache/arrow#36593), and Deletion Vectors cannot be represented in a Dataset either. PyArrow Tables are more flexible, but they are fully loaded into RAM. I think the correct abstraction would be a to_recordbatchreader() method on Delta Tables that takes a (partition) filter parameter.

Use Case

Future support for deletion vectors and column mapping

ion-elgreco · 2023-11-08T18:41:55Z

@aersam I don't see in the PyArrow docs how we can read record batches from storage with a RecordBatchReader. I only see a path that goes through dataset.to_batches() and then builds the reader from those batches.

aersam · 2023-11-08T19:54:23Z

Well I would also recommend implementing it in Rust: https://arrow.apache.org/rust/arrow/record_batch/trait.RecordBatchReader.html

But the thing is that a RecordBatchReader can be constructed from anything, from either Rust or PyArrow. It's a very generic abstraction: just a Schema and an Iterator over RecordBatches.

wjones127 · 2023-11-08T21:32:32Z

Hi @aersam. You are right that PyArrow datasets right now will be a dead end as we move to support deletion vectors, column mapping, and other new features. I've been meaning to define a new protocol that will allow us to expose something like a PyArrow Dataset, but that we can create a custom implementation of in Rust. This is tracked in apache/arrow#37504

In the near term though, it does seem like it might be appropriate to expose a method like:

def scan(
    self,
    columns: Optional[List[str]] = None,
    filter: Optional[???] = None,
) -> pa.RecordBatchReader:
    ...

And implement that with a Rust-based scanner that supports newer table features.

ion-elgreco · 2023-11-11T12:05:29Z

@wjones127 You mean this: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.scanner ?

wjones127 · 2023-11-11T20:03:57Z

@ion-elgreco That is implemented in C++ and something we don't have control over. But yes, it would have many similarities to that.

aersam added the enhancement New feature or request label Nov 7, 2023

ion-elgreco added the binding/python Issues for the Python package label Nov 22, 2023

ion-elgreco added this to the python v0.20 milestone Nov 26, 2023

rtyler removed this from the python v0.20 milestone Oct 8, 2024
