[Python] Add rename_columns to DataSet #36593

aersam · 2023-07-10T13:17:42Z

Describe the enhancement requested

Dataset has fewer methods than Table, which is fine, of course. However, we often use rename_columns on Table, and it would be really handy for us to have it on Dataset, too. I think it could easily be implemented using replace_schema.

https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset

Component(s)

Python

aersam · 2023-08-02T13:28:51Z

It seems that using replace_schema does not work. The dataset always uses the schema's column names to query the Parquet files, so the column names must match the ones in the physical files. What is really needed is a separation between the physical column name and the logical column name. This would be really great, especially since Parquet is a bit limited in which column names are allowed.
The best solution would be to have a "column mapping" in the fragment that maps the schema column names to physical column names. This would allow querying Parquet files that use different physical column names for the same logical column name. I guess that's a bit complex regarding the filters, but it would still be great.

If we wanted to abstract Apache Iceberg or Delta Lake tables with the dataset, this would be needed (both support this kind of column mapping).

davlee1972 · 2023-10-09T22:12:06Z

Shouldn't dataset() just take the same parameters as to_table()? The difference is that to_table() produces a materialized view while dataset() creates a logical view. This would include columns, whether they are a subset of, or computed/renamed from, the original dataset.

I'm not sure why there is a dataset.scanner class. It looks logical, but you can't perform a join with it against another dataset.

ion-elgreco · 2024-01-05T10:11:05Z

@wjones127 do you think this is something that can be added?

aersam added the Type: enhancement label Jul 10, 2023

github-actions bot added the Component: Python label Jul 10, 2023

westonpace changed the title ~~Add rename_columns to DataSet~~ [Python] Add rename_columns to DataSet Jul 11, 2023

aersam mentioned this issue Aug 2, 2023

Support column mapping delta-io/delta-rs#930

Open

aersam mentioned this issue Nov 7, 2023

Expose a python method to use RecordBatchReader instead of PyArrow Dataset delta-io/delta-rs#1814

Open
