[Python] Add rename_columns to DataSet #36593

Open
aersam opened this issue Jul 10, 2023 · 3 comments


aersam commented Jul 10, 2023

Describe the enhancement requested

Dataset has fewer methods than Table, which is fine, of course. However, we often use rename_columns on Table, and it would be really handy for us to have it on Dataset, too. I think it could easily be implemented using replace_schema.

https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset

Component(s)

Python

@westonpace westonpace changed the title Add rename_columns to DataSet [Python] Add rename_columns to DataSet Jul 11, 2023

aersam commented Aug 2, 2023

It seems that using replace_schema does not work. The dataset always uses the schema's column names to query the Parquet files, so the column names must match the ones in the physical files. What is really needed is a separation between physical column names and logical column names. This would be really great, especially since Parquet is a bit limited in which column names are allowed.
The best solution would be a "column mapping" in the fragment that maps the schema column names to physical column names. This would allow querying Parquet files that use different physical column names for the same logical column name. I guess that's a bit complex with regard to filters, but it would still be great.

If we wanted to abstract Apache Iceberg or Delta Lake tables with the dataset, this would be needed (both support this kind of column mapping).
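
The per-fragment mapping described above could look roughly like the following. This is only a sketch in plain Python, not an existing pyarrow API: `FRAGMENT_MAPPINGS`, `to_physical`, the file names, and the `(column, operator, value)` filter tuples are all hypothetical and chosen for illustration. It shows why filters add complexity: they have to be translated per fragment as well.

```python
from typing import Dict, List, Tuple

# Hypothetical filter representation: (logical column, operator, value).
Filter = Tuple[str, str, object]

# Hypothetical per-fragment mappings: logical name -> physical name in that file.
FRAGMENT_MAPPINGS: Dict[str, Dict[str, str]] = {
    "part-0.parquet": {"customer id": "col_1", "amount": "col_2"},
    "part-1.parquet": {"customer id": "customer_id", "amount": "amt"},
}


def to_physical(
    mapping: Dict[str, str],
    columns: List[str],
    filters: List[Filter],
) -> Tuple[List[str], List[Filter]]:
    """Translate logical column names and filters to one fragment's physical names."""
    referenced = list(columns) + [col for col, _, _ in filters]
    missing = [name for name in referenced if name not in mapping]
    if missing:
        raise KeyError(f"no physical column for logical name(s): {missing}")

    physical_columns = [mapping[name] for name in columns]
    # Filters must be rewritten too, since each file only knows its physical names.
    physical_filters = [(mapping[col], op, value) for col, op, value in filters]
    return physical_columns, physical_filters


# The same logical query resolves differently for each fragment.
query_columns = ["amount"]
query_filters: List[Filter] = [("customer id", "=", 7)]
per_fragment = {
    path: to_physical(mapping, query_columns, query_filters)
    for path, mapping in FRAGMENT_MAPPINGS.items()
}
```

Here the single logical query on "amount" filtered by "customer id" becomes a read of `col_2` filtered on `col_1` in one file and a read of `amt` filtered on `customer_id` in the other.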


davlee1972 commented Oct 9, 2023

Shouldn't dataset() just take the same parameters as to_table()? The difference is that to_table() produces a materialized view while dataset() creates a logical view. This would include columns, whether they are a subset or computed/renamed from the original dataset.

I'm not sure why there is a dataset.scanner class. It looks logical, but you can't perform a join with it against another dataset.

@ion-elgreco

@wjones127 do you think this is something that can be added?
