[Python] Add rename_columns to DataSet #36593
Comments
It seems that using replace_schema does not work. The dataset always uses the schema's column names to query the Parquet files, meaning the column names must match the ones in the physical files. What is really needed is a separation between physical column names and logical column names. This would be really great, especially since Parquet is a bit limited in what column names are allowed. If we want to abstract Apache Iceberg or Delta Lake tables with the dataset, this would be needed (both support this kind of column mapping).
Shouldn't dataset() just take the same parameters as to_table()? The difference is that to_table() produces a materialized view while dataset() creates a logical view. This would include the columns, whether they are a subset of the original dataset or computed/renamed from it. I'm not sure why there is a separate scanner class (created via Dataset.scanner()). It looks like a logical view, but you can't perform a join with it against another dataset.
@wjones127 do you think this is something that can be added?
Describe the enhancement requested
Dataset has fewer methods than Table, which is fine, of course. However, we often use rename_columns on Table, and it would be really handy for us to have it on Dataset, too. I think it could easily be implemented using replace_schema.
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset
Component(s)
Python