[pyarrow] Should Table.from_pandas default to preserve_index=False? #2244
It really depends on the user. We selected the defaults to faithfully store the data represented by the pandas DataFrame, so my view is that we should leave the default as is. This issue could probably be better highlighted in the Python documentation, though. We could possibly implement an optimization for the simple RangeIndex case (similar to https://issues.apache.org/jira/browse/ARROW-1639) to reduce the storage footprint.
Add documentation on removing the dataframe index when building a Parquet file. Fixes apache#2244
We should definitely preserve the index when we go from/to pandas by default. This is an essential part of the DataFrame, and some users rely on it.
Once we implement one of the delta encodings for integer columns, these columns will be very small. There is then no need for an optimization specific to these pandas index columns; we only need good detection of whether to use dictionary or delta encoding on a column.
I spent a bit of time tracking down why an `__index_level_0__` column was being written to our Parquet files. Eventually we found that it was coming from the `from_pandas` method on `Table`. There is an optional parameter `preserve_index` which defaults to `True`. Setting it to `False` dropped the index on transfer and led to drastically smaller file sizes (4 MB -> 1 MB, 2M rows).

arrow/python/pyarrow/table.pxi, lines 949 to 950 in 8c9890c
Would it make sense to not preserve the index by default? In many cases (but maybe not all?) the index is not desired and bloats Parquet files. Presumably it has a similar footprint in memory in the Arrow format.