[pyarrow] Should Table.from_pandas default to preserve_index=False? #2244

Closed
danielcompton opened this issue Jul 10, 2018 · 2 comments

Comments

Contributor

danielcompton commented Jul 10, 2018

I spent a bit of time tracking down why an __index_level_0__ column was being written to our Parquet files. Eventually we found that it was coming from the from_pandas method on Table. It has an optional parameter, preserve_index, which defaults to True. Setting it to False dropped the index on conversion and led to drastically smaller file sizes (4MB -> 1MB, 2M rows).

table = pa.Table.from_pandas(df, preserve_index=False)

def from_pandas(cls, df, Schema schema=None, bint preserve_index=True,
                nthreads=None, columns=None):

Would it make sense to not preserve the index by default? In many cases (but maybe not all?) the index is not desired and bloats Parquet files. Presumably it would have a similar impact in-memory in the Arrow format.

Member

wesm commented Jul 10, 2018

It really depends on the user. We selected the defaults to faithfully store the data represented by the pandas DataFrame, so my view is that we should leave the default as is. This issue could probably be better highlighted in the Python documentation, though.

We could possibly implement an optimization for the case of a simple RangeIndex (similar to https://issues.apache.org/jira/browse/ARROW-1639) to reduce the storage footprint.
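To illustrate the idea only: a simple range index can be described by three integers instead of one value per row. The sketch below is plain Python with hypothetical helper names (describe_range, rebuild_index); it is not how pyarrow implements anything, just the general shape such an optimization might take.

```python
def describe_range(values):
    """Return (start, stop, step) if values form an arithmetic range, else None."""
    n = len(values)
    if n == 0:
        return (0, 0, 1)
    if n == 1:
        return (values[0], values[0] + 1, 1)
    step = values[1] - values[0]
    if step == 0:
        return None  # constant values are not a range
    for prev, cur in zip(values, values[1:]):
        if cur - prev != step:
            return None  # irregular spacing: must store the values themselves
    start = values[0]
    return (start, start + n * step, step)


def rebuild_index(descriptor):
    """Recreate the index values from a (start, stop, step) descriptor."""
    return list(range(*descriptor))


default_index = list(range(1000))
descriptor = describe_range(default_index)   # three integers for 1000 rows
restored = rebuild_index(descriptor)
irregular = describe_range([0, 1, 3])        # not a range, so no descriptor
```

When the descriptor exists it could live in metadata, so the index costs almost nothing to store; when it does not, the index would still need to be written as a regular column.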

danielcompton added a commit to danielcompton/arrow that referenced this issue Jul 11, 2018
Add documentation on removing the dataframe index when building a Parquet file.

Fixes apache#2244
Member

xhochy commented Jul 11, 2018

We should definitely preserve the index when we go from/to Pandas by default. This is an essential part of the DataFrame and some users rely on that.

> We could possibly implement an optimization for the case of simple RangeIndex (similar to https://issues.apache.org/jira/browse/ARROW-1639) to reduce the storage footprint

Once we implement one of the delta encodings for integer columns, these columns will be very small. There is then no need for an optimization specific to these pandas index columns; we would only need good detection of whether to use dictionary or delta encoding on a column.
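As a rough illustration of why delta encoding helps here, the sketch below stores the first value plus successive differences. It is a simplified, plain-Python stand-in with hypothetical function names, not the actual Parquet delta encoding, which additionally bit-packs the deltas in blocks.

```python
def delta_encode(values):
    """Store the first value, then the difference between each neighbouring pair."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values by accumulating the differences."""
    values = []
    total = 0
    for i, delta in enumerate(deltas):
        total = delta if i == 0 else total + delta
        values.append(total)
    return values


index_values = list(range(1000))        # stands in for a default integer index
encoded = delta_encode(index_values)
distinct_deltas = set(encoded[1:])      # every difference is identical
round_trip = delta_decode(encoded)
```

For a monotonically increasing index every delta is the same small number, so a real encoder can represent each one in very few bits, which is why the index column would shrink to almost nothing.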
