[pyarrow] Should Table.from_pandas default to preserve_index=False? #2244

danielcompton · 2018-07-10T23:12:09Z

I spent a bit of time tracking down why an __index_level_0__ column was being written to our Parquet files. Eventually we found that it was coming from the from_pandas method on Table. There is an optional parameter preserve_index which defaults to True. Setting it to False dropped the index on transfer and led to drastically smaller file sizes (4MB -> 1MB, 2M rows).

table = pa.Table.from_pandas(df, preserve_index=False)

arrow/python/pyarrow/table.pxi, lines 949 to 950 in 8c9890c:

    def from_pandas(cls, df, Schema schema=None, bint preserve_index=True,
                    nthreads=None, columns=None):

Would it make sense to not preserve the index by default? In many cases (but maybe not all?) the index is not desired and bloats Parquet files. Presumably it would have a similar impact in-memory in the Arrow format.

wesm · 2018-07-10T23:36:19Z

It really depends on the user. We selected the defaults to faithfully store the data represented by the pandas DataFrame, so my view is that we should leave the default as is. This issue could probably be better highlighted in the Python documentation, though.

We could possibly implement an optimization for the case of simple RangeIndex (similar to https://issues.apache.org/jira/browse/ARROW-1639) to reduce the storage footprint

xhochy · 2018-07-11T11:25:53Z

We should definitely preserve the index when we go from/to Pandas by default. This is an essential part of the DataFrame and some users rely on that.

> We could possibly implement an optimization for the case of simple RangeIndex (similar to https://issues.apache.org/jira/browse/ARROW-1639) to reduce the storage footprint

Once we implement one of the delta encodings for integer columns, these columns will be very small. There is then no need for an optimization specifically for these pandas index columns; we only need good detection of whether to use dictionary or delta encoding on a column.
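To illustrate why delta encoding would make such an index column tiny, here is a simplified pure-Python sketch. It only shows the idea (store the first value plus successive differences) and is not Parquet's actual on-disk delta encoding; the function names `delta_encode` and `delta_decode` are hypothetical.

```python
def delta_encode(values):
    """Return (first_value, deltas) for a non-empty list of ints."""
    first = values[0]
    deltas = [b - a for a, b in zip(values, values[1:])]
    return first, deltas


def delta_decode(first, deltas):
    """Rebuild the original list from the first value and the deltas."""
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out


# A default pandas index is just 0, 1, 2, ...
index = list(range(1000))
first, deltas = delta_encode(index)

# Every delta is identical, so the whole column reduces to
# "start at 0, add 1 each row", which compresses extremely well.
distinct_deltas = set(deltas)
print(first, distinct_deltas)  # 0 {1}
```

Because a monotonically increasing index yields a single repeated delta, an encoder that exploits this needs almost no space per row, which is the reason a RangeIndex-specific optimization may become unnecessary.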

danielcompton added a commit to danielcompton/arrow that referenced this issue Jul 11, 2018

Update pyarrow docs to remove index when creating Parquet

84a4cbb

Add documentation on removing the dataframe index when building a Parquet file. Fixes apache#2244

danielcompton mentioned this issue Jul 11, 2018

ARROW-2861: [Python] Add tips about storing pandas DataFrame without index #2248

Closed

wesm closed this as completed Jul 19, 2018

icefed mentioned this issue Apr 23, 2020

Panic when read parquet file with field "__index_level_0__" xitongsys/parquet-go#233

Closed

lfdversluis mentioned this issue Jul 10, 2020

Cannot read parquet file from Pandas databricks/koalas#1645

Closed

amCap1712 mentioned this issue Aug 15, 2021

Parquet spark dumps metabrainz/listenbrainz-server#1545

Merged

rickspencer3 mentioned this issue Sep 2, 2024

avoid __index_level_0 errors rickspencer3/shoots#39

Open
