BUG: Empty dataframe with valid index object is not being read correctly via parquet reader #37897

galipremsagar · 2020-11-16T18:18:42Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

In[38]: df = pd.DataFrame(index=pd.RangeIndex(0, 10, 1))
In[39]: df.to_parquet('a')
In[40]: pd.read_parquet('a')
Out[40]: 
Empty DataFrame
Columns: []
Index: []
In[41]: df.to_parquet('a', index=True)
In[42]: pd.read_parquet('a')
Out[42]: 
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In[43]: pd.read_parquet('a').index
Out[43]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

Problem description

Whether index is None or True, the parquet reader should return the original RangeIndex, not an empty index (as happens with index=None) or an Int64Index (as happens with index=True).

Expected Output

We should get a RangeIndex back when reading the parquet file a.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 67a3d42
python : 3.7.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-53-generic
Version : #59-Ubuntu SMP Wed Oct 21 09:38:44 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.4
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip : 20.2.4
setuptools : 49.6.0.post20201009
Cython : 0.29.21
pytest : 6.1.2
hypothesis : 5.41.1
sphinx : 3.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.4
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.51.2

This could be related to #37896, but it appears the parquet reader is also not able to retrieve any Index at all when index is None. Hence filing this as a separate issue.


jreback · 2020-11-16T18:54:18Z

likely a pyarrow issue
cc @jorisvandenbossche

jorisvandenbossche · 2020-11-16T20:07:09Z

(thanks for the ping, will take a look later this week)

jorisvandenbossche · 2020-11-18T13:49:57Z

It's indeed a pyarrow issue, though not directly related to Parquet itself: the pandas <-> pyarrow roundtrip is already failing for this corner case:

In [33]: df = pd.DataFrame(index=pd.RangeIndex(0, 10, 1))

In [34]: df
Out[34]: 
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [35]: df.shape
Out[35]: (10, 0)

In [36]: table = pa.table(df)

In [37]: table.to_pandas()
Out[37]: 
Empty DataFrame
Columns: []
Index: []

In [38]: table.to_pandas().shape
Out[38]: (0, 0)

I opened https://issues.apache.org/jira/browse/ARROW-10643 for this, contributions to fix this are always welcome!

Since it is an issue on the pyarrow side, closing this.

wence- · 2022-11-25T16:30:07Z

While the underlying pyarrow issue was fixed, the bug in the original report persists: to_parquet/read_parquet do not round-trip a RangeIndex correctly. I suspect this is because to_parquet doesn't preserve range indices:

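from io import BytesIO

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq  # import the submodule explicitly

buf = BytesIO()

df = pd.DataFrame({"a": [1, 2, 3]}, index=pd.RangeIndex(0, 3))
df.to_parquet(buf, index=True)
table = pq.read_table(buf)  # table as written by pandas

pa_table = pa.table(df)  # table from a direct pandas -> pyarrow conversion

assert table == pa_table  # fails: the comparison evaluates to False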

wence- · 2024-03-22T14:43:27Z

parquet case is apache/arrow#40743

galipremsagar added Bug Needs Triage labels Nov 16, 2020

jorisvandenbossche added IO Parquet and removed Needs Triage labels Nov 16, 2020

jorisvandenbossche closed this as completed Nov 18, 2020

jorisvandenbossche added this to the No action milestone Nov 18, 2020

wence- mentioned this issue Nov 25, 2022

[BUG] read_parquet/to_parquet don't handle empty dataframes correctly rapidsai/cudf#12243

Closed
