BUG: Index on empty frame should be RangeIndex #52404

hagenw · 2023-04-04T11:48:19Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.DataFrame({}).columns

Issue Description

The above code returns

Index([], dtype='object')

Expected Behavior

But as stated in https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#empty-dataframes-series-will-now-default-to-have-a-rangeindex it should return instead:

RangeIndex(start=0, stop=0, step=1)

which it does for

pd.DataFrame().columns
pd.DataFrame(None).columns
pd.DataFrame([]).columns
pd.DataFrame(()).columns
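
For anyone wanting to compare the cases side by side, here is a small sketch that runs every empty input mentioned above through the constructor and prints the resulting `columns`. The names `candidates` and `column_types` are only illustrative, and the printed result for `{}` depends on the installed pandas version, so no expected output is shown.

```python
import pandas as pd

# Each entry: (label, positional arguments passed to pd.DataFrame).
# These are the empty inputs listed in this report.
candidates = [
    ("no argument", ()),
    ("None", (None,)),
    ("[]", ([],)),
    ("()", ((),)),
    ("{}", ({},)),
]

column_types = {}
for label, args in candidates:
    columns = pd.DataFrame(*args).columns
    # Record the concrete index class so the cases can be compared.
    column_types[label] = type(columns).__name__
    print(f"{label:12} -> {columns!r}")
```

If every input is treated the same way, all values in `column_types` are identical; a mismatch for `{}` is the behaviour reported here.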

Installed Versions

INSTALLED VERSIONS

commit : 478d340
python : 3.8.16.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-144-generic
Version : #161~18.04.1-Ubuntu SMP Fri Feb 10 15:55:22 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.0.0
numpy : 1.24.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.5.1
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : 6.1.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None


hagenw · 2023-04-04T11:49:34Z

When I try to reproduce it for main I get:

>>> import pandas as pd
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/audeering.local/hwierstorf/git/pandas/pandas/__init__.py", line 22, in <module>
    from pandas.compat import is_numpy_dev as _is_numpy_dev  # pyright: ignore # noqa:F401
  File "/home/audeering.local/hwierstorf/git/pandas/pandas/compat/__init__.py", line 25, in <module>
    from pandas.compat.numpy import (
  File "/home/audeering.local/hwierstorf/git/pandas/pandas/compat/numpy/__init__.py", line 4, in <module>
    from pandas.util.version import Version
  File "/home/audeering.local/hwierstorf/git/pandas/pandas/util/__init__.py", line 2, in <module>
    from pandas.util._decorators import (  # noqa:F401
  File "/home/audeering.local/hwierstorf/git/pandas/pandas/util/_decorators.py", line 14, in <module>
    from pandas._libs.properties import cache_readonly
  File "/home/audeering.local/hwierstorf/git/pandas/pandas/_libs/__init__.py", line 16, in <module>
    import pandas._libs.pandas_parser  # noqa # isort: skip # type: ignore[reportUnusedImport]
ModuleNotFoundError: No module named 'pandas._libs.pandas_parser'

hagenw · 2023-04-04T12:02:01Z

I now followed https://pandas.pydata.org/docs/dev/development/contributing_environment.html#option-2-using-pip and was able to reproduce the issue on the main branch as well:

>>> pd.__version__
'2.1.0.dev0+409.g5a1f280647'
>>> pd.DataFrame({}).columns
Index([], dtype='object')

zmwaris1 · 2023-04-04T14:04:49Z

Hi, I would like to work on this issue. Can you assign this to me and share the details?

MarcoGorelli · 2023-04-04T15:47:51Z

thanks @hagenw for the report! it looks like this works for some initialisations but not others:

In [8]: import pandas as pd
   ...: pd.DataFrame().axes
Out[8]: [RangeIndex(start=0, stop=0, step=1), RangeIndex(start=0, stop=0, step=1)]

In [9]: import pandas as pd
   ...: pd.DataFrame([]).axes
Out[9]: [RangeIndex(start=0, stop=0, step=1), RangeIndex(start=0, stop=0, step=1)]

In [10]: import pandas as pd
    ...: pd.DataFrame({}).axes
Out[10]: [RangeIndex(start=0, stop=0, step=1), Index([], dtype='object')]

cc @topper-123

topper-123 · 2023-04-04T16:15:15Z

This was intentional on my part when I made #49572.

@mroeschke asked in a comment:

Why does an empty dict not produce RangeIndexes?

I argued there that, for a dict d, Series(d) is most closely equivalent to Series(d.values(), index=d.keys()), which for an empty dict is equivalent to Series([], index=[]), i.e. it has an index with dtype object.

Also notice that a non-empty dict can never give a RangeIndex.

But IDK, maybe this just trips people up and it would be better to have empty dicts give a RangeIndex?
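
A minimal sketch of that argument, assuming nothing beyond the public `Series` and `Index` constructors. The variable names are illustrative, and the dict views are converted to lists explicitly to keep the example unambiguous.

```python
import pandas as pd

# For a non-empty dict, building from the dict matches building from
# its values with its keys as the index.
d = {"a": 1, "b": 2}
from_dict = pd.Series(d)
from_parts = pd.Series(list(d.values()), index=list(d.keys()))
print(from_dict.equals(from_parts))

# Even integer keys produce a plain Index built from the keys,
# not a RangeIndex.
int_keyed = pd.Series({0: "x", 1: "y"})
print(type(int_keyed.index).__name__)

# The empty-dict analogue of "index=d.keys()" is an index built from
# an empty list; its dtype is what the comment above refers to.
empty_index = pd.Index([])
print(empty_index.dtype)
```

Under this reading the object-dtype index for an empty dict follows from the keys-as-index rule; the alternative discussed in this thread treats "empty" as a special case instead.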

MarcoGorelli · 2023-04-04T16:25:05Z

thanks for explaining! I think your explanation makes sense, personally I think it'd be fine to keep as-is

mroeschke · 2023-04-04T16:44:14Z

Would special casing be necessary to make DataFrame({}) produce a RangeIndex on both axes? If not, it might be better to forgo semantics and align with user expectation of "empty"

topper-123 · 2023-04-04T16:54:01Z

It's very easy to change, so it's more a question of what we want. I can follow the thought that this can be a bit surprising.

mroeschke · 2023-04-04T16:56:21Z

I would support returning a RangeIndex on both axes to match user expectation and an opportunity to avoid introducing an object dtype axis

phofl · 2023-04-04T17:23:31Z

agree with @mroeschke

This is confusing for most users.

topper-123 · 2023-04-05T06:26:25Z

I've made a PR about this.

jorisvandenbossche · 2023-04-11T08:33:09Z

> I would support returning a RangeIndex on both axes to match user expectation and an opportunity to avoid introducing an object dtype axis

On the other hand, I personally found it confusing to get a RangeIndex for columns, and I actually want to avoid introducing an int64 axis for the columns (if you otherwise always use string column names, using object dtype for an empty columns object is closer to what you want than an int64).

Anyway, I don't necessarily object to the change (consistency with other variants of initialization also has its value), but just wanted to point out that "confusing" / "user expectation" depends quite a bit on your use case (as usual ;)).
Pyarrow will keep returning object dtype for empty columns (which I think makes most sense for pyarrow since our column names are always strings, and which follows from using pandas' Index([])).
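
To make the dtype trade-off concrete, here is a small sketch comparing the two candidate empty column indexes. Passing an explicit `columns` argument is shown only as one possible way to keep object-dtype columns; it is an illustration, not something proposed in this thread, and the variable names are hypothetical.

```python
import pandas as pd

# The two candidates for an empty columns axis discussed above.
range_columns = pd.RangeIndex(0)
object_columns = pd.Index([], dtype=object)

print(range_columns.dtype)   # integer dtype of a RangeIndex
print(object_columns.dtype)  # object dtype, as used for string labels

# Assumption for illustration: a caller who always uses string column
# names can request object-dtype columns explicitly.
frame = pd.DataFrame({}, columns=object_columns)
print(frame.columns)
```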

jorisvandenbossche · 2023-04-11T08:46:58Z

> Pyarrow will keep returning object dtype for empty columns (which I think makes most sense for pyarrow since our column names are always strings,

Small correction: if the pyarrow Table came from a pandas DataFrame roundtrip originally, we actually store in the pandas metadata the dtype of the columns object, and use that information to correctly "restore" the column names. We don't know that it was a RangeIndex though, so if using this information, it comes back as an empty Index[int64]. When there is no pandas metadata, then we will use empty object dtype Index.

topper-123 · 2023-04-11T15:23:59Z

I agreed with you initially, but when I had to explain it, it sounded more complex than I expected. I could personally live with either; I think they each have their advantages, and I now like the explanation "an empty axis on empty data is always a RangeIndex"...

…nge in pandas 2.0.1 (#35031)

### Rationale for this change

Pandas changed the default dtype of the columns object for an empty DataFrame from object dtype to integer RangeIndex (see pandas-dev/pandas#52404). This updates our tests to pass with that change.

* Closes: #15070

Authored-by: Joris Van den Bossche
Signed-off-by: Sutou Kouhei


hagenw added the Bug and Needs Triage labels Apr 4, 2023

DeaMariaLeon added the DataFrame and Constructors labels and removed the Needs Triage label Apr 4, 2023

phofl added this to the 2.0.1 milestone Apr 4, 2023

topper-123 mentioned this issue Apr 4, 2023

API: Series/DataFrame from empty dict should have RangeIndex #52426

Merged

5 tasks

mroeschke closed this as completed in #52426 Apr 10, 2023

jorisvandenbossche mentioned this issue Apr 11, 2023

GH-15070: [Python][CI] Update pandas test for empty columns dtype change in pandas 2.0.1 apache/arrow#35031

Merged

ivirshup mentioned this issue May 5, 2023

BUG: Behavior change on DataFrame instantiation from 2.0.0 2.0.1 #53100

Closed

3 tasks
