BUG: Index on empty frame should be RangeIndex #52404
Comments
When I try to reproduce it for
>>> import pandas as pd
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/audeering.local/hwierstorf/git/pandas/pandas/__init__.py", line 22, in <module>
from pandas.compat import is_numpy_dev as _is_numpy_dev # pyright: ignore # noqa:F401
File "/home/audeering.local/hwierstorf/git/pandas/pandas/compat/__init__.py", line 25, in <module>
from pandas.compat.numpy import (
File "/home/audeering.local/hwierstorf/git/pandas/pandas/compat/numpy/__init__.py", line 4, in <module>
from pandas.util.version import Version
File "/home/audeering.local/hwierstorf/git/pandas/pandas/util/__init__.py", line 2, in <module>
from pandas.util._decorators import ( # noqa:F401
File "/home/audeering.local/hwierstorf/git/pandas/pandas/util/_decorators.py", line 14, in <module>
from pandas._libs.properties import cache_readonly
File "/home/audeering.local/hwierstorf/git/pandas/pandas/_libs/__init__.py", line 16, in <module>
import pandas._libs.pandas_parser # noqa # isort: skip # type: ignore[reportUnusedImport]
ModuleNotFoundError: No module named 'pandas._libs.pandas_parser' |
I now followed https://pandas.pydata.org/docs/dev/development/contributing_environment.html#option-2-using-pip and was able to reproduce the issue on the
>>> pd.__version__
'2.1.0.dev0+409.g5a1f280647'
>>> pd.DataFrame({}).columns
Index([], dtype='object') |
Hi, I would like to work on this issue. Can you assign this to me and share the details? |
thanks @hagenw for the report! it looks like this works for some initialisations but not others:
In [8]: import pandas as pd
...: pd.DataFrame().axes
Out[8]: [RangeIndex(start=0, stop=0, step=1), RangeIndex(start=0, stop=0, step=1)]
In [9]: import pandas as pd
...: pd.DataFrame([]).axes
Out[9]: [RangeIndex(start=0, stop=0, step=1), RangeIndex(start=0, stop=0, step=1)]
In [10]: import pandas as pd
...: pd.DataFrame({}).axes
Out[10]: [RangeIndex(start=0, stop=0, step=1), Index([], dtype='object')]
cc @topper-123 |
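For anyone who wants to check this on their own install, the three initialisations above can be compared in one small script. This is only a sketch: the helper name `describe_axes` and the `report` dict are made up for illustration, and the printed axis types depend on which pandas version is installed, so no expected output is given.

```python
import pandas as pd

# The three empty initialisations compared in the comment above.
# The dict keys are only labels used for printing.
empty_inputs = {
    "DataFrame()": pd.DataFrame(),
    "DataFrame([])": pd.DataFrame([]),
    "DataFrame({})": pd.DataFrame({}),
}


def describe_axes(df):
    """Return (index type, columns type, columns dtype) as strings (hypothetical helper)."""
    return (
        type(df.index).__name__,
        type(df.columns).__name__,
        str(df.columns.dtype),
    )


report = {label: describe_axes(df) for label, df in empty_inputs.items()}

# Print the version alongside the result, since the behaviour is version dependent.
print(pd.__version__)
for label, row in report.items():
    print(label, row)
```

Comparing the `DataFrame({})` row with the other two shows whether the install at hand treats an empty dict differently from the other empty inputs.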
This was intentional on my part when I made #49572. @mroeschke asked in a comment:
I argued there that for a dict Also notice that a non-empty dict can never give a RangeIndex. But IDK, maybe this just trips people up and it would be better to have empty dicts to give a |
thanks for explaining! I think your explanation makes sense, personally I think it'd be fine to keep as-is |
Would special casing be necessary to make |
It's very easy to change, so it's more a question of what we want. I can follow the thought that this can be a bit surprising. |
I would support returning a RangeIndex on both axes to match user expectation and an opportunity to avoid introducing an |
agree with @mroeschke This is confusing for most users. |
I've made a PR about this. |
On the other hand, I personally found it confusing to get a RangeIndex for columns, and I actually want to avoid introducing an int64 axis for the columns (if you otherwise always use string column names, an object-dtype empty columns object is closer to what you want than an int64 one). Anyway, I don't necessarily object to the change (consistency with other variants of initialization also has its value), but just wanted to point out that "confusing" / "user expectation" depends quite a bit on your use case (as usual ;)). |
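To make the int64 versus object point concrete, here is a minimal sketch. It assumes that passing the columns explicitly is an acceptable workaround for code that needs an object-dtype columns axis on an empty frame; that is one possible approach, not something proposed in this thread.

```python
import pandas as pd

# An empty RangeIndex reports an integer dtype, which is the "int64 axis"
# concern raised above.
range_cols = pd.RangeIndex(0)

# Passing the columns explicitly sidesteps the default: the empty frame
# keeps the object-dtype Index it was given.
df = pd.DataFrame(columns=pd.Index([], dtype=object))

print(range_cols.dtype, df.columns.dtype)
```

With explicit columns the result does not depend on which default the constructor picks for empty input.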
Small correction: if the pyarrow Table came from a pandas DataFrame roundtrip originally, we actually store in the pandas metadata the dtype of the columns object, and use that information to correctly "restore" the column names. We don't know that it was a RangeIndex though, so if using this information, it comes back as an empty Index[int64]. When there is no pandas metadata, then we will use empty object dtype Index. |
I agreed with you initially, but when I had to explain it, it sounded maybe more complex than I expected. But I could personally live with both; I think they each have their advantages, and I now like the explanation "an empty axes on empty data is always a RangeIndex"... |
…nge in pandas 2.0.1 (#35031) ### Rationale for this change Pandas changed the default dtype of the columns object for an empty DataFrame from object dtype to integer RangeIndex (see pandas-dev/pandas#52404). This updates our tests to pass with that change. * Closes: #15070 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
The above code returns
Expected Behavior
But as stated in https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#empty-dataframes-series-will-now-default-to-have-a-rangeindex it should return instead:
which it does for
Installed Versions
INSTALLED VERSIONS
commit : 478d340
python : 3.8.16.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-144-generic
Version : #161~18.04.1-Ubuntu SMP Fri Feb 10 15:55:22 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 2.0.0
numpy : 1.24.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.5.1
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : 6.1.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None