BUG: Index constructor should not allow an ndarray with ndim > 2 #27125

Closed
jschendel opened this issue Jun 29, 2019 · 2 comments · Fixed by #30588
Labels
Bug Constructors Series/DataFrame/Index/pd.array Constructors Index Related to the Index class or subclasses

@jschendel
Member

Code Sample, a copy-pastable example if possible

On master:

In [1]: import numpy as np; import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+833.gad18ea35b'

In [2]: pd.Index(np.arange(8).reshape(2, 2, 2))
Out[2]: Int64Index([[[0, 1], [2, 3]], [[4, 5], [6, 7]]], dtype='int64')

If the first dimension is greater than 2, the repr appears flattened, but the underlying values are not actually flattened:

In [3]: pd.Index(np.arange(12).reshape(3, 2, 2))
Out[3]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64')

In [4]: _.values
Out[4]: 
array([[[ 0,  1],
        [ 2,  3]],

       [[ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11]]])

Problem description

The Index constructor accepts ndarrays with ndim > 2 and will even convert them to specialized subclasses, e.g. Int64Index.

Expected Output

I'd expect the operations above to raise, or at the very least to result in an object-dtype Index, though I'd prefer raising.
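
To illustrate the "raise" option, here is a minimal sketch of a dimensionality check. The helper name `validate_index_data` and the error message are hypothetical and are not taken from the pandas code base; the sketch only uses NumPy and does not construct an Index.

```python
import numpy as np


def validate_index_data(data):
    """Hypothetical helper: reject input with more than one dimension."""
    arr = np.asarray(data)
    if arr.ndim > 1:
        # Assumed error message, for illustration only.
        raise ValueError(f"Index data must be 1-dimensional, got ndim={arr.ndim}")
    return arr


# Try a 1-D array and the two 3-D shapes used in the report above.
for shape in [(8,), (2, 2, 2), (3, 2, 2)]:
    data = np.arange(int(np.prod(shape))).reshape(shape)
    try:
        validate_index_data(data)
        print(shape, "accepted")
    except ValueError as exc:
        print(shape, "rejected:", exc)
```

With a check like this, both 3-D inputs would be rejected up front instead of producing an Index whose repr and `.values` disagree.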

xref #17246

Output of pd.show_versions()

INSTALLED VERSIONS

commit : ad18ea3
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.14-041914-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.0.dev0+833.gad18ea35b
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 40.8.0
Cython : 0.29.10
pytest : 4.6.2
hypothesis : 4.23.6
sphinx : 1.8.5
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.3.3
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : 0.3.0
gcsfs : None
lxml.etree : 4.3.3
matplotlib : 3.1.0
numexpr : 2.6.9
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : 0.11.1
pytables : None
s3fs : 0.2.1
scipy : 1.2.1
sqlalchemy : 1.3.4
tables : 3.5.2
xarray : 0.12.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8

@jschendel jschendel added Bug Index Related to the Index class or subclasses labels Jun 29, 2019
@jschendel jschendel added this to the Contributions Welcome milestone Jun 29, 2019
@TomAugspurger
Contributor

TomAugspurger commented Jul 2, 2019

@jschendel this is a problem for ndim == 2 as well, right? Not just > 2? E.g. this looks wrong:

edit: sorry this first example is from 0.24.2. Ignore it.

# edit: ignore this
In [21]: pd.Index(np.atleast_2d(np.array([0, 1, 2], dtype=int)).T)
Out[21]: Int64Index([[0], [1], [2]], dtype='int64')

And we had a behavior change for datetime (and probably timedelta) data

# 0.24.2
In [23]: pd.Index(np.atleast_2d(np.array([], dtype='datetime64[ns]')).T)
Out[23]: DatetimeIndex([], dtype='datetime64[ns]', freq=None)
# master
In [2]: pd.Index(np.atleast_2d(np.array([], dtype='datetime64[ns]')).T)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-dde438226c9f> in <module>
----> 1 pd.Index(np.atleast_2d(np.array([], dtype='datetime64[ns]')).T)

~/sandbox/pandas/pandas/core/indexes/base.py in __new__(cls, data, dtype, copy, name, fastpath, tupleize_cols, **kwargs)
    293             else:
    294                 result = DatetimeIndex(data, copy=copy, name=name,
--> 295                                        dtype=dtype, **kwargs)
    296                 return result
    297

~/sandbox/pandas/pandas/core/indexes/datetimes.py in __new__(cls, data, freq, start, end, periods, tz, normalize, closed, ambiguous, dayfirst, yearfirst, dtype, copy, name, verify_integrity)
    298
    299         subarr = cls._simple_new(dtarr, name=name,
--> 300                                  freq=dtarr.freq, tz=dtarr.tz)
    301         return subarr
    302

~/sandbox/pandas/pandas/core/indexes/datetimes.py in _simple_new(cls, values, name, freq, tz, dtype)
    314                 dtype = _NS_DTYPE
    315
--> 316             values = DatetimeArray(values, freq=freq, dtype=dtype)
    317             tz = values.tz
    318             freq = values.freq

~/sandbox/pandas/pandas/core/arrays/datetimes.py in __init__(self, values, dtype, freq, copy)
    311             raise ValueError(msg.format(type(values).__name__))
    312         if values.ndim != 1:
--> 313             raise ValueError("Only 1-dimensional input arrays are supported.")
    314
    315         if values.dtype == 'i8':

ValueError: Only 1-dimensional input arrays are supported.

We'll at least want a release note for the breaking change. We probably want to raise there, right @jbrockmendel? Maybe put in a deprecation if we have time (can do after the RC)
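
For reference, a small NumPy-only sketch of why the datetime example above trips the `values.ndim != 1` check even though the array is empty. The variable names are illustrative, and the `ravel()` step is just one possible caller-side workaround, not something proposed in this thread.

```python
import numpy as np

# Same input as in the example above: an empty datetime64[ns] array
# promoted to 2-D and transposed.
empty = np.array([], dtype="datetime64[ns]")
column = np.atleast_2d(empty).T

print(empty.shape, empty.ndim)    # 1-D input
print(column.shape, column.ndim)  # 2-D column, still holding zero elements

# Possible workaround on the caller's side: flatten explicitly first.
flat = column.ravel()
print(flat.shape, flat.ndim)
```

The array has zero elements either way, so the difference between 0.24.2 and master comes down purely to the number of dimensions, not the data.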

@jorisvandenbossche
Member

Related to explicitly passing 2D data to the Index constructor, there is currently another way to end up with such an "invalid 2D Index" object: an indexing operation (see #27775).

In [55]: idx = pd.Index([1, 2, 3])

In [56]: idx[:, None]
Out[56]: Int64Index([1, 2, 3], dtype='int64')

In [57]: idx[:, None].values
Out[57]: 
array([[1],
       [2],
       [3]])

So for this case we also need to decide what to do. Related discussion here: #27775 (comment)
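
The shape of `.values` above matches what plain NumPy does with the same indexer, which suggests (as an assumption, not something verified against the pandas source here) that the indexer is passed through to the underlying ndarray. A minimal NumPy-only sketch:

```python
import numpy as np

values = np.array([1, 2, 3])

# Indexing with None (np.newaxis) inserts a new axis of length 1,
# turning the 1-D array into a 2-D column.
expanded = values[:, None]

print(values.shape, values.ndim)
print(expanded.shape, expanded.ndim)
print(expanded)
```

So any decision about 2D data in the constructor probably also needs a matching decision for indexers that add dimensions.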
