Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Please consider checking ndarray types using the array interface #44617

Closed
3 tasks done
NeilGirdhar opened this issue Nov 25, 2021 · 14 comments
Closed
3 tasks done

Please consider checking ndarray types using the array interface #44617

NeilGirdhar opened this issue Nov 25, 2021 · 14 comments
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@NeilGirdhar
Copy link

NeilGirdhar commented Nov 25, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

.

Issue Description

Please see here for a description: jax-ml/jax#8701

In short, there are various places in Pandas where isinstance(x, np.ndarray) is used. This only checks subclasses. With Numpy arrays, Pandas would ideally check for the array interface (using __array__ and __array_interface__) so that Jax and other Numpy array types work.

Expected Behavior

.

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.9.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-40-generic
Version : #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_CA.UTF-8
LOCALE : en_CA.UTF-8

pandas : 1.3.4
numpy : 1.21.4
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 59.1.1
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.29.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.5.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.2
sqlalchemy : None
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : None

@NeilGirdhar NeilGirdhar added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 25, 2021
@NeilGirdhar
Copy link
Author

NeilGirdhar commented Nov 25, 2021

Maybe numpy should actually add an ABC

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 1, 2021

cc @jbrockmendel

Also related issue: #44616 (but limited to just the DataFrame constructor)

@NeilGirdhar
Copy link
Author

NeilGirdhar commented Dec 1, 2021

@jorisvandenbossche Yes, that looks like the same issue. Probably, pandas isn't interpreting the torch tensor as a numpy array, so they would benefit from the ABC as well.

@jbrockmendel
Copy link
Member

Checking for an __array__ attribute in constructor-like functions isn't too hard. Checking for it everywhere seems pretty daunting, but I would have no problems with PRs doing this. The places in the cython code where we check util.is_array may be dicey, since in some of them we actually do rely on having an ndarray object.

@NeilGirdhar
Copy link
Author

How would you feel about adding an AbstractArray ABC like numpy/numpy#20459 and using it in the constructor-like functions? Maybe numpy will eventually upstream it, but it would solve the immediate problems in a somewhat elegant way?

@jbrockmendel
Copy link
Member

How would you feel about adding an AbstractArray ABC like numpy/numpy#20459 and using it in the constructor-like functions?

First thing that comes to mind is perf: for our internal ABCFoo classes isinstance(obj, ABCDataFrame) is about 2-3x slower than isinstance(obj, DataFrame) last I checked. The numbers are small, but it's worth being aware of.

Second is whether the idea would be to put your ndarray-like object directly in a Series/DataFrame or to first cast it to a for-realsies-ndarray. The former option seems liable to produce headaches.

@NeilGirdhar
Copy link
Author

Good points. I'm not sure what's best since I haven't done any experiments. It may be that the jax.numpy.ndarray types are clever enough to do the right thing even when they're in DataFrames.

@jbrockmendel
Copy link
Member

might be able to implement an ExtensionArray backed by one of these objects?

@NeilGirdhar
Copy link
Author

I'm not sure what you mean?

@NeilGirdhar
Copy link
Author

NeilGirdhar commented Dec 3, 2021

Given further discussion on numpy/numpy#20459, it appears that I was mistaken about using an ABC. Do you think it would be better to block the construction of Pandas arrays with jax arrays and pytorch tensors, etc.? In other words, if you detect something that's array-like, but not an np.ndarray, raise an exception?

I'm not sure, but maybe when the array API standard is complete and adopted, there may be a more elegant solution. https://data-apis.org/array-api/latest/

@jbrockmendel
Copy link
Member

might be able to implement an ExtensionArray backed by one of these objects?

I'm not sure what you mean?

ExtensionArrays (EAs) exist largely so that downstream libraries can implement subclasses and store non-np.ndarray objects directly in a Series/DataFrame. So if you don't want your arraylike object to be cast to an ndarray, implementing an EA is encouraged.

Do you think it would be better to block the construction of Pandas arrays with jax arrays and pytorch tensors, etc.? In other words, if you detect something that's array-like, but not an np.ndarray, raise an exception?

Well, no. Lots of other things fall into that category that currently work and we wouldn't want to raise on.

@NeilGirdhar
Copy link
Author

Okay, thanks! I'll close this then. I guess I'll have to be extra careful when I pass arrays that may be Jax arrays into Pandas.

@jbrockmendel
Copy link
Member

are the cases of interest passing them to the Series/DataFrame constructors? and is np.asarray(your_obj) efficient? if so, a fix suggested in #44616 will likely solve this

@NeilGirdhar
Copy link
Author

NeilGirdhar commented Dec 5, 2021

@jbrockmendel Yes, conversion to array is efficient. I commented on that issue though because one of the fixes of the two might work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member
Projects
None yet
Development

No branches or pull requests

3 participants