Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Constructing from an unbound sequence doesn't raise/hangs #13049

Closed
shwina opened this issue Apr 3, 2023 · 6 comments · Fixed by #13799
Closed

[BUG] Constructing from an unbound sequence doesn't raise/hangs #13049

shwina opened this issue Apr 3, 2023 · 6 comments · Fixed by #13799
Assignees
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments

@shwina
Copy link
Contributor

shwina commented Apr 3, 2023

The following snippet never completes:

In [1]: import cudf
In [2]: class A:
   ...:     def __getitem__(self, key):
   ...:         return 3
   ...: 
In [3] cudf.Series(A())

In Pandas, a Series of object data type is returned:

In [4]: pd.Series(A())
Out[4]: 
0    <__main__.A object at 0x7f0a946afd90>
dtype: object

At the very least, I think cuDF should be able to detect and raise (ValueError) in this pathological case.

@shwina shwina added bug Something isn't working Needs Triage Need team to review and classify python Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify python labels Apr 3, 2023
@bdice
Copy link
Contributor

bdice commented Apr 3, 2023

Is the title of this issue accurate? Unbound sequences also cause pandas to fail:

import pandas as pd

def count_forever():
    i = 1
    while True:
        yield i
        i +=1

pd.Series(count_forever())  # Does not return.

I believe this is actually a special case where __getitem__ is being called by cudf but not pandas, which simply uses the object value as a singleton item in the same way as pd.Series(3) returns the same as pd.Series([3]).

pandas seems to get stuck in the function maybe_iterable_to_list, which should give us some insight into how these cases are handled differently: https://github.com/pandas-dev/pandas/blob/e6da6a6a177597490e1f628b6bbe0c848986bd8f/pandas/core/common.py#L297

The class A above is not an abc.Iterable because it does not implement __iter__ (only __getitem__), nor is it an abc.Sized or abc.Sequence because it lacks __len__. count_forever is iterable but not sized.

@shwina
Copy link
Contributor Author

shwina commented Apr 5, 2023

Thanks, @bdice. Happy to chnage the title to something else. I used "unbound sequence" loosely here to refer to an object that is effectively a Sequence minus __len__.

@bdice
Copy link
Contributor

bdice commented Apr 5, 2023

I think the issue is a bit similar to #13056. I think this should be treated as an issue with constructing a Series from a scalar (here, the scalar is an instance of an object that defines getitem!) and not an issue with unbound data. Here, the getitem method should never be called and we should error at the point that instances of A have no supported GPU dtype.

Getting into the question of “is this iterable unbound” feels like solving the halting problem. 😅 I don’t think we must be any smarter than pandas at this task. If an infinite loop is provided, that is a user problem.

@shwina
Copy link
Contributor Author

shwina commented Apr 5, 2023

I don’t think we must be any smarter than pandas at this task.

This is the problem :( Pandas does return here. And since we cannot, we should probably raise.

@bdice
Copy link
Contributor

bdice commented Apr 5, 2023

Yes. And it should be an “incompatible type” error in your example, not an “unbound sequence” error.

@bdice
Copy link
Contributor

bdice commented Apr 6, 2023

I discussed offline with @shwina. We concluded we should check that the input passes isinstance(x, Iterable) or isinstance(x, Sequence) before attempting to make a column in pyarrow. Also note that pandas.api.types.infer_dtype(A()) gives an infinite loop.

We also found a bug in PyArrow which I reported upstream: apache/arrow#34944

@galipremsagar galipremsagar self-assigned this Jul 21, 2023
rapids-bot bot pushed a commit that referenced this issue Aug 2, 2023
Fixes: #13049 
This PR allows errors from pyarrow to be propagated when an un-bounded sequence is passed to `pa.array` constructor.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #13799
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants