[BUG] Assignment of string list to column doesn't work #11944

Open
Tracked by #12793
Ullar-Kask opened this issue Oct 19, 2022 · 8 comments
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments

@Ullar-Kask

Describe the bug

This does not work:

df.loc[df['column'] =='value', 'column2'] = ['0','1']

TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU matrix, consider using .to_cupy()
To explicitly construct a host matrix, consider using .to_numpy().

Integers do work:
df.loc[df['column']=='value', 'column2'] = [0,1]

Both work in pandas.
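
For reference, a minimal pandas sketch of the behavior the report expects. The frame contents are made up for illustration (the report does not show how df was built); exactly two rows match the mask, so the two-element list lines up with them.

```python
import pandas as pd

# Illustrative data only: two rows match 'value', one does not.
df = pd.DataFrame({
    "column": ["value", "other", "value"],
    "column2": ["x", "y", "z"],
})

mask = df["column"] == "value"

# Assign a list of strings to the masked rows; the list length must
# equal the number of rows selected by the mask.
df.loc[mask, "column2"] = ["0", "1"]

print(df)
```

The unmasked row keeps its original value, and the masked rows receive the list items in order.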

Rapids 22.08
Ubuntu 20

@Ullar-Kask Ullar-Kask added Needs Triage Need team to review and classify bug Something isn't working labels Oct 19, 2022
@shwina shwina added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Oct 19, 2022
@vyasr
Contributor

vyasr commented Oct 20, 2022

@Ullar-Kask could you include a MWE, specifically how you constructed your df? Here is what I observe on the latest version of cudf, which looks like the same issue as #11298:

>>> df = cudf.DataFrame({'a': ['c', 'd', 'e'], 'b': ['x', 'y', 'z']})
>>> df.loc[df['a'] == 'c', 'b'] = ['0']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nvme/0/vyasr/rapids/cudf/python/cudf/cudf/core/dataframe.py", line 149, in __setitem__
    return self._setitem_tuple_arg(key, value)
  File "/nvme/0/vyasr/rapids/compose/etc/conda/cuda_11.5/envs/rapids/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/nvme/0/vyasr/rapids/cudf/python/cudf/cudf/core/dataframe.py", line 380, in _setitem_tuple_arg
    value = cupy.asarray(value)
  File "/nvme/0/vyasr/rapids/compose/etc/conda/cuda_11.5/envs/rapids/lib/python3.8/site-packages/cupy/_creation/from_data.py", line 76, in asarray
    return _core.array(a, dtype, False, order)
  File "cupy/_core/core.pyx", line 2357, in cupy._core.core.array
  File "cupy/_core/core.pyx", line 2381, in cupy._core.core.array
  File "cupy/_core/core.pyx", line 2509, in cupy._core.core._array_default
ValueError: Unsupported dtype <U1

@quasiben
Member

@wence- is also working on some potentially related fixes in #11904

@shwina
Contributor

shwina commented Oct 20, 2022

As a note, I believe this will work:

df.loc[df['column'] =='value', 'column2'] = cudf.Scalar(['0','1'])

(this is still a bug -- we shouldn't require the user to construct a Scalar manually)

@Ullar-Kask
Author

@vyasr I've tried various ways to circumvent the error, e.g.

df.loc[df['column'] =='value', 'column2'] = ['0','1']
or,

neg_lst = random.choices(products_lst, k=len(lst))
df.loc[df['column'] =='value', 'column2'] = neg_lst 

or,
df.loc[df['column'] =='value', 'column2'] = cudf.Series(neg_lst)

but all fail with slightly different error msgs:
ValueError: Unsupported dtype <U1
or
ValueError: Unsupported dtype <U6
or

TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU matrix, consider using .to_cupy()
To explicitly construct a host matrix, consider using .to_numpy().

@Ullar-Kask
Author

df = cudf.DataFrame({'a': ['c', 'd', 'e'], 'b': ['x', 'y', 'z']})

BTW, this works:
df['c'] = cudf.Series(['1', '2', '3'])

This also works:
df.loc[:,'d'] = cudf.Series(['1', '2', '3'])

But the last statement fails when re-executed:

TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU matrix, consider using .to_cupy()
To explicitly construct a host matrix, consider using .to_numpy().

@wence-
Contributor

wence- commented Oct 21, 2022

Here's a complete example:

import cudf
df = cudf.DataFrame(data={"a": ["yes", "no"], "b": [["l1", "l2"], ["c", "d"]]})

df.loc[df.a == "yes", "b"] = ["hello"]

This goes through _DataFrameLocIndexer._setitem_tuple_arg. For this value, is_scalar returns False and isinstance(value, DataFrame) also returns False, so we try to turn the value (["hello"] here, ["0", "1"] in the original report) into a cupy array, which fails because cupy can't handle str dtypes.
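
The dtype named in the error messages can be reproduced on the host with NumPy alone. A small sketch, assuming cupy.asarray infers dtypes for plain Python lists the same way numpy.asarray does (the traceback above suggests this, but it is an inference); inferred_dtype is a hypothetical helper, not part of cudf or cupy.

```python
import numpy as np


def inferred_dtype(value):
    """Return the dtype NumPy infers for a plain Python list (hypothetical helper)."""
    return np.asarray(value).dtype


str_short = inferred_dtype(["0", "1"])       # unicode; longest item has 1 character
str_long = inferred_dtype(["ab", "abcdef"])  # unicode; longest item has 6 characters
ints = inferred_dtype([0, 1])                # a signed integer dtype

for label, dt in [
    ("['0', '1']", str_short),
    ("['ab', 'abcdef']", str_long),
    ("[0, 1]", ints),
]:
    # kind 'U' is fixed-width unicode, kind 'i' is signed integer
    print(f"{label:>18} -> kind={dt.kind!r}, dtype={dt}")
```

Kind 'U' is the fixed-width unicode dtype that cupy rejects, and the number after U is the length of the longest string in the list, which would explain why the reported messages differ between <U1 and <U6 while the integer list goes through.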

Now it gets messy, because there's some logic error in list column setitem that precludes many of the other approaches from working.

As @shwina says, for the specific case of setting a "scalar" broadcastable value:

df.loc[df.a == "yes", "b"] = cudf.Scalar(["hello"])

works

If you want to set something more complicated (say):

df = cudf.DataFrame(data={"a": ["yes", "no", "yes"], "b": [["l1", "l2"], ["c", "d"], ["e"]]})

df.loc[df.a == "yes", "b"] = [["a"], ["g"]]

This fails, and all of the workarounds are very internals-heavy:

WARNING, WARNING: Do not use this code!
import cudf
from cudf.core.column.column import ColumnBase, as_column
df = cudf.DataFrame(data={"a": ["yes", "no", "yes"], "b": [["l1", "l2"], ["c", "d"], ["e"]]})

ColumnBase.__setitem__(df.b._column, (df.a == "yes")._column, as_column([["x"], ["y"]]))

In [138]: df
Out[138]: 
     a       b
0  yes     [x]
1   no  [c, d]
2  yes     [y]

@wence-
Contributor

wence- commented Oct 21, 2022

This also works: df.loc[:,'d'] = cudf.Series(['1', '2', '3'])

But the last statement fails when re-executed:

This works the first time because it's adding a new column. The second time you execute things, the column already exists so you go down the bad code path.

@Ullar-Kask
Author

I've found a workaround for my problem. Instead of conditionally assigning a string list, I create a dataframe that uses the original index, populate it, and then merge it with the original df, joining on the index:

df_neg= cudf.DataFrame({'col1': cudf.Series(dtype='object')})

for loop:
   mask = df[<condition>].index
   df_tmp = cudf.DataFrame([<some string list>], index=mask, columns=['col1'])
   df_neg= cudf.concat([df_neg, df_tmp], axis=0)

# After the loop:
df = df.merge(df_neg, how='left', right_index=True, left_index=True)
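
A concrete version of the same idea, as a sketch only. It is written against pandas so it runs without a GPU; the workaround above uses the same calls on cudf, and equivalent behavior there is assumed rather than verified here. The data and the replacements mapping are hypothetical, and the pieces are collected in a list and concatenated once instead of inside the loop.

```python
import pandas as pd

# Illustrative frame; column names and values are made up.
df = pd.DataFrame({"a": ["yes", "no", "yes", "maybe"], "b": ["x", "y", "z", "w"]})

# Hypothetical mapping: condition value -> strings for the matching rows.
replacements = {"yes": ["0", "1"], "maybe": ["2"]}

pieces = []
for key, values in replacements.items():
    # Index labels of the rows that satisfy this condition.
    idx = df.index[df["a"] == key]
    # One small frame per condition, aligned to the original index.
    pieces.append(pd.DataFrame({"col1": values}, index=idx))

# Concatenate once after the loop, then left-join on the index.
df_neg = pd.concat(pieces, axis=0)
result = df.merge(df_neg, how="left", left_index=True, right_index=True)

print(result)
```

Rows that match no condition end up with a missing value in col1, since the join is a left join on the original index.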
