[BUG] Assignment of string list to column doesn't work #11944

Ullar-Kask · 2022-10-19T06:53:58Z

Describe the bug

This does not work:

df.loc[df['column'] =='value', 'column2'] = ['0','1']

TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU matrix, consider using .to_cupy()
To explicitly construct a host matrix, consider using .to_numpy().

Integers do work:
df.loc[df['column']=='value', 'column2'] = [0,1]

Both work in pandas.
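
For reference, a minimal pandas sketch of the behaviour being compared against. The frame contents are made up for illustration (only the column names come from the report), and the list length has to match the number of rows the mask selects:

```python
import pandas as pd

# Hypothetical data; only the column names follow the report.
df = pd.DataFrame({
    "column": ["value", "other", "value"],
    "column2": ["a", "b", "c"],
    "column3": [10, 20, 30],
})

mask = df["column"] == "value"  # selects rows 0 and 2

# String list: assigned element-wise to the selected rows.
df.loc[mask, "column2"] = ["0", "1"]

# Integer list: same pattern.
df.loc[mask, "column3"] = [0, 1]

print(df)
```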

Rapids 22.08
Ubuntu 20

vyasr · 2022-10-20T16:09:03Z

@Ullar-Kask could you include a MWE, specifically how you constructed your df? Here is what I observe on the latest version of cudf, which looks like the same issue as #11298:

>>> df = cudf.DataFrame({'a': ['c', 'd', 'e'], 'b': ['x', 'y', 'z']})
>>> df.loc[df['a'] == 'c', 'b'] = ['0']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nvme/0/vyasr/rapids/cudf/python/cudf/cudf/core/dataframe.py", line 149, in __setitem__
    return self._setitem_tuple_arg(key, value)
  File "/nvme/0/vyasr/rapids/compose/etc/conda/cuda_11.5/envs/rapids/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/nvme/0/vyasr/rapids/cudf/python/cudf/cudf/core/dataframe.py", line 380, in _setitem_tuple_arg
    value = cupy.asarray(value)
  File "/nvme/0/vyasr/rapids/compose/etc/conda/cuda_11.5/envs/rapids/lib/python3.8/site-packages/cupy/_creation/from_data.py", line 76, in asarray
    return _core.array(a, dtype, False, order)
  File "cupy/_core/core.pyx", line 2357, in cupy._core.core.array
  File "cupy/_core/core.pyx", line 2381, in cupy._core.core.array
  File "cupy/_core/core.pyx", line 2509, in cupy._core.core._array_default
ValueError: Unsupported dtype <U1

quasiben · 2022-10-20T16:36:59Z

@wence- is also working on some potentially related fixes in #11904

shwina · 2022-10-20T17:10:33Z

As a note, I believe this will work:

df.loc[df['column'] =='value', 'column2'] = cudf.Scalar(['0','1'])

(this is still a bug -- we shouldn't require the user to construct a Scalar manually)

Ullar-Kask · 2022-10-20T17:47:03Z

@vyasr I've tried various ways to circumvent the error, e.g.

df.loc[df['column'] =='value', 'column2'] = ['0','1']
or,

neg_lst = random.choices(products_lst, k=len(lst))
df.loc[df['column'] =='value', 'column2'] = neg_lst

or,
df.loc[df['column'] =='value', 'column2'] = cudf.Series(neg_lst)

but all fail with slightly different error msgs:
ValueError: Unsupported dtype <U1
or
ValueError: Unsupported dtype <U6
or

TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU matrix, consider using .to_cupy()
To explicitly construct a host matrix, consider using .to_numpy().
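
A note on why the two ValueError messages differ: <U1 and <U6 are NumPy fixed-width unicode dtypes, where the number is the length of the longest string in the list. The sketch below only shows that inference on the host side. It assumes (going by the traceback earlier in the thread, not by reading the cupy source) that cupy.asarray ends up with a dtype like this and rejects it. The product names are hypothetical.

```python
import numpy as np

# NumPy infers a fixed-width unicode dtype from the longest element.
short = np.asarray(["0", "1"])
longer = np.asarray(["apple", "banana"])  # hypothetical product names

print(short.dtype)   # unicode, width 1
print(longer.dtype)  # unicode, width 6

# Both are kind "U" (unicode string), which is not a numeric dtype.
print(short.dtype.kind, longer.dtype.kind)
```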

Ullar-Kask · 2022-10-21T13:33:42Z

df = cudf.DataFrame({'a': ['c', 'd', 'e'], 'b': ['x', 'y', 'z']})

BTW, this works:
df['c'] = cudf.Series(['1', '2', '3'])

This also works:
df.loc[:,'d'] = cudf.Series(['1', '2', '3'])

But the last statement fails when re-executed:

TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU matrix, consider using .to_cupy()
To explicitly construct a host matrix, consider using .to_numpy().

wence- · 2022-10-21T13:35:56Z

Here's a complete example:

import cudf
df = cudf.DataFrame(data={"a": ["yes", "no"], "b": [["l1", "l2"], ["c", "d"]]})

df.loc[df.a == "yes", "b"] = ["hello"]

This goes through _DataFrameLocIndexer._setitem_tuple_arg. For this value, is_scalar returns False and isinstance(value, DataFrame) also returns False, so we try to turn the value (["hello"] here, ["0", "1"] in the original report) into a cupy array, which fails because cupy can't handle str dtypes.

Now it gets messy, because there's some logic error in list column setitem that precludes many of the other approaches from working.

As @shwina says, for the specific case of setting a "scalar" broadcastable value:

df.loc[df.a == "yes", "b"] = cudf.Scalar(["hello"])

works
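
To make the difference between the raw list and the Scalar-wrapped list concrete, here is a toy model of the branch selection described above. It is not the cudf source: WrappedScalar and pick_setitem_path are hypothetical names, and the checks are crude stand-ins for is_scalar, the DataFrame test, and the array conversion.

```python
import numbers


class WrappedScalar:
    """Hypothetical stand-in for cudf.Scalar: marks any payload as one value."""

    def __init__(self, value):
        self.value = value


def pick_setitem_path(value):
    """Toy model of the branch choice; not the actual cudf implementation."""
    if isinstance(value, (WrappedScalar, str, numbers.Number)):
        return "broadcast"  # one value copied into every selected row
    if hasattr(value, "columns"):  # crude stand-in for isinstance(value, DataFrame)
        return "dataframe"
    # Anything else is treated as array-like. A numeric-only array type has
    # no representation for str elements, so this is where the model fails.
    if any(isinstance(item, str) for item in value):
        raise ValueError("array conversion does not support str elements")
    return "array"


print(pick_setitem_path(WrappedScalar(["hello"])))
print(pick_setitem_path([0, 1]))
try:
    pick_setitem_path(["hello"])
except ValueError as exc:
    print("raw list of str:", exc)
```

In this model the wrapped list never reaches the array branch, which matches the observation that the Scalar spelling works while the bare list does not.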

If you want to set something more complicated (say):

df = cudf.DataFrame(data={"a": ["yes", "no", "yes"], "b": [["l1", "l2"], ["c", "d"], ["e"]]})

df.loc[df.a == "yes", "b"] = [["a"], ["g"]]

This fails, and all of the workarounds are very internals-heavy:

WARNING, WARNING: Do not use this code!

import cudf
from cudf.core.column.column import ColumnBase, as_column
df = cudf.DataFrame(data={"a": ["yes", "no", "yes"], "b": [["l1", "l2"], ["c", "d"], ["e"]]})

ColumnBase.__setitem__(df.b._column, (df.a == "yes")._column, as_column([["x"], ["y"]]))

In [138]: df
Out[138]: 
     a       b
0  yes     [x]
1   no  [c, d]
2  yes     [y]

wence- · 2022-10-21T13:37:11Z

> This also works: df.loc[:,'d'] = cudf.Series(['1', '2', '3'])
>
> But the last statement fails when re-executed:

This works the first time because it's adding a new column. The second time you execute things, the column already exists so you go down the bad code path.

Ullar-Kask · 2022-11-02T15:07:16Z

I've found a workaround for my problem. Instead of conditionally assigning a list of strings, I create a new dataframe on the matching part of the original index, populate it, and then merge it with the original df, joining on the index:

df_neg= cudf.DataFrame({'col1': cudf.Series(dtype='object')})

for loop:
   mask = df[<condition>].index
   df_tmp = cudf.DataFrame([<some string list>], index=mask, columns=['col1'])
   df_neg= cudf.concat([df_neg, df_tmp], axis=0)

# After the loop:
df = df.merge(df_neg, how='left', right_index=True, left_index=True)
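
A runnable sketch of the same pattern, written against pandas as a stand-in so it can be run without a GPU; it has not been checked against cudf. The conditions and replacement strings are hypothetical, and each list must have as many entries as rows matched by its condition.

```python
import pandas as pd

# Hypothetical data and per-condition replacement values.
df = pd.DataFrame({"column": ["value", "other", "value"], "x": [1, 2, 3]})
assignments = {"value": ["0", "1"], "other": ["9"]}

pieces = []
for key, strings in assignments.items():
    # Index labels of the rows matching this condition.
    idx = df[df["column"] == key].index
    # One small frame per condition, aligned to the original index.
    pieces.append(pd.DataFrame({"col1": strings}, index=idx))

# After the loop: stack the pieces and join them back by index.
df_neg = pd.concat(pieces, axis=0)
df = df.merge(df_neg, how="left", left_index=True, right_index=True)

print(df)
```

Collecting the pieces in a list and concatenating once avoids growing df_neg inside the loop, which is a small departure from the snippet above.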

Ullar-Kask added the "Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels Oct 19, 2022

shwina added the "Python" (Affects Python cuDF API.) label and removed the "Needs Triage" (Need team to review and classify) label Oct 19, 2022

GregoryKimball added this to the Pandas API Alignment and Coverage milestone Nov 19, 2022

wence- self-assigned this Nov 29, 2022

wence- modified the milestones: Pandas API Alignment and Coverage, List and Struct data types and operations Nov 29, 2022

wence- mentioned this issue Feb 16, 2023

[ENH]: Reworking of iloc and loc indexing #12793

Open

wence- mentioned this issue Jun 7, 2023

[FEA] Support non-broadcasting (non-scalar) assignment on list columns #13291

Open

wence- mentioned this issue Mar 6, 2024

[BUG] Unable to update array column for subset of rows #15233

Open

vyasr added this to cuDF Python May 24, 2024

github-project-automation bot moved this to Todo in cuDF Python May 24, 2024
