[BUG] In-place updates with `loc` or `iloc` don't work correctly when the LHS has more than one column #7377

magnatelee · 2021-02-12T06:40:19Z

Describe the bug
In-place updates with `loc` seem to work only when the RHS is broadcast. The three cases below work fine in pandas but crash in cuDF. The same issue exists with `iloc`.

Steps/Code to reproduce bug

(Stacktraces are elided for brevity.)

Bug 1:

>>> x = cudf.DataFrame({"x" : [1,2,3], "y" : [4,5,6]})
>>> x.loc[[0], ["x", "y"]] = [10, 20]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/column/column.py", line 693, in __setitem__
    raise ValueError(msg)
ValueError: Size mismatch: cannot set value of size 2 to indexing result of size 1

# The same code works in Pandas
>>> x = pd.DataFrame({"x" : [1,2,3], "y" : [4,5,6]})
>>> x.loc[[0], ["x", "y"]] = [10, 20]
>>> x
    x   y
0  10  20
1   2   5
2   3   6

Bug 2:

>>> x = cudf.DataFrame({"x" : [1,2,3], "y" : [4,5,6]})
>>> x.loc[[0,2], ["x", "y"]] = [[10, 30], [20, 40]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/pandas/core/dtypes/common.py", line 189, in <lambda>
    issubclass(tipo, klasses)
TypeError: issubclass() arg 1 must be a class
>>> x = pd.DataFrame({"x" : [1,2,3], "y" : [4,5,6]})
>>> x.loc[[0,2], ["x", "y"]] = [[10, 30], [20, 40]]
>>> x
    x   y
0  10  30
1   2   5
2  20  40

Bug 3 (note that the index of the RHS is set to [0, 2] to make it align with the LHS):

>>> x = cudf.DataFrame({"x" : [1,2,3], "y" : [4,5,6]})
>>> x.loc[[0,2], ["x", "y"]] = cudf.DataFrame({"x" : [10, 20], "y" : [30, 40]}, index=cudf.Index([0, 2]))
Traceback (most recent call last):
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/column/column.py", line 1842, in as_column
    memoryview(arbitrary), dtype=dtype, nan_as_null=nan_as_null
TypeError: memoryview: a bytes-like object is required, not 'DataFrame'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/column/column.py", line 1871, in as_column
    else nan_as_null,
  ...
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/dataframe.py", line 958, in __arrow_array__
    "Implicit conversion to a host PyArrow Table via __arrow_array__ "
TypeError: Implicit conversion to a host PyArrow Table via __arrow_array__ is not allowed, To explicitly construct a PyArrow Table, consider using .to_arrow()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/dataframe.py", line 950, in __array__
    "Implicit conversion to a host NumPy array via __array__ is not "
TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU matrix, consider using .as_gpu_matrix()
To explicitly construct a host matrix, consider using .as_matrix()

>>> x = pd.DataFrame({"x" : [1,2,3], "y" : [4,5,6]})
>>> x.loc[[0,2], ["x", "y"]] = pd.DataFrame({"x" : [10, 20], "y" : [30, 40]}, index=pd.Index([0, 2]))
>>> x
    x   y
0  10  30
1   2   5
2  20  40

Expected behavior
The examples above should work rather than crashing.

Environment overview (please complete the following information)

Environment location: Bare-metal
Method of cuDF install: conda (cudf-0.18.0a+157.g04aa30cf11)

github-actions · 2021-03-14T18:17:52Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

…as more than one column (#9918)

Fixes: #7377

This PR enables `setitem` with a scalar value, a dataframe, or an array/list iterable in both `DataFrameLocIndexer` and `DataFrameIlocIndexer`. Only the following cases are currently supported in cudf:

- Scalar value: follows the original code path and assigns column values via the specified key (row label).
- Dataframe: checks for column alignment between LHS and RHS, then uses a scatter map of the indices to assign column values accordingly. NA is substituted for columns not found in the RHS.
- All other cases (array, list, range value, etc.): the value is first converted to a cupy array, followed by special handling:
  * If it is a 2d array and the inner dimension is 1, it is broadcastable to all columns of the dataframe.
  * Otherwise the value must be a 1d array (scalar values are handled in case 1 above), and there are 2 subcases:
    * If the key on the column axis is a scalar, the user is indexing a single column, so the 1d value is assigned along that column.
    * Otherwise, the key on the column axis is a 1d array. In this case, the key on the row axis can be a scalar or 1d, and in both cases of row key, the ith element in value corresponds to the ith row in the indexed object. If the key is 1d, a broadcast will happen.

Authors:
- Sheilah Kirui (https://github.com/skirui-source)
- Michael Wang (https://github.com/isVoid)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Ashwin Srinath (https://github.com/shwina)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Michael Wang (https://github.com/isVoid)

URL: #9918
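
To make the value-handling cases in the commit message above easier to follow, here is a rough pure-Python sketch of the same kind of dispatch on a dict-of-lists table. It is not cuDF's implementation: `set_block` and the table layout are hypothetical names made up for illustration, the DataFrame (alignment) case is left out, and the 1d case with a list of columns is modelled on the pandas output shown for Bug 1 in the issue (element i goes to the ith selected column).

```python
from numbers import Number


def set_block(table, rows, cols, value):
    """Assign `value` to the cells selected by `rows` and `cols`.

    `table` maps column name -> list of values (a stand-in for a DataFrame).
    `rows` is a list of row positions; `cols` is one column name or a list
    of column names. Hypothetical helper, not a cuDF or pandas API.
    """
    single_col = isinstance(cols, str)
    col_list = [cols] if single_col else list(cols)

    # Case 1: a scalar is written to every selected cell.
    if isinstance(value, Number):
        for c in col_list:
            for r in rows:
                table[c][r] = value
        return

    # Case 2: a 2d value (list of row lists), one inner list per selected row.
    if value and isinstance(value[0], (list, tuple)):
        if len(value) != len(rows):
            raise ValueError("number of value rows must match selected rows")
        for r, row_vals in zip(rows, value):
            if len(row_vals) == 1:
                # Inner dimension of 1: broadcast across all selected columns.
                row_vals = list(row_vals) * len(col_list)
            if len(row_vals) != len(col_list):
                raise ValueError("value width must match selected columns")
            for c, v in zip(col_list, row_vals):
                table[c][r] = v
        return

    # Case 3: a 1d value.
    if single_col:
        # Single column selected: one element per selected row.
        if len(value) != len(rows):
            raise ValueError("value length must match selected rows")
        for r, v in zip(rows, value):
            table[cols][r] = v
    else:
        # List of columns selected: one element per column, repeated
        # (broadcast) over every selected row, as in the Bug 1 pandas output.
        if len(value) != len(col_list):
            raise ValueError("value length must match selected columns")
        for c, v in zip(col_list, value):
            for r in rows:
                table[c][r] = v


# Bug 1 from the issue: 1d value, one row, two columns.
bug1 = {"x": [1, 2, 3], "y": [4, 5, 6]}
set_block(bug1, [0], ["x", "y"], [10, 20])

# Bug 2 from the issue: 2d value, two rows, two columns.
bug2 = {"x": [1, 2, 3], "y": [4, 5, 6]}
set_block(bug2, [0, 2], ["x", "y"], [[10, 30], [20, 40]])
```

With these inputs, `bug1` and `bug2` end up holding the same values as the pandas frames printed in the issue for Bug 1 and Bug 2.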

magnatelee added the bug and Needs Triage labels Feb 12, 2021

magnatelee changed the title ~~[BUG] In-place updates with loc or iloc don't work when the LHS and RHS have the same dimensionality~~ [BUG] In-place updates with loc or iloc don't work correctly when the LHS has more than one column Feb 12, 2021

shwina added the Python label and removed the Needs Triage label Feb 12, 2021

github-actions bot added the inactive-30d label Mar 14, 2021

skirui-source self-assigned this Jul 15, 2021

skirui-source mentioned this issue Jul 27, 2021

BUG FIX: In-place updates with loc or iloc don't work correctly when the LHS has more than one column #8876

Closed

skirui-source mentioned this issue Dec 16, 2021

[BUG FIX] In-place updates with loc or iloc don't work correctly when the LHS has more than one column #9918

Merged

rapids-bot bot closed this as completed in #9918 May 4, 2022
