[BUG] In-place updates with loc or iloc don't work correctly when the LHS has more than one column #7377

Closed
magnatelee opened this issue Feb 12, 2021 · 1 comment · Fixed by #9918
Assignees
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments

@magnatelee
Contributor

Describe the bug
In-place updates with loc seem to work only when the RHS is broadcast. The following three cases work fine in Pandas but raise exceptions in cuDF. The same issue exists with iloc.

Steps/Code to reproduce bug

(Stacktraces are elided for brevity.)

Bug 1:

>>> x = cudf.DataFrame({"x" : [1,2,3], "y" : [4,5,6]})
>>> x.loc[[0], ["x", "y"]] = [10, 20]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/column/column.py", line 693, in __setitem__
    raise ValueError(msg)
ValueError: Size mismatch: cannot set value of size 2 to indexing result of size 1

# The same code works in Pandas
>>> x = pd.DataFrame({"x" : [1,2,3], "y" : [4,5,6]})
>>> x.loc[[0], ["x", "y"]] = [10, 20]
>>> x
    x   y
0  10  20
1   2   5
2   3   6

Bug 2:

>>> x = cudf.DataFrame({"x" : [1,2,3], "y" : [4,5,6]})
>>> x.loc[[0,2], ["x", "y"]] = [[10, 30], [20, 40]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/pandas/core/dtypes/common.py", line 189, in <lambda>
    issubclass(tipo, klasses)
TypeError: issubclass() arg 1 must be a class
>>> x = pd.DataFrame({"x" : [1,2,3], "y" : [4,5,6]})
>>> x.loc[[0,2], ["x", "y"]] = [[10, 30], [20, 40]]
>>> x
    x   y
0  10  30
1   2   5
2  20  40

Bug 3 (note that the index of the RHS is set to [0, 2] to make it align with the LHS):

>>> x = cudf.DataFrame({"x" : [1,2,3], "y" : [4,5,6]})
>>> x.loc[[0,2], ["x", "y"]] = cudf.DataFrame({"x" : [10, 20], "y" : [30, 40]}, index=cudf.Index([0, 2]))
Traceback (most recent call last):
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/column/column.py", line 1842, in as_column
    memoryview(arbitrary), dtype=dtype, nan_as_null=nan_as_null
TypeError: memoryview: a bytes-like object is required, not 'DataFrame'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/column/column.py", line 1871, in as_column
    else nan_as_null,
  ...
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/dataframe.py", line 958, in __arrow_array__
    "Implicit conversion to a host PyArrow Table via __arrow_array__ "
TypeError: Implicit conversion to a host PyArrow Table via __arrow_array__ is not allowed, To explicitly construct a PyArrow Table, consider using .to_arrow()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/dataframe.py", line 950, in __array__
    "Implicit conversion to a host NumPy array via __array__ is not "
TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU matrix, consider using .as_gpu_matrix()
To explicitly construct a host matrix, consider using .as_matrix()

>>> x = pd.DataFrame({"x" : [1,2,3], "y" : [4,5,6]})
>>> x.loc[[0,2], ["x", "y"]] = pd.DataFrame({"x" : [10, 20], "y" : [30, 40]}, index=pd.Index([0, 2]))
>>> x
    x   y
0  10  30
1   2   5
2  20  40

Expected behavior
The examples above should work rather than crashing.

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of cuDF install: conda (cudf-0.18.0a+157.g04aa30cf11)
@magnatelee magnatelee added bug Something isn't working Needs Triage Need team to review and classify labels Feb 12, 2021
@magnatelee magnatelee changed the title [BUG] In-place updates with loc or iloc don't work when the LHS and RHS have the same dimensionality [BUG] In-place updates with loc or iloc don't work correctly when the LHS has more than one column Feb 12, 2021
@shwina shwina added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Feb 12, 2021
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@skirui-source skirui-source self-assigned this Jul 15, 2021
@rapids-bot rapids-bot bot closed this as completed in #9918 May 4, 2022
rapids-bot bot pushed a commit that referenced this issue May 4, 2022
…as more than one column (#9918)

Fixes: #7377

This PR enables `setitem` with a scalar value, a DataFrame, or an array/list iterable in both `DataFrameLocIndexer` and `DataFrameIlocIndexer`. Only the following cases are currently supported in cudf:
- Scalar value: follows the original code path and assigns column values via the specified key (row label).
- DataFrame: checks column alignment between the LHS and RHS, then uses a scatter map of the indices to assign column values accordingly. NA is substituted for columns not found in the RHS.
- All other cases (array, list, range, etc.): the value is first converted to a cupy array, followed by special handling:
   * If the value is a 2d array: if the inner dimension is 1, it is broadcastable to all columns of the dataframe.
   * Otherwise the value must be a 1d array (scalar values are handled in the first case above). There are two subcases:
     * If the key on the column axis is a scalar, the user is indexing a single column, so the 1d value should be assigned along the columns.
     * Otherwise, the key on the column axis is a 1d array. In this case the key on the row axis can be a scalar or 1d, and in both cases the ith element of the value corresponds to the ith row in the indexed object. If the key is 1d, a broadcast happens.
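The dispatch above can be illustrated with a small pure-Python model. This is only a sketch, not the cuDF implementation: `ToyFrame` and `toy_loc_setitem` are hypothetical names, plain lists stand in for cupy arrays, and `None` stands in for NA. Two points are assumptions on my part: the DataFrame case scatters by the RHS index labels, and a 1d value with a list-like column key is spread across the selected columns, which is what the Pandas output for Bug 1 in the issue shows.

```python
from dataclasses import dataclass


@dataclass
class ToyFrame:
    """Hypothetical stand-in for a DataFrame: row labels plus named columns."""
    index: list
    columns: dict  # column name -> list of values, one per row label


def toy_loc_setitem(frame, rows, cols, value):
    """Assign `value` at row labels `rows` and column key `cols` (str or list)."""
    single_column = isinstance(cols, str)
    names = [cols] if single_column else list(cols)
    pos = [frame.index.index(label) for label in rows]

    if isinstance(value, ToyFrame):
        # DataFrame RHS: scatter by the RHS index labels (assumption);
        # columns missing from the RHS receive None as a stand-in for NA.
        scatter = [frame.index.index(label) for label in value.index]
        for name in names:
            src = value.columns.get(name)
            for k, p in enumerate(scatter):
                frame.columns[name][p] = None if src is None else src[k]
        return

    if not isinstance(value, (list, tuple)):
        # Scalar RHS: broadcast to every selected cell.
        for name in names:
            for p in pos:
                frame.columns[name][p] = value
        return

    if value and isinstance(value[0], (list, tuple)):
        # 2d RHS: one inner list per selected row.
        width = len(value[0])
        if (len(value) != len(pos)
                or any(len(row) != width for row in value)
                or width not in (1, len(names))):
            raise ValueError("shape mismatch between value and indexed region")
        for r, p in enumerate(pos):
            for c, name in enumerate(names):
                # An inner dimension of 1 is broadcast to all columns.
                frame.columns[name][p] = value[r][0 if width == 1 else c]
        return

    if single_column:
        # 1d RHS, scalar column key: one element per selected row.
        if len(value) != len(pos):
            raise ValueError("size mismatch between value and selected rows")
        for r, p in enumerate(pos):
            frame.columns[cols][p] = value[r]
    else:
        # 1d RHS, list-like column key: one element per selected column,
        # broadcast over the selected rows (assumption, see lead-in).
        if len(value) != len(names):
            raise ValueError("size mismatch between value and selected columns")
        for c, name in enumerate(names):
            for p in pos:
                frame.columns[name][p] = value[c]


def fresh():
    return ToyFrame(index=[0, 1, 2], columns={"x": [1, 2, 3], "y": [4, 5, 6]})


# The three cases from the issue, replayed against the toy model.
bug1 = fresh()
toy_loc_setitem(bug1, [0], ["x", "y"], [10, 20])

bug2 = fresh()
toy_loc_setitem(bug2, [0, 2], ["x", "y"], [[10, 30], [20, 40]])

bug3 = fresh()
rhs = ToyFrame(index=[0, 2], columns={"x": [10, 20], "y": [30, 40]})
toy_loc_setitem(bug3, [0, 2], ["x", "y"], rhs)
```

With these inputs the toy frames end up holding the same values as the Pandas results quoted in the issue. How cuDF handles row alignment and dtypes in practice may differ from this model.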

Authors:
  - Sheilah Kirui (https://github.com/skirui-source)
  - Michael Wang (https://github.com/isVoid)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Michael Wang (https://github.com/isVoid)

URL: #9918