[BUG] Unable to update array column for subset of rows #15233

Open
dagardner-nv opened this issue Mar 5, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@dagardner-nv
Contributor

dagardner-nv commented Mar 5, 2024

Describe the bug
Unable to update a list (array) column of a DataFrame for a subset of rows via .loc when the new values are given as a list of lists.

Tested with versions: 24.02.02 & nightly 24.04.00a508

Updating the values for all rows works:

df['a'] = [[0,0,0], [9, 10, 11], [20, 21, 22]]

However, updating a subset of rows with a boolean mask fails:

df.loc[mask, 'a'] = new_values

I believe at least part of the problem is in python/cudf/cudf/core/column/lists.py, around line 90:

    def __setitem__(self, key, value):
        if isinstance(value, list):
            value = cudf.Scalar(value)
        if isinstance(value, cudf.Scalar):
            if value.dtype != self.dtype:
                raise TypeError("list nesting level mismatch")

When the column is ListDtype(int64) and the incoming values look like [[9, 10, 11], [20, 21, 22]], the whole list of lists is wrapped in a single cudf.Scalar, which has a dtype of ListDtype(ListDtype(int64)).

As a result, value.dtype != self.dtype evaluates to true and the TypeError ("list nesting level mismatch") is raised.
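
To make the mismatch concrete, here is a minimal pure-Python sketch that counts list nesting levels. It does not use cuDF; `nesting_level` is a hypothetical helper written only for illustration, on the assumption that each extra list layer corresponds to one extra ListDtype wrapper.

```python
def nesting_level(value):
    """Return how many list layers wrap the innermost values (0 for a non-list).

    Hypothetical helper for illustration only; not part of cuDF.
    """
    if not isinstance(value, list):
        return 0
    # An empty list still counts as one layer.
    return 1 + max((nesting_level(item) for item in value), default=0)


column_rows = [[3, 2, 1], [4, 5, 6], [8, 7, 9]]  # rows of column 'a'
new_values = [[9, 10, 11], [20, 21, 22]]         # replacement rows

# Each element of the column is a flat list of ints: one layer.
element_level = nesting_level(column_rows[0])
# Treating all replacement rows as ONE value adds a second layer.
scalar_level = nesting_level(new_values)

print(element_level, scalar_level)  # prints: 1 2
```

The two levels differ by one, which matches the ListDtype(int64) versus ListDtype(ListDtype(int64)) comparison described above.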

Steps/Code to reproduce bug

import os

import cupy
import numpy
import pandas as pd
import cudf

data = {'apple': ['pie', 'cake', 'candy'], 'a': [[3,2,1], [4,5,6], [8,7,9]], 'b': [10, 20, 30]}


if os.environ.get('USE_PANDAS') is not None:
    df = pd.DataFrame(data)
else:
    df = cudf.DataFrame(data)

mask = [False, True, True]
new_values = [[9, 10, 11], [20, 21, 22]]

# Other datatypes work
df.loc[mask, 'b'] = [25, 35]
df['b'].loc[mask] = [26, 36]
print(df)

print(f"array column type: {repr(df['a'].dtype)}", flush=True)

try:
    df.loc[mask, 'a'] = new_values
except ValueError as e:
    print(f"Encountered error setting new values ({e}) trying another way")

    try:
        df['a'].loc[mask] = new_values
    except TypeError as e:
        print(f"Encountered error setting new values ({e})")

print(df)

Expected behavior
The values in column 'a' are updated for the rows selected by the mask.
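
For reference, the expected row-wise result can be sketched in plain Python. `masked_replace` is a hypothetical helper, not a cuDF API and not the workaround from #11944; it only spells out which rows should change.

```python
def masked_replace(rows, mask, new_values):
    """Return a copy of rows where each masked row is replaced, in order,
    by the next entry of new_values.

    Hypothetical helper for illustration only; not part of cuDF or pandas.
    """
    if len(mask) != len(rows):
        raise ValueError("mask length must match the number of rows")
    if sum(bool(m) for m in mask) != len(new_values):
        raise ValueError("need exactly one new value per selected row")
    replacements = iter(new_values)
    return [next(replacements) if selected else row
            for row, selected in zip(rows, mask)]


rows = [[3, 2, 1], [4, 5, 6], [8, 7, 9]]
mask = [False, True, True]
new_values = [[9, 10, 11], [20, 21, 22]]

merged = masked_replace(rows, mask, new_values)
print(merged)  # prints: [[3, 2, 1], [9, 10, 11], [20, 21, 22]]
```

Since assigning every row at once (df['a'] = ...) is reported above to work, a merged list like this could in principle be assigned back to the whole column. Fetching the existing rows to the host first is an assumption of this sketch and would cost a device-to-host round trip.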

Environment overview (please complete the following information)

  • Environment location: [Bare-metal]
  • Method of cuDF install: [conda]

dagardner-nv added the bug label Mar 5, 2024

@wence-
Contributor

wence- commented Mar 6, 2024

Related: #11944 and #13291. I think the workaround I posted in #11944 (comment) should work, but I don't like it.

@dagardner-nv
Contributor Author

@wence- Thanks, the workaround in #11944 works for me. Feel free to close this issue as a duplicate.
