[BUG] Loc and iloc Indexing with a scalar on list columns fails #8032

Closed
esnvidia opened this issue Apr 22, 2021 · 7 comments
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@esnvidia

Describe the bug
Selecting a single row with loc or iloc from the list column produced by a string split raises a RuntimeError. Example:

Example data:
12345       Bad Card Number,Bad CVV,
37880       Bad Zipcode,Technical Glitch,

command
gdf.errors.str.strip(',').str.split(',').loc[37880]  # or iloc

RuntimeError: cuDF failure at: /opt/conda/envs/rapids/conda-bld/libcudf_1615843445425/work/cpp/src/copying/get_element.cu:125: get_element_functor not supported for list_view

Steps/Code to reproduce bug
Follow this guide http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports to craft a minimal bug report. This helps us reproduce the issue you're having and resolve the issue more quickly.

Expected behavior
array(['Bad Zipcode', 'Technical Glitch'], dtype=object)
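For reference, the expected per-row result can be reproduced on the host with plain Python strings. This is only a sketch of the intended output for the two example rows above; the `rows` dict and the `split_errors` helper are illustrative names, not part of cuDF.

```python
from typing import List

# Example data from this report, keyed by index label.
rows = {
    12345: "Bad Card Number,Bad CVV,",
    37880: "Bad Zipcode,Technical Glitch,",
}


def split_errors(raw: str) -> List[str]:
    """Strip the trailing comma, then split on the remaining commas."""
    return raw.strip(",").split(",")


# Host-side equivalent of selecting label 37880 after strip + split.
expected = split_errors(rows[37880])
print(expected)
```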

Environment overview (please complete the following information)

  • Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
  • Method of cuDF install: [conda, Docker, or from source]
    • If method of install is [Docker], provide docker pull & docker run commands used

conda install. cuDF version 0.18.1


@esnvidia esnvidia added Needs Triage Need team to review and classify bug Something isn't working labels Apr 22, 2021
@beckernick
Member

beckernick commented Apr 22, 2021

Additional context:

import cudf

s = cudf.Series([[0,1], [0,2]])
print(s.loc[1:1]) # succeeds
s.loc[1]
1    [0, 2]
dtype: list
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-67-7c294f9b6e79> in <module>
      3 s = cudf.Series([[0,1], [0,2]])
      4 print(s.loc[1:1])
----> 5 s.loc[1]

/raid/nicholasb/miniconda3/envs/rapids-gpubdb-20210421/lib/python3.8/site-packages/cudf/core/indexing.py in __getitem__(self, arg)
    144             raise KeyError(arg)
    145 
--> 146         return self._sr.iloc[arg]
    147 
    148     def __setitem__(self, key, value):

/raid/nicholasb/miniconda3/envs/rapids-gpubdb-20210421/lib/python3.8/site-packages/cudf/core/indexing.py in __getitem__(self, arg)
     81         if isinstance(arg, tuple):
     82             arg = list(arg)
---> 83         data = self._sr._column[arg]
     84 
     85         if is_scalar(data) or _is_null_host_scalar(data):

/raid/nicholasb/miniconda3/envs/rapids-gpubdb-20210421/lib/python3.8/site-packages/cudf/core/column/column.py in __getitem__(self, arg)
    669     def __getitem__(self, arg) -> Union[ScalarLike, ColumnBase]:
    670         if is_scalar(arg):
--> 671             return self.element_indexing(int(arg))
    672         elif isinstance(arg, slice):
    673             start, stop, stride = arg.indices(len(self))

/raid/nicholasb/miniconda3/envs/rapids-gpubdb-20210421/lib/python3.8/site-packages/cudf/core/column/column.py in element_indexing(self, index)
    647             raise IndexError("single positional indexer is out-of-bounds")
    648 
--> 649         return libcudf.copying.get_element(self, idx).value
    650 
    651     def slice(self, start: int, stop: int, stride: int = None) -> ColumnBase:

cudf/_lib/copying.pyx in cudf._lib.copying.get_element()

RuntimeError: cuDF failure at: ../src/copying/get_element.cu:125: get_element_functor not supported for list_view
# packages in environment at /raid/nicholasb/miniconda3/envs/rapids-gpubdb-20210421:
arrow-cpp                 1.0.1           py38hcb5322d_14_cuda    conda-forge
arrow-cpp-proc            3.0.0                      cuda    conda-forge
cudf                      0.20.0a210421   cuda_11.0_py38_gd501d2c0b9_179    rapidsai-nightly
cuml                      0.20.0a210421   cuda11.0_py38_g2870d59d8_80    rapidsai-nightly
dask                      2021.4.0           pyhd8ed1ab_0    conda-forge
dask-core                 2021.4.0           pyhd8ed1ab_0    conda-forge
dask-cuda                 0.20.0a210421           py38_17    rapidsai-nightly
dask-cudf                 0.20.0a210421   py38_gd501d2c0b9_179    rapidsai-nightly
libcudf                   0.20.0a210421   cuda11.0_gd501d2c0b9_179    rapidsai-nightly
libcuml                   0.20.0a210421   cuda11.0_g2870d59d8_80    rapidsai-nightly
libcumlprims              0.20.0a210408   cuda11.0_g7f19636_2    rapidsai-nightly
librmm                    0.20.0a210421   cuda11.0_g288e8be_17    rapidsai-nightly
numpy                     1.19.5           py38h18fd61f_1    conda-forge
pandas                    1.2.4            py38h1abd341_0    conda-forge
pyarrow                   1.0.1           py38h3e2403a_14_cuda    conda-forge
rmm                       0.20.0a210421   cuda_11.0_py38_g288e8be_17    rapidsai-nightly
scipy                     1.6.2            py38h7b17777_0    conda-forge
ucx                       1.9.0+gcd9efd3       cuda11.0_0    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
ucx-py                    0.20.0a210419   py38_gcd9efd3_5    rapidsai-nightly

@beckernick beckernick changed the title [BUG] [BUG] Loc and iloc Indexing with a scalar on list columns fails Apr 22, 2021
@kkraus14 kkraus14 added Python Affects Python cuDF API. feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify bug Something isn't working labels Apr 22, 2021
@kkraus14
Collaborator

cc @isVoid as this is likely related to list scalars

@brandon-b-miller
Contributor

I think we'll need list scalar support built out, at least in Cython, for this to work. Generally a __getitem__ call, which is what this ultimately resolves to, expects libcudf to hand it a C++ scalar and then bubbles its value up to the user, first as a DeviceScalar and then as some kind of host value.

I get the same error just doing this:

>>> s = cudf.Series([[0,1], [0,2]])
>>> s[0]
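The difference between the slice and scalar paths in the traceback above can be modeled with a toy class. This is a hypothetical stand-in written in plain Python, not cuDF code: `ToyListColumn` and `to_list` are made-up names, and the error is raised by hand to mimic the state before list scalars existed.

```python
class ToyListColumn:
    """Hypothetical stand-in for a list column; not the cuDF implementation."""

    def __init__(self, rows):
        self._rows = list(rows)

    def element_indexing(self, index):
        # Models the pre-fix state: there is no scalar type to hold a list row.
        raise RuntimeError("get_element_functor not supported for list_view")

    def slice(self, start, stop, stride=None):
        # A slice returns another column, so no scalar is required.
        return ToyListColumn(self._rows[start:stop:stride])

    def __getitem__(self, arg):
        # Same shape of dispatch as in the traceback: slices and scalars diverge.
        if isinstance(arg, slice):
            start, stop, stride = arg.indices(len(self._rows))
            return self.slice(start, stop, stride)
        return self.element_indexing(int(arg))

    def to_list(self):
        return list(self._rows)


col = ToyListColumn([[0, 1], [0, 2]])

# Positional slice covering only row 1 succeeds.
sliced = col[1:2].to_list()

# Scalar access hits the unsupported element path.
try:
    col[1]
    scalar_error = None
except RuntimeError as exc:
    scalar_error = str(exc)
```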

What the value should be when it reaches the user is of course an open question, but I am leaning towards a pyarrow array, so that nulls can be handled, and away from a numpy array of object dtype.

@shwina
Contributor

shwina commented Apr 23, 2021

I'm not 100% sure about returning PyArrow objects directly to the user. There's no precedent for doing that, and users who haven't used PyArrow before will now have yet another library to learn about.

A plain list of NumPy scalars, or a NumPy array (both with cudf.NA for nulls), while suboptimal, might actually be the more user friendly choice here.

@brandon-b-miller
Contributor

Between the two of those I'd rather we do a python list containing numpy scalars and cudf.NA than a numpy array of object type. I just think that it becomes a little hard to understand exactly what object means in the cuDF universe if we're using it to denote string data and also using it here.
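A rough sketch of what the "Python list with a null sentinel" option could look like on the host side. Everything here is an assumption for illustration: `NA` is a hypothetical sentinel standing in for cudf.NA, plain Python ints stand in for NumPy scalars, and `to_host_list` is not an existing cuDF function.

```python
class _NAType:
    """Hypothetical null sentinel standing in for cudf.NA."""

    def __repr__(self):
        return "<NA>"


NA = _NAType()


def to_host_list(row):
    """Convert one list row to a Python list, mapping None (null) to NA.

    Nested lists are converted recursively, so a row of a nested list
    column comes back as a list of lists.
    """
    if row is None:
        return NA
    if isinstance(row, list):
        return [to_host_list(item) for item in row]
    return row


flat = to_host_list([0, None, 2])
nested = to_host_list([[1, None], []])
print(flat, nested)
```

Unlike a NumPy array of object dtype, the result carries no dtype at all, so it cannot be confused with the object dtype that already denotes string data.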

rapids-bot bot pushed a commit that referenced this issue May 7, 2021
Part 1 of #8032

This PR adds retrieval of row data from a `LIST` type column by adding support for the `list_view` specialization of `get_element`. The row data is stored in a scalar object.

Use example:
```
// non-nested LIST column
col = [{1, 2, 3}, {4}]
s = get_element(col, 1); // s is a type erased list_scalar, s._data == int_column{4}

// nested LIST column
col = [[{1, 2}, {3}], [{4}, {}]]
s = get_element(col, 1); // s is a type erased list_scalar, s._data == list_column{{4}, {}}
```

Implementation note:
Depends on `lists::detail::copy_slice` under the hood. Also adds a new `list_scalar` constructor that supports moving external row data to construct a new scalar.
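To make the slicing idea concrete, here is a small host-side model in Python. It assumes an Arrow-style layout (an offsets array plus a child column) and is only a sketch: `ToyLists`, this `copy_slice`, this `get_element`, and `to_pylist` are illustrative names that mirror the description above, not the libcudf functions themselves.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ToyLists:
    """Toy list column: row i spans child[offsets[i]:offsets[i + 1]]."""

    offsets: List[int]
    child: Union["ToyLists", List[int]]

    def __len__(self):
        return len(self.offsets) - 1


def copy_slice(col, start, stop):
    """Copy rows [start, stop) of a leaf list or of a ToyLists column."""
    if isinstance(col, list):
        return col[start:stop]
    lo, hi = col.offsets[start], col.offsets[stop]
    # Rebase the offsets so the copied column starts at zero.
    new_offsets = [o - lo for o in col.offsets[start:stop + 1]]
    return ToyLists(new_offsets, copy_slice(col.child, lo, hi))


def get_element(col, index):
    """Return the data of row `index` as a slice of the child column."""
    if not 0 <= index < len(col):
        raise IndexError("single positional indexer is out-of-bounds")
    return copy_slice(col.child, col.offsets[index], col.offsets[index + 1])


def to_pylist(col):
    """Materialize a leaf list or ToyLists column as nested Python lists."""
    if isinstance(col, list):
        return list(col)
    return [to_pylist(get_element(col, i)) for i in range(len(col))]


# Non-nested: [{1, 2, 3}, {4}]
flat = ToyLists([0, 3, 4], [1, 2, 3, 4])
flat_row = get_element(flat, 1)

# Nested: [[{1, 2}, {3}], [{4}, {}]]
nested = ToyLists([0, 2, 4], ToyLists([0, 2, 3, 4, 4], [1, 2, 3, 4]))
nested_row = to_pylist(get_element(nested, 1))
```

In this model the element of a non-nested column is a flat sequence and the element of a nested column is itself a list column, which matches the two cases in the use example.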

Also included in this PR:
- `is_element_valid_sync(column, i)`, helper function that returns true if `i`th row of `column` is valid.
- `list_scalar` factory functions
- Developer guide for `list_scalar`

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Mark Harris (https://github.com/harrism)
  - https://github.com/nvdbaranec
  - Nghia Truong (https://github.com/ttnghia)

URL: #8071
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@kkraus14 kkraus14 removed inactive-30d libcudf Affects libcudf (C++/CUDA) code. labels May 24, 2021
@brandon-b-miller
Contributor

Fixed by #8265
