[BUG] dask_cudf constructors turn None into empty string #7735

Closed
lmeyerov opened this issue Mar 26, 2021 · 0 comments · Fixed by #7746
Assignees
Labels
bug Something isn't working dask Dask issue Python Affects Python cuDF API.

Comments

@lmeyerov

Describe the bug

Constructing a dask_cudf Series that contains None values and then calling dropna() on it should drop those values. Instead, the None values appear to be turned into empty strings, so they survive the dropna() call.

Steps/Code to reproduce bug

import cudf
import dask_cudf
import numpy as np
from dask.distributed import Client

gs = cudf.Series(['a', 'b', None])

# cuDF drops None as expected

print(gs.dropna().to_pandas().to_list())  # => ['a', 'b']

# dask_cudf does not, even with map_partitions

with Client('dask-scheduler:8786'):
    print(dask_cudf.from_cudf(gs, npartitions=2).dropna().compute().to_pandas().to_list())
    # => ['a', 'b', '']
    
with Client('dask-scheduler:8786'):
    print(dask_cudf.from_cudf(gs, npartitions=2).fillna(value=np.nan).dropna().compute().to_pandas().to_list())
    # => ['a', 'b', '']

with Client('dask-scheduler:8786'):

    def gdf_dropper(gs):
        return gs.dropna()

    print(dask_cudf.from_cudf(gs, npartitions=2).map_partitions(gdf_dropper).compute().to_pandas().to_list())
    # => ['a', 'b', '']

Maybe dask_cudf is turning None into '' at construction time, so this could be a constructor bug:

with Client('dask-scheduler:8786'):
    print(dask_cudf.from_cudf(gs, npartitions=2).compute().to_pandas().to_list())
    # => ['a', 'b', '']

Expected behavior

Plain Dask on a pandas Series drops the None as expected:

with Client('dask-scheduler:8786'):
    print(dask.dataframe.from_pandas(gs.to_pandas(), npartitions=2).dropna().compute().to_list())
    # => ['a', 'b']

Environment overview (please complete the following information)

docker -> mamba -> rapids 0.18

Additional context

Slack: https://rapids-goai.slack.com/archives/C5E06F4DC/p1616723705030300

Blocking: graphistry/pygraphistry#225

@lmeyerov lmeyerov added Needs Triage Need team to review and classify bug Something isn't working labels Mar 26, 2021
@kkraus14 kkraus14 added Python Affects Python cuDF API. dask Dask issue and removed Needs Triage Need team to review and classify labels Mar 26, 2021
rapids-bot bot pushed a commit that referenced this issue Mar 29, 2021
Fixes: #7735 

A minimal repro of the above issue:


```python
>>> import cudf
>>> s = cudf.Series(['hi', 'hello', None])
>>> s
0       hi
1    hello
2     <NA>
dtype: string
>>> h = s[0:3]
>>> h
0       hi
1    hello
2     <NA>
dtype: string
>>> s._column.null_count
1
>>> h._column.null_count
1
```


The mask is computed incorrectly in `Column.from_column_view` because `StringColumn` computes `base_size` incorrectly:
```python
>>> s._column.mask.to_host_array()
array([3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=uint8)
>>> h._column.mask.to_host_array()
array([], dtype=uint8) # Should have a mask similar to above one.

>>> s._column.base_size
0 # Should be 3
>>> h._column.base_size
0 # Should be 3
```
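To read the mask output above, it can help to decode the bytes by hand. The sketch below is plain Python, not cuDF code: it assumes the Arrow-style layout in which bit `i` (least significant bit first) is set when row `i` is valid, and that a missing mask is read as "every row is valid". The helper name `decode_validity` is made up for illustration.

```python
def decode_validity(mask_bytes, size):
    """Return one bool per row: True if the row is valid (non-null)."""
    if len(mask_bytes) == 0:
        # Assumption: an absent mask means no row is marked as null.
        return [True] * size
    return [bool((mask_bytes[i // 8] >> (i % 8)) & 1) for i in range(size)]


s_mask = [3] + [0] * 63  # the 64 padded bytes printed for s
h_mask = []              # the empty mask printed for h

print(decode_validity(s_mask, 3))  # [True, True, False]
print(decode_validity(h_mask, 3))  # [True, True, True]
```

Under these assumptions, the third row of `h` is no longer marked as null, so it would read back as a zero-length string, which is consistent with the `''` reported in the issue.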

This PR fixes the calculation of `StringColumn.base_size` and adds tests that cover it.
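Why a wrong `base_size` can empty the mask may be sketched with a small pure-Python model. It assumes the mask buffer is sized from the row count, one bit per row, rounded up to a 64-byte multiple (consistent with the 64-element array printed for `s`), and that a string column's row count is one less than its number of offsets. Both helper names are hypothetical, and this is not the code changed in the PR.

```python
def base_size_from_offsets(offsets):
    """Rows in a string column: one fewer than the number of offsets."""
    return max(len(offsets) - 1, 0)


def mask_allocation_bytes(num_rows, padding=64):
    """Bytes for a validity mask: one bit per row, padded to `padding` bytes."""
    needed = (num_rows + 7) // 8
    return ((needed + padding - 1) // padding) * padding


# Offsets for ['hi', 'hello', None]: the null row spans zero characters.
offsets = [0, 2, 7, 7]

print(base_size_from_offsets(offsets))  # 3
print(mask_allocation_bytes(3))         # 64
print(mask_allocation_bytes(0))         # 0
```

With a reported size of 0, no mask bytes are kept at all, which matches the empty array seen for `h`.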

Authors:
  - GALI PREM SAGAR (@galipremsagar)

Approvers:
  - Keith Kraus (@kkraus14)

URL: #7746