[REVIEW] Fix list and struct meta generation issue in dask-cudf #10434

Merged: 11 commits into rapidsai:branch-22.04, Mar 16, 2022

@galipremsagar galipremsagar commented Mar 14, 2022

Fixes: #8913

This PR fixes multiple code paths that incorrectly handle list and struct types during metadata generation in dask-cudf.

>>> ddf = dask_cudf.from_cudf(df, 1)
>>> ddf._meta_nonempty
   a         b                       c
0  0  [[0, 1]]  {'a': None, 'b': None}
1  1  [[0, 1]]  {'a': None, 'b': None}
>>> df
   a         b                        c
0  1  [[1, 2]]  {'a': 1, 'b': [[1, 2]]}
1  2  [[2, 3]]  {'a': 2, 'b': [[2, 3]]}
2  3      None      {'a': 3, 'b': None}
>>> ddf._meta_nonempty.c.dtype
StructDtype({'a': dtype('int64'), 'b': ListDtype(ListDtype(int64))})
>>> df.c.dtype
StructDtype({'a': dtype('int64'), 'b': ListDtype(ListDtype(int64))})
>>> ddf._meta_nonempty.b.dtype
ListDtype(ListDtype(int64))
>>> df.b.dtype
ListDtype(ListDtype(int64))
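
In the session above, the struct column of `_meta_nonempty` shows every field as `None`, while its dtype still matches `df.c.dtype`. The plain-Python sketch below illustrates only that idea: placeholder rows carry the field names, and the type information lives in the dtype rather than in the values. `placeholder_struct_rows` is a hypothetical helper written for illustration; it is not a dask-cudf function and makes no claim about how the PR builds these rows.

```python
def placeholder_struct_rows(field_names, n_rows=2):
    """Return n_rows dicts that map every struct field name to None."""
    # Hypothetical helper: values are placeholders only; the real field
    # types would be carried separately by the column's dtype.
    return [{name: None for name in field_names} for _ in range(n_rows)]


rows = placeholder_struct_rows(["a", "b"])
print(rows)  # [{'a': None, 'b': None}, {'a': None, 'b': None}]
```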

@galipremsagar added the labels bug (Something isn't working), Python (Affects Python cuDF API.), 4 - Needs cuDF (Python) Reviewer, dask (Dask issue), and non-breaking (Non-breaking change) on Mar 14, 2022
@galipremsagar galipremsagar self-assigned this Mar 14, 2022
@galipremsagar galipremsagar requested review from a team as code owners March 14, 2022 23:34

codecov bot commented Mar 15, 2022

Codecov Report

Merging #10434 (e6b2fa2) into branch-22.04 (4596244) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@               Coverage Diff                @@
##           branch-22.04   #10434      +/-   ##
================================================
+ Coverage         86.13%   86.17%   +0.03%     
================================================
  Files               139      139              
  Lines             22438    22466      +28     
================================================
+ Hits              19328    19361      +33     
+ Misses             3110     3105       -5     
Impacted Files | Coverage Δ
python/cudf/cudf/core/tools/numeric.py | 89.24% <100.00%> (+0.11%) ⬆️
python/dask_cudf/dask_cudf/backends.py | 86.32% <100.00%> (+1.34%) ⬆️
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py | 100.00% <100.00%> (ø)
python/cudf/cudf/core/column/string.py | 88.39% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/core/groupby/groupby.py | 91.57% <0.00%> (+0.22%) ⬆️
python/cudf/cudf/core/column/numerical.py | 95.28% <0.00%> (+0.29%) ⬆️
python/cudf/cudf/core/tools/datetimes.py | 84.49% <0.00%> (+0.30%) ⬆️
python/cudf/cudf/core/column/lists.py | 90.56% <0.00%> (+0.47%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update deb39db...e6b2fa2. Read the comment docs.
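
The percentages in the coverage diff can be reproduced from the hit and line counts it reports. The sketch below assumes the displayed values are truncated (not rounded) to two decimal places; that assumption is inferred from these numbers alone and is not a statement about how Codecov computes them. `truncate2` and `coverage_pct` are names made up for this example.

```python
import math


def truncate2(value):
    """Truncate a non-negative number to two decimal places (assumed display rule)."""
    return math.floor(value * 100) / 100


def coverage_pct(hits, lines):
    """Coverage as a percentage: covered lines divided by total lines."""
    return hits / lines * 100


base = coverage_pct(19328, 22438)  # branch-22.04
head = coverage_pct(19361, 22466)  # after merging #10434

print(truncate2(base))         # 86.13
print(truncate2(head))         # 86.17
print(truncate2(head - base))  # 0.03
```

Under the same assumption, the Misses row is simply Lines minus Hits: 22438 - 19328 = 3110 and 22466 - 19361 = 3105.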

Member

@rjzamora rjzamora left a comment

This looks great - Thanks @galipremsagar !

Just a few questions/suggestions related to comments.

python/dask_cudf/dask_cudf/backends.py Outdated
python/cudf/cudf/core/dtypes.py Outdated
python/dask_cudf/dask_cudf/backends.py Outdated
Contributor

@bdice bdice left a comment

Suggestions attached. 👍

python/cudf/cudf/core/dtypes.py Outdated
python/dask_cudf/dask_cudf/backends.py Outdated
python/dask_cudf/dask_cudf/backends.py Outdated
python/dask_cudf/dask_cudf/backends.py
python/dask_cudf/dask_cudf/io/tests/test_parquet.py Outdated
Contributor

@bdice bdice left a comment

LGTM, I have one comment to double-check. I'm not sure if I made an error in my previous suggestion with respect to the number of nesting levels.

    data = ["cat", "dog"]
else:
    data = np.array([0, 1], dtype=leaf_type).tolist()
data = _nest_list_data(data, s.dtype) * 2
Contributor

Does s.dtype need to be s.dtype.leaf_type? I'm not sure if this is nesting an extra level or if this is the right number of levels. It might have been changed with my previous round of suggestions.

Contributor Author

leaf_type corresponds to the terminal type of a ListDtype, e.g., int32 or int64.

By passing s.dtype (a ListDtype), we traverse each inner type by accessing .element_type in _nest_list_data, so that we get the true number of nesting levels.

Contributor Author

For example:

>>> s.dtype
ListDtype(ListDtype(int64))
>>> s.dtype.leaf_type
int64
>>> s.dtype.element_type
ListDtype(int64)
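
The difference between `leaf_type` and `element_type` discussed in this thread can be sketched without cuDF. The `ListType` class and `nest_list_data` function below are hypothetical stand-ins written for illustration; they are not `cudf.ListDtype` or the PR's `_nest_list_data`, and the wrapping logic is an assumption chosen to reproduce the `[[0, 1]]` rows shown in the PR description.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ListType:
    """Hypothetical stand-in for a nested list dtype (not cudf.ListDtype)."""

    element_type: Union["ListType", str]

    @property
    def leaf_type(self) -> str:
        # Follow element_type until a non-list (terminal) type is reached.
        inner = self.element_type
        while isinstance(inner, ListType):
            inner = inner.element_type
        return inner


def nest_list_data(leaf_values, dtype):
    """Build a one-row column whose row nesting depth matches dtype."""
    row = list(leaf_values)
    inner = dtype.element_type
    # Wrap the row once for every list level below the outermost one.
    while isinstance(inner, ListType):
        row = [row]
        inner = inner.element_type
    return [row]


dtype = ListType(ListType("int64"))
column = nest_list_data([0, 1], dtype) * 2
print(dtype.leaf_type)  # int64
print(column)           # [[[0, 1]], [[0, 1]]]
```

Passing only the leaf type would lose the nesting depth, which is why the full dtype is needed to decide how many times to wrap the data.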

@galipremsagar
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 1649955 into rapidsai:branch-22.04 Mar 16, 2022
Labels
bug (Something isn't working), dask (Dask issue), non-breaking (Non-breaking change), Python (Affects Python cuDF API.)
Development

Successfully merging this pull request may close these issues.

[BUG] Many Dask cuDF operations fail with list or struct columns present in dataframe
3 participants