Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] get_dummies fails in dask-cudf due to dask categorical type checking #7111

Closed
beckernick opened this issue Jan 11, 2021 · 8 comments · Fixed by #7740
Closed

[BUG] get_dummies fails in dask-cudf due to dask categorical type checking #7111

beckernick opened this issue Jan 11, 2021 · 8 comments · Fixed by #7740
Labels
bug Something isn't working dask Dask issue Python Affects Python cuDF API.

Comments

@beckernick
Copy link
Member

beckernick commented Jan 11, 2021

Calling dd.get_dummies fails with dask-cudf due to Dask's reliance on the pd.api.types.is_categorical_dtype check from pandas. Our categorical columns do not return True for this check. Instead, we can use cudf.utils.dtype.is_categorical_dtype.

This issue is for tracking purposes. We'll (probably) want to abstract this from pandas in upstream Dask.

import pandas as pd
import dask.dataframe as dd
import cudf

df = pd.DataFrame(
    {"A":["a","b","b"],
     "B":[1,2,3]
    })
ddf = dd.from_pandas(df, 2)
ddf = ddf.categorize(columns=["B"])
dd.get_dummies(ddf, columns=['B']) # works
import pandas as pd
import dask.dataframe as dd
import cudfdf = pd.DataFrame(
    {"A":["a","b","b"],
     "B":[1,2,3]
    })
ddf = dd.from_pandas(df, 2)
ddf = ddf.map_partitions(cudf.from_pandas)
ddf = ddf.categorize(columns=["B"])
dd.get_dummies(ddf, columns=['B'])
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-9-ec7ee286026f> in <module>
     10 ddf = ddf.map_partitions(cudf.from_pandas)
     11 ddf = ddf.categorize(columns=["B"])
---> 12 dd.get_dummies(ddf, columns=['B'])

/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/dataframe/reshape.py in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first, dtype, **kwargs)
    142         else:
    143             if not all(is_categorical_dtype(data[c]) for c in columns):
--> 144                 raise NotImplementedError(not_cat_msg)
    145 
    146         if not all(has_known_categories(data[c]) for c in columns):

NotImplementedError: `get_dummies` with non-categorical dtypes is not supported. Please use `df.categorize()` beforehand to convert to categorical dtype.
conda list | grep "rapids\|dask"
cudf                      0.18.0a210110   cuda_10.2_py37_g04aa30cf11_157    rapidsai-nightly
cuml                      0.18.0a210110   cuda10.2_py37_gc021decd9_70    rapidsai-nightly
dask                      2020.12.0          pyhd8ed1ab_0    conda-forge
dask-core                 2020.12.0          pyhd8ed1ab_0    conda-forge
dask-cuda                 0.18.0a201211           py37_39    rapidsai-nightly
dask-cudf                 0.18.0a210110   py37_g04aa30cf11_157    rapidsai-nightly
faiss-proc                1.0.0                      cuda    rapidsai-nightly
libcudf                   0.18.0a210110   cuda10.2_g04aa30cf11_157    rapidsai-nightly
libcuml                   0.18.0a210110   cuda10.2_gc021decd9_70    rapidsai-nightly
libcumlprims              0.18.0a201203   cuda10.2_gff080f3_0    rapidsai-nightly
librmm                    0.18.0a210110   cuda10.2_g94b083a_20    rapidsai-nightly
rmm                       0.18.0a210110   cuda_10.2_py37_g94b083a_20    rapidsai-nightly
ucx                       1.8.1+g6b29558       cuda10.2_0    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
ucx-py                    0.18.0a210110   py37_g6b29558_10    rapidsai-nightly
@beckernick beckernick added bug Something isn't working Python Affects Python cuDF API. dask Dask issue labels Jan 11, 2021
@beckernick
Copy link
Member Author

beckernick commented Jan 11, 2021

Looks like there's a lot of enhancement going on for cuDF.dtype in #6160 and some of it touches is_categorical_dtype. It looks to me like we'll still want to generalize this dtype checking in upstream Dask (as cuDF categoricals will not pass the pandas is_categorical_dtype check). @brandon-b-miller does that sound correct to you?

EDIT: It looks like this might be handled already, by CategoricalDtype inheriting from Generic which inherits from pandas' ExtensionDtype.

@brandon-b-miller
Copy link
Contributor

Hey @beckernick , I put some thoughts in the other dask thread.

@harrism
Copy link
Member

harrism commented Jan 13, 2021

Pardon the drive-by comment, but while you are fixing this bug, consider renaming the function to get_placeholders or something non-ableist. https://developers.google.com/style/inclusive-documentation#ableist-language

@github-actions
Copy link

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@github-actions github-actions bot added the stale label Feb 16, 2021
@beckernick
Copy link
Member Author

Collecting from the related Dask issue dask/dask#7054 , it seems like there are several items of interest.

  • In terms of immediate blockers, pd.api.types.is_categorical_dtype works correctly for cudf.CategoricalDtype but not for Series of category type. This likely is the root cause of this issue. I will file a separate issue for this individual bug and cross-link it.
  • @brandon-b-miller mentioned that some broader generalizations around dtype checking may be desirable with the ongoing cudf.dtype work. We should evaluate if these are blocking any aspects of the dtype work and re-open the discussion.
  • We should update the naming of this feature to something non-ableist. I would recommend we use get_indicators, as a reference to how these variables are often called indicator variables in statistics. This move should ideally be done in concert with the pandas and Dask community, which is discussing the topic in QST: "Dummy" is rooted in ableist language pandas-dev/pandas#35724 .

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@brandon-b-miller
Copy link
Contributor

I have been looking into this a bit, and from what I can tell, the change to fix this could be handled in any of Dask, Pandas, or cuDF (which is part of what I think makes it hard to figure out where to even discuss this).

The issue is that dask wants to use pd.api.types.is_categorical_dtype to identify Series objects of categorical dtype. pd.api.types.is_categorical_dtype returns True both for pd.CategoricalDtype objects, and cudf.CategorcicalDtype objects, because both are instances of pd.ExtensionDtype named category, due to the early return here:

>>> pd.api.types.is_categorical_dtype(pd.CategoricalDtype())
True
>>> pd.api.types.is_categorical_dtype(cudf.CategoricalDtype())
True

pd.api.types.is_categorical_dtype also returns True for pandas data structures with a categorical dtype as well, but not for the equivalent cuDF data structures. This is the place in pandas where the handling of cuDF vs pandas objects diverges, and it happens because even though cudf.CategoricalDtype objects inherit from pd.ExtensionDtype, they don't inherit from pd.CategoricalDtype, directly, and so they don't pass this check.

With that in mind then, I think these are the possible solutions:

  • cudf.CategoricalDtype is changed to inherit from pd.CategoricalDtype, at which point cuDF objects pass the existing check
  • pd.api.types.is_categorical_dtype is updated to getattr the dtype of the incoming object and then calls itself again on the result of that
  • Dask is generalized to support cudf.utils.dtypes.is_categorical_dtype through some kind of dispatch mechanism, similar to https://github.com/dask/dask/pull/6444/files.

Does anyone see any other solutions, or is anyone preferential to any one of the ones listed here?

@shwina
Copy link
Contributor

shwina commented Mar 23, 2021

cudf.CategoricalDtype is changed to inherit from pd.CategoricalDtype, at which point cuDF objects pass the existing check

This should be the least preferred option. A cuDF.CategoricalDtype does not have an is-a relationship with pd.CategoricalDtype. Inheritance would suggest that you can pass a cudf.CategoricalDtype to a function expecting a pd.CategoricalDtype and that's certainly not the case.

Dask is generalized to support cudf.utils.dtypes.is_categorical_dtype through some kind of dispatch mechanism, similar to https://github.com/dask/dask/pull/6444/files.

This sounds like the right approach perhaps, where cuDF's is_categorical_dtype checks if something is a cudf.CategoricalDtype, and Pandas' is_categorical_dtype checks if something is a pd.CategoricalDtype.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working dask Dask issue Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants