Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Dask_cudf alters the metadata processing on CPU dask dataframe apply if metadata is passed #7946

Closed
beckernick opened this issue Apr 13, 2021 · 1 comment · Fixed by #8342
Assignees
Labels
bug Something isn't working dask Dask issue Python Affects Python cuDF API.

Comments

@beckernick
Copy link
Member

beckernick commented Apr 13, 2021

When dask_cudf is imported and a user calls apply on a pandas backed dask dataframe, dask-cudf alters the metadata creation step to use cudf if metadata is supplied. This can cause confusing downstream errors as the user will unexpectedly be operating on the GPU. If metadata is not explicitly supplied, Dask will continue to use pandas as expected. This does not happen if dask_cudf is not imported.

import pandas as pd
import dask.dataframe as dd
import dask_cudfdf = pd.DataFrame({'a': [3,4], 'b': [1, 2]})
ddf = dd.from_pandas(df, npartitions=1)
emb = ddf['a'].apply(pd.Series, meta={'c0': 'int64', 'c1': 'int64'})
print(type(emb._meta))
print(type(emb))
<class 'cudf.core.dataframe.DataFrame'>
<class 'dask_cudf.core.DataFrame'>
import pandas as pd
import dask.dataframe as dd
import dask_cudfdf = pd.DataFrame({'a': [3,4], 'b': [1, 2]})
ddf = dd.from_pandas(df, npartitions=1)
emb = ddf['a'].apply(pd.Series)
print(type(emb._meta))
print(type(emb))
<class 'pandas.core.frame.DataFrame'>
<class 'dask.dataframe.core.DataFrame'>
/raid/nicholasb/miniconda3/envs/rapids-gpubdb-20210413/lib/python3.8/site-packages/dask/dataframe/core.py:3519: UserWarning: 
You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta={0: 'int64'})

  warnings.warn(meta_warning(meta))
conda list | grep "rapids\|dask\|pandas\|arrow\|numpy\|scipy"
# packages in environment at /raid/nicholasb/miniconda3/envs/rapids-gpubdb-20210413:
arrow-cpp                 1.0.1           py38h9018dff_36_cuda    conda-forge
arrow-cpp-proc            3.0.0                      cuda    conda-forge
cudf                      0.20.0a210413   cuda_11.2_py38_gd6479a20d8_137    rapidsai-nightly
cuml                      0.20.0a210413   cuda11.2_py38_g5f61a3519_74    rapidsai-nightly
dask                      2021.4.0           pyhd8ed1ab_0    conda-forge
dask-core                 2021.4.0           pyhd8ed1ab_0    conda-forge
dask-cuda                 0.20.0a210413            py38_9    rapidsai-nightly
dask-cudf                 0.20.0a210413   py38_gd6479a20d8_137    rapidsai-nightly
libcudf                   0.20.0a210413   cuda11.2_gd6479a20d8_137    rapidsai-nightly
libcuml                   0.20.0a210413   cuda11.2_g5f61a3519_74    rapidsai-nightly
libcumlprims              0.20.0a210408   cuda11.2_g7f19636_2    rapidsai-nightly
librmm                    0.20.0a210413   cuda11.2_g80bfeb2_13    rapidsai-nightly
numpy                     1.20.2           py38h9894fe3_0    conda-forge
pandas                    1.2.4            py38h1abd341_0    conda-forge
pyarrow                   1.0.1           py38hb53058b_36_cuda    conda-forge
rmm                       0.20.0a210413   cuda_11.2_py38_g80bfeb2_13    rapidsai-nightly
scipy                     1.6.2            py38h7b17777_0    conda-forge
ucx                       1.9.0+gcd9efd3       cuda11.2_0    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
ucx-py                    0.20.0a210413   py38_gcd9efd3_2    rapidsai-nightly

cc @viclafargue

@beckernick beckernick added bug Something isn't working Python Affects Python cuDF API. dask Dask issue labels Apr 13, 2021
@galipremsagar galipremsagar self-assigned this Apr 13, 2021
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

kkraus14 pushed a commit that referenced this issue May 26, 2021
Fixes: #7946 

This PR is dependent on upstream dask changes that are needed for a portion of the fix: https://github.com/dask/dask/pull/7586/files

This PR includes changes to introduce `make_meta_obj` which will ensure proper metadata is retrieved from the parent_meta being passed in the upstream PR. 

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - jakirkham (https://github.com/jakirkham)
  - Keith Kraus (https://github.com/kkraus14)

URL: #8342
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working dask Dask issue Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants