Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cudf.DataFrame.applymap #10542

Merged

Conversation

brandon-b-miller
Copy link
Contributor

Naive implementation of DataFrame.applymap that just calls apply in a loop over columns.

This could theoretically be made much faster within our framework. This requires at worst N compilations and M kernel launches, where N is the number of different dtypes in the data, and M is the number of total columns. We could however as an improvement to this launch just one kernel that populates the entire output data. This would still suffer from the compilation bottleneck however, since the function must be compiled in order for an output dtype to be determined, and this will need to be done for each distinct dtype within the data.

Part of #10169

@brandon-b-miller brandon-b-miller added feature request New feature or request 2 - In Progress Currently a work in progress numba Numba issue Python Affects Python cuDF API. dask Dask issue non-breaking Non-breaking change labels Mar 30, 2022
@brandon-b-miller brandon-b-miller requested review from a team as code owners March 30, 2022 16:33
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small suggestions.

python/cudf/cudf/core/dataframe.py Show resolved Hide resolved
python/cudf/cudf/core/dataframe.py Show resolved Hide resolved
func : callable
Python function, returns a single value from a single value.
na_action : {None, 'ignore'}, default None
If ``ignore``, propagate NaN values, without passing them to func.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use quotes here, not code font.

Suggested change
If ``ignore``, propagate NaN values, without passing them to func.
If 'ignore', propagate NaN values, without passing them to func.

"""

if kwargs:
raise ValueError(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we usually raise NotImplementedError for this kind of thing, and ValueError for invalid values (like na_action not in {"ignore", None} below).

Suggested change
raise ValueError(
raise NotImplementedError(

@codecov
Copy link

codecov bot commented Apr 6, 2022

Codecov Report

Merging #10542 (e1d444c) into branch-22.06 (3c13ef1) will increase coverage by 0.04%.
The diff coverage is 98.14%.

@@               Coverage Diff                @@
##           branch-22.06   #10542      +/-   ##
================================================
+ Coverage         86.33%   86.38%   +0.04%     
================================================
  Files               140      142       +2     
  Lines             22289    22338      +49     
================================================
+ Hits              19244    19296      +52     
+ Misses             3045     3042       -3     
Impacted Files Coverage Δ
python/cudf/cudf/core/algorithms.py 90.47% <ø> (ø)
python/cudf/cudf/core/groupby/groupby.py 91.72% <ø> (+0.22%) ⬆️
python/cudf/cudf/core/multiindex.py 92.14% <ø> (ø)
python/cudf/cudf/core/series.py 95.28% <ø> (ø)
python/cudf/cudf/core/single_column_frame.py 96.52% <ø> (+0.07%) ⬆️
python/cudf/cudf/utils/cudautils.py 59.83% <ø> (ø)
python/cudf/cudf/utils/utils.py 90.28% <ø> (ø)
python/dask_cudf/dask_cudf/tests/utils.py 90.90% <90.90%> (ø)
python/cudf/cudf/core/column/lists.py 92.79% <100.00%> (+1.38%) ⬆️
python/cudf/cudf/core/dataframe.py 93.69% <100.00%> (+0.10%) ⬆️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9e8e92c...e1d444c. Read the comment docs.

@brandon-b-miller brandon-b-miller added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Apr 7, 2022
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few minor (non-blocking) suggestions. This looks good overall!

python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_applymap.py Outdated Show resolved Hide resolved
python/dask_cudf/dask_cudf/tests/test_applymap.py Outdated Show resolved Hide resolved
df = pd.DataFrame({"x": [], "y": []})
gdf = cudf.DataFrame.from_pandas(df)
dgf = dd.from_pandas(gdf, npartitions=npartitions)
return dgf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems strange that this only returns dgf while _make_random_frame and _make_random_frame_float return df, dgf. Should we symmetrize this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I probably shouldn't have moved this function in the first place since it's being consumed elsewhere and not actually used in my tests. I just moved it back for now.

@brandon-b-miller
Copy link
Contributor Author

@gpucibot merge

@brandon-b-miller
Copy link
Contributor Author

oops, needs dask review.

@@ -3718,6 +3720,68 @@ def apply(

return self._apply(func, _get_row_kernel, *args, **kwargs)

def applymap(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


from dask import dataframe as dd

from .utils import _make_random_frame
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do an absolute import here instead of a relative import so that it is consistent with other imports here and elsewhere in the code-base?

@@ -8,6 +10,8 @@

import cudf

from .utils import _make_random_frame
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here aswell

@rapids-bot rapids-bot bot merged commit ce56bc3 into rapidsai:branch-22.06 Apr 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team dask Dask issue feature request New feature or request non-breaking Non-breaking change numba Numba issue Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants