Add cudf.DataFrame.applymap
#10542
A few small suggestions.
python/cudf/cudf/core/dataframe.py
Outdated
func : callable
    Python function, returns a single value from a single value.
na_action : {None, 'ignore'}, default None
    If ``ignore``, propagate NaN values, without passing them to func.
Use quotes here, not code font.
-    If ``ignore``, propagate NaN values, without passing them to func.
+    If 'ignore', propagate NaN values, without passing them to func.
python/cudf/cudf/core/dataframe.py
Outdated
"""

if kwargs:
    raise ValueError(
I think we usually raise NotImplementedError for this kind of thing, and ValueError for invalid values (like na_action not in {"ignore", None} below).
-    raise ValueError(
+    raise NotImplementedError(
Codecov Report
@@ Coverage Diff @@
## branch-22.06 #10542 +/- ##
================================================
+ Coverage 86.33% 86.38% +0.04%
================================================
Files 140 142 +2
Lines 22289 22338 +49
================================================
+ Hits 19244 19296 +52
+ Misses 3045 3042 -3
I have a few minor (non-blocking) suggestions. This looks good overall!
df = pd.DataFrame({"x": [], "y": []})
gdf = cudf.DataFrame.from_pandas(df)
dgf = dd.from_pandas(gdf, npartitions=npartitions)
return dgf
Seems strange that this only returns dgf while _make_random_frame and _make_random_frame_float return df, dgf. Should we symmetrize this?
Sorry, I probably shouldn't have moved this function in the first place since it's being consumed elsewhere and not actually used in my tests. I just moved it back for now.
@gpucibot merge
oops, needs dask review.
@@ -3718,6 +3720,68 @@ def apply(

        return self._apply(func, _get_row_kernel, *args, **kwargs)

    def applymap(
Could you also add this entry to this section of docs: https://github.com/rapidsai/cudf/blob/branch-22.06/docs/cudf/source/api_docs/dataframe.rst#function-application-groupby--window
from dask import dataframe as dd

from .utils import _make_random_frame
Can we do an absolute import here instead of a relative import so that it is consistent with other imports here and elsewhere in the code-base?
@@ -8,6 +10,8 @@

import cudf

from .utils import _make_random_frame
Here as well.
Naive implementation of DataFrame.applymap that just calls apply in a loop over columns.

This could theoretically be made much faster within our framework. The current approach requires at worst N compilations and M kernel launches, where N is the number of different dtypes in the data and M is the total number of columns. As an improvement, we could instead launch just one kernel that populates the entire output. That would still suffer from the compilation bottleneck, however, since the function must be compiled in order for an output dtype to be determined, and this needs to be done for each distinct dtype within the data.

Part of #10169
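The cost model in the description can be illustrated with a small pure-Python sketch. It is not the cuDF implementation: the `columns` dict stands in for a DataFrame, `applymap_sketch` is a hypothetical name, and "compilation" and "kernel launch" are only counted, not performed.

```python
import math


def applymap_sketch(columns, func, na_action=None):
    """Apply ``func`` elementwise, one column at a time.

    ``columns`` maps a column name to ``(dtype_name, values)``; this layout
    is an assumption of the sketch, not a cuDF structure.
    Returns ``(result_columns, stats)``.
    """
    compiled_for = set()  # dtypes for which func would have been compiled
    launches = 0          # one simulated kernel launch per column
    result = {}
    for name, (dtype, values) in columns.items():
        # At worst one compilation per distinct dtype (N in the description).
        compiled_for.add(dtype)
        # One launch per column (M in the description).
        launches += 1
        out = []
        for v in values:
            is_na = v is None or (isinstance(v, float) and math.isnan(v))
            # With na_action='ignore', missing values bypass func entirely.
            if is_na and na_action == "ignore":
                out.append(v)
            else:
                out.append(func(v))
        result[name] = out
    stats = {"compilations": len(compiled_for), "launches": launches}
    return result, stats


cols = {
    "a": ("int64", [1, 2, 3]),
    "b": ("int64", [4, 5, 6]),
    "c": ("float64", [0.5, None, 1.5]),
}
out, stats = applymap_sketch(cols, lambda x: x + 1, na_action="ignore")
# Two distinct dtypes and three columns give N = 2 and M = 3.
```

A single-kernel variant would bring the launch count down to one, but the compilation count would stay at N, which is the bottleneck the description points out.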