Series apply method backed by masked UDFs #9217

Merged on Oct 1, 2021

Commits (60)
2955bb3
initial
brandon-b-miller Aug 25, 2021
8d98b73
impl
brandon-b-miller Aug 30, 2021
45c47ee
purge c++ code
brandon-b-miller Aug 30, 2021
c2e0218
enable cuda 11.0
brandon-b-miller Aug 30, 2021
69f92cf
enable tests for __pow__
brandon-b-miller Aug 30, 2021
b636af9
solve multiple problems
brandon-b-miller Aug 31, 2021
bdf5823
masks are required for entry
brandon-b-miller Sep 2, 2021
470d25e
support returning a single number
brandon-b-miller Sep 3, 2021
e271ce4
formatting
brandon-b-miller Sep 3, 2021
0a971f0
bugfix
brandon-b-miller Sep 3, 2021
2f7e6f8
remove header
brandon-b-miller Sep 3, 2021
13a94cb
fix bool typing
brandon-b-miller Sep 3, 2021
b2a68e6
template kernels
brandon-b-miller Sep 3, 2021
5cb75e7
switch back to forall
brandon-b-miller Sep 3, 2021
49d9978
implement construct_signature
brandon-b-miller Sep 3, 2021
2ba8bd2
support offsets
brandon-b-miller Sep 3, 2021
11b2fd1
cache kernels
brandon-b-miller Sep 3, 2021
7379fe1
merge latest
brandon-b-miller Sep 7, 2021
775dd57
style
brandon-b-miller Sep 7, 2021
04c38e6
skip cases where pandas null logic differs
brandon-b-miller Sep 8, 2021
7a01bdb
style
brandon-b-miller Sep 8, 2021
627d197
update tests slightly
brandon-b-miller Sep 8, 2021
d3e2e0b
updates to pipeline.py
brandon-b-miller Sep 8, 2021
394fad3
Merge branch 'branch-21.10' into fea-masked-udf-pure-python
brandon-b-miller Sep 10, 2021
786c283
merge masked udf python only branch and resolve conflicts
brandon-b-miller Sep 10, 2021
69aec24
plumbing
brandon-b-miller Sep 10, 2021
306f5e1
address many reviews
brandon-b-miller Sep 13, 2021
05adec7
cleanup
brandon-b-miller Sep 13, 2021
edbae6c
minor updtes
brandon-b-miller Sep 13, 2021
6442642
merge latest from other branch
brandon-b-miller Sep 13, 2021
e224bee
Apply suggestions from code review
brandon-b-miller Sep 14, 2021
a446b75
address reviews
brandon-b-miller Sep 14, 2021
7fd55a6
Merge branch 'fea-masked-udf-pure-python' into fea-series-apply
brandon-b-miller Sep 14, 2021
b54e11e
remove creating buffers if the column has no mask
brandon-b-miller Sep 14, 2021
16406ff
put buffer back in for blank mask for now
brandon-b-miller Sep 14, 2021
0ce663e
Merge branch 'fea-masked-udf-pure-python' into fea-series-apply
brandon-b-miller Sep 14, 2021
ba2d898
merge latest and resolve conflicts
brandon-b-miller Sep 16, 2021
e08ebc5
Merge branch 'fea-masked-udf-pure-python' into fea-series-apply
brandon-b-miller Sep 16, 2021
30d6013
fix import bug
brandon-b-miller Sep 17, 2021
a369641
clarify exec context
brandon-b-miller Sep 17, 2021
e51d780
Merge branch 'branch-21.10' into fea-masked-udf-pure-python
brandon-b-miller Sep 17, 2021
54b3fca
Merge branch 'fea-masked-udf-pure-python' into fea-series-apply
brandon-b-miller Sep 17, 2021
3c0c76f
rework unmasked kernels slightly
brandon-b-miller Sep 17, 2021
6deb96a
un purge c++
brandon-b-miller Sep 21, 2021
51b4fc9
cpp cleanup
brandon-b-miller Sep 21, 2021
8a9001b
Merge branch 'fea-masked-udf-pure-python' into fea-series-apply
brandon-b-miller Sep 21, 2021
0b532e2
account for differing logic vs pandas cases
brandon-b-miller Sep 21, 2021
d37ef4b
docs and style
brandon-b-miller Sep 21, 2021
4249334
Merge branch 'branch-21.10' into fea-masked-udf-pure-python
brandon-b-miller Sep 21, 2021
226a24d
Merge branch 'fea-masked-udf-pure-python' into fea-series-apply
brandon-b-miller Sep 21, 2021
9f3c60e
Merge branch 'branch-21.10' into fea-masked-udf-pure-python
brandon-b-miller Sep 23, 2021
9be588d
Merge branch 'fea-masked-udf-pure-python' into fea-series-apply
brandon-b-miller Sep 23, 2021
6dc7677
cleanup
brandon-b-miller Sep 23, 2021
71c71b8
address reviews
brandon-b-miller Sep 28, 2021
71bce8e
merge latest
brandon-b-miller Sep 28, 2021
b0580e9
Merge branch 'branch-21.12' into fea-masked-udf-pure-python
brandon-b-miller Sep 29, 2021
6df9f25
Merge branch 'fea-masked-udf-pure-python' into fea-series-apply
brandon-b-miller Sep 29, 2021
21a7af5
merge, remove frame.dtypes temporary property
brandon-b-miller Sep 29, 2021
68fc62b
merge 21.12
brandon-b-miller Sep 29, 2021
232beb4
style
brandon-b-miller Sep 29, 2021
88 changes: 88 additions & 0 deletions python/cudf/cudf/core/series.py
@@ -3419,6 +3419,94 @@ def _return_sentinel_series():
         return codes

     # UDF related
+    def apply(self, func, convert_dtype=True, args=(), **kwargs):
+        """
+        Apply a scalar function to the values of a Series.
+
+        Similar to `pandas.Series.apply`. Applies a
+        user-defined function elementwise over a series.
+
+        Parameters
+        ----------
+        func : function
+            Scalar Python function to apply.
+        convert_dtype : bool, default True
+            In cuDF, this parameter is always True. Because
+            cuDF does not support arbitrary object dtypes,
+            the result will always be the common type as determined
+            by numba based on the function logic and argument types.
+            See examples for details.
+        args : tuple
+            Not supported
+        **kwargs
+            Not supported
+
+        Notes
+        -----
+        UDFs are cached in memory to avoid recompilation. The first
+        call to the UDF will incur compilation overhead.
+
+        Examples
+        --------
+
+        Apply a basic function to a series
+        >>> sr = cudf.Series([1,2,3])
+        >>> def f(x):
+        ...     return x + 1
+        >>> sr.apply(f)
+        0    2
+        1    3
+        2    4
+        dtype: int64
+
+        Apply a basic function to a series with nulls
+        >>> sr = cudf.Series([1,cudf.NA,3])
+        >>> def f(x):
+        ...     return x + 1
+        >>> sr.apply(f)
+        0       2
+        1    <NA>
+        2       4
+        dtype: int64
+
+        Use a function that does something conditionally,
+        based on whether the value is null
+        >>> sr = cudf.Series([1,cudf.NA,3])
+        >>> def f(x):
+        ...     if x is cudf.NA:
+        ...         return 42
+        ...     else:
+        ...         return x - 1
+        >>> sr.apply(f)
+        0     0
+        1    42
+        2     2
+        dtype: int64
+
+        Results will be upcast to the common dtype required
+        as derived from the UDF's logic. Note that this means
+        the common type will be returned even if the data
+        passed in would not produce any values of that
+        dtype.
+
+        >>> sr = cudf.Series([1,cudf.NA,3])
+        >>> def f(x):
+        ...     return x + 1.5
+        >>> sr.apply(f)
+        0     2.5
+        1    <NA>
+        2     4.5
+        dtype: float64
+
+
+
+        """
+        if args or kwargs:
+            raise ValueError(
+                "UDFs using *args or **kwargs are not yet supported."
+            )
+
+        return super()._apply(func)

     def applymap(self, udf, out_dtype=None):
         """Apply an elementwise function to transform the values in the Column.
7 changes: 4 additions & 3 deletions python/cudf/cudf/core/udf/pipeline.py
@@ -130,7 +130,6 @@ def _kernel(retval, {input_columns}, {input_offsets}, size):
 def _define_function(df, scalar_return=False):
     # Create argument list for kernel
     input_columns = ", ".join([f"input_col_{i}" for i in range(len(df._data))])
-
     input_offsets = ", ".join([f"offset_{i}" for i in range(len(df._data))])

     # Create argument list to pass to device function
@@ -177,15 +176,17 @@ def compile_or_get(df, f):
     """

     # check to see if we already compiled this function
+    frame_dtypes = tuple(col.dtype for col in df._data.values())
     cache_key = (
-        *cudautils.make_cache_key(f, tuple(df.dtypes)),
+        *cudautils.make_cache_key(f, frame_dtypes),
         *(col.mask is None for col in df._data.values()),
     )
     if precompiled.get(cache_key) is not None:
         kernel, scalar_return_type = precompiled[cache_key]
         return kernel, scalar_return_type

-    numba_return_type = get_udf_return_type(f, df.dtypes)
+    numba_return_type = get_udf_return_type(f, frame_dtypes)
+
     _is_scalar_return = not isinstance(numba_return_type, MaskedType)
     scalar_return_type = (
         numba_return_type
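The caching pattern in this hunk keys each compiled kernel on the function, the column dtypes, and one flag per column recording whether it has no mask. Presumably the flag is there because masked and unmasked columns need different kernels. A minimal plain-Python sketch of that idea follows. Everything here is hypothetical: `make_key`, `compile_or_get_sketch`, `compiled_cache`, and `compile_count` are illustration names, and identifying the function by its bytecode and constants is an assumption, not a statement about what `cudautils.make_cache_key` does.

```python
compiled_cache = {}  # hypothetical stand-in for the `precompiled` dict
compile_count = 0    # counts how often the "expensive" step runs


def make_key(func, dtypes, has_no_mask):
    # Assumption: identify the function by its bytecode and constants.
    code = func.__code__
    return (code.co_code, code.co_consts, tuple(dtypes), tuple(has_no_mask))


def compile_or_get_sketch(func, dtypes, has_no_mask):
    global compile_count
    key = make_key(func, dtypes, has_no_mask)
    if key not in compiled_cache:
        compile_count += 1  # stands in for JIT compilation
        compiled_cache[key] = ("kernel", key)
    return compiled_cache[key]


def f(x):
    return x + 1


compile_or_get_sketch(f, ["int64"], [True])    # first call: compiles
compile_or_get_sketch(f, ["int64"], [True])    # same key: cache hit
compile_or_get_sketch(f, ["int64"], [False])   # column now has a mask: compiles
compile_or_get_sketch(f, ["float64"], [True])  # different dtype: compiles
print(compile_count)  # 3
```

Computing the dtype tuple once, as the hunk does with `frame_dtypes`, keeps the cache key and the type-inference call working from the same value.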
116 changes: 116 additions & 0 deletions python/cudf/cudf/tests/test_udf_masked_ops.py
@@ -40,6 +40,15 @@ def run_masked_udf_test(func_pdf, func_gdf, data, **kwargs):
     assert_eq(expect, obtain, **kwargs)


+def run_masked_udf_series(func_psr, func_gsr, data, **kwargs):
+    gsr = data
+    psr = data.to_pandas(nullable=True)
+
+    expect = psr.apply(func_psr)
+    obtain = gsr.apply(func_gsr)
+    assert_eq(expect, obtain, **kwargs)
+
+
 @pytest.mark.parametrize("op", arith_ops)
 def test_arith_masked_vs_masked(op):
     # This test should test all the typing
@@ -314,3 +323,110 @@ def func_gdf(w, x, y, z):
         }
     )
     run_masked_udf_test(func_pdf, func_gdf, gdf, check_dtype=False)
+
+
+###
+
+
+@pytest.mark.parametrize(
+    "data", [cudf.Series([1, 2, 3]), cudf.Series([1, cudf.NA, 3])]
+)
+def test_series_apply_basic(data):
+    def func(x):
+        return x + 1
+
+    run_masked_udf_series(func, func, data, check_dtype=False)
+
+
+def test_series_apply_null_conditional():
+    def func_pdf(x):
+        if x is pd.NA:
+            return 42
+        else:
+            return x - 1
+
+    def func_gdf(x):
+        if x is cudf.NA:
+            return 42
+        else:
+            return x - 1
+
+    data = cudf.Series([1, cudf.NA, 3])
+
+    run_masked_udf_series(func_pdf, func_gdf, data)
+
+
+###
+
+
+@pytest.mark.parametrize("op", arith_ops)
+def test_series_arith_masked_vs_masked(op):
+    def func(x):
+        return op(x, x)
+
+    data = cudf.Series([1, cudf.NA, 3])
+    run_masked_udf_series(func, func, data, check_dtype=False)
+
+
+@pytest.mark.parametrize("op", comparison_ops)
+def test_series_compare_masked_vs_masked(op):
+    """
+    In the series case, only one other MaskedType to compare with
+    - itself
+    """
+
+    def func(x):
+        return op(x, x)
+
+    data = cudf.Series([1, cudf.NA, 3])
+    run_masked_udf_series(func, func, data, check_dtype=False)
+
+
+@pytest.mark.parametrize("op", arith_ops)
+@pytest.mark.parametrize("constant", [1, 1.5, cudf.NA])
+def test_series_arith_masked_vs_constant(op, constant):
+    def func(x):
+        return op(x, constant)
+
+    # Just a single column -> result will be all NA
+    data = cudf.Series([1, 2, cudf.NA])
+    if constant is cudf.NA and op is operator.pow:
+        # in pandas, 1**NA == 1. In cudf, 1**NA is NA.
+        with pytest.xfail():
+            run_masked_udf_series(func, func, data, check_dtype=False)
+        return
+    run_masked_udf_series(func, func, data, check_dtype=False)
+
+
+@pytest.mark.parametrize("op", arith_ops)
+@pytest.mark.parametrize("constant", [1, 1.5, cudf.NA])
+def test_series_arith_masked_vs_constant_reflected(op, constant):
+    def func(x):
+        return op(constant, x)
+
+    # Just a single column -> result will be all NA
+    data = cudf.Series([1, 2, cudf.NA])
+    if constant is not cudf.NA and constant == 1 and op is operator.pow:
+        # in pandas, 1**NA == 1. In cudf, 1**NA is NA.
+        with pytest.xfail():
+            run_masked_udf_series(func, func, data, check_dtype=False)
+        return
+    run_masked_udf_series(func, func, data, check_dtype=False)
+
+
+def test_series_masked_is_null_conditional():
+    def func_psr(x):
+        if x is pd.NA:
+            return 42
+        else:
+            return x
+
+    def func_gsr(x):
+        if x is cudf.NA:
+            return 42
+        else:
+            return x
+
+    data = cudf.Series([1, cudf.NA, 3, cudf.NA])
+
+    run_masked_udf_series(func_psr, func_gsr, data, check_dtype=False)
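
The two `operator.pow` cases marked as expected failures above come from a difference in null rules. pandas documents that `1 ** pd.NA` and `pd.NA ** 0` both evaluate to 1, whereas the xfail branches imply that the masked UDF path returns a null whenever either operand is null. The sketch below contrasts the two rules in plain Python, using `None` as the null. `pow_propagate` and `pow_pandas_like` are hypothetical helpers written for illustration and are not part of cuDF or pandas.

```python
def pow_propagate(base, exp):
    """Null in, null out: a null operand always gives a null result."""
    if base is None or exp is None:
        return None
    return base ** exp


def pow_pandas_like(base, exp):
    """Apply the two pandas.NA shortcuts before propagating nulls."""
    if base is not None and base == 1:
        return base  # 1 ** anything, including a null, is 1
    if exp is not None and exp == 0:
        return 1     # anything, including a null, ** 0 is 1
    return pow_propagate(base, exp)


data = [1, 2, None]

# x ** NA, as in test_series_arith_masked_vs_constant
print([pow_propagate(x, None) for x in data])    # [None, None, None]
print([pow_pandas_like(x, None) for x in data])  # [1, None, None]

# 1 ** x, as in test_series_arith_masked_vs_constant_reflected
print([pow_propagate(1, x) for x in data])       # [1, 1, None]
print([pow_pandas_like(1, x) for x in data])     # [1, 1, 1]
```

In both cases the two rules disagree on exactly one row, which is enough for `assert_eq` to fail and explains why those parameter combinations are singled out.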