[FEA] Support args= in cudf.Series.apply #9598

Closed
brandon-b-miller opened this issue Nov 4, 2021 · 1 comment · Fixed by #9982
Labels: feature request (New feature or request), numba (Numba issue), Python (Affects Python cuDF API)

Comments

@brandon-b-miller
Contributor

Is your feature request related to a problem? Please describe.
As a follow-up to https://github.com/rapidsai/cudf/pull/9514/files, we should support functions that accept scalar (non-column) arguments in cudf.Series.apply, similar to pandas. Right now, cudf.Series.apply works by turning the series into a full dataframe and wrapping the incoming function as a row function in a lambda, as seen here. This works as long as the UDF accepts exactly one argument, but breaks down if we want args. Note that functions written for pandas.Series.apply are not row UDFs; they are written in scalar form:

def f(x):
    return x + 2

versus the row version, which would work on a single-column dataframe:

def f(row):
    x = row['x']
    return x + 2

Describe the solution you'd like
We want this to work, so we either need to:

  1. come up with a more general mechanism to transform the scalar UDF into a row UDF, then play the same game of promoting the series to a dataframe and forwarding to cudf.DataFrame.apply
  2. write a separate kernel that works for series and reuses as much of the row compilation machinery as possible.
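Option 1 can be sketched in plain Python as a small adapter. This is only an illustration of the idea, not cuDF code: `scalar_to_row_udf` is a hypothetical helper, a plain dict stands in for a dataframe row, and nothing here is compiled by Numba.

```python
def scalar_to_row_udf(scalar_udf, column_name, args=()):
    """Wrap a scalar UDF so it can be called with a row-like mapping.

    Hypothetical helper for illustration only; not part of cuDF.
    """
    def row_udf(row):
        # Pull the single value out of the row, then forward the extra args.
        return scalar_udf(row[column_name], *args)
    return row_udf


def f(x, c):
    return x + c


# Plain dicts stand in for the rows of a single-column dataframe.
rows = [{"x": 1}, {"x": 2}, {"x": 3}]
row_f = scalar_to_row_udf(f, "x", args=(42,))
result = [row_f(row) for row in rows]
```

Whether a wrapper like this survives compilation to a GPU kernel is exactly the open question behind option 1; the sketch only shows the calling convention it would have to bridge.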

Ultimately though we want to be able to use UDFs that look like this on Series objects:

def f(x, c):
    return x + c

sr.apply(f, args=(42,))

Describe alternatives you've considered
One can always promote the series to a single-column dataframe and write a row UDF instead as a workaround, but that is suboptimal and clumsy for the user.

Additional context
N/A

brandon-b-miller added the feature request, numba, and Python labels Nov 4, 2021
brandon-b-miller self-assigned this Nov 4, 2021
@github-actions

github-actions bot commented Dec 4, 2021

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Jan 28, 2022
Closes #9598

A lot of code was moved around but also slightly tweaked, making the diff a little harder to parse. Here's a summary of the changes:

- `Series.apply` used to turn the incoming scalar lambda function into a row UDF, convert itself into a dataframe, and run the code as normal. Now it does its own processing and pipes through `Frame._apply` instead.
- `pipeline.py` was split into `row_function.py` and `lambda_function.py`, which contain whatever is specific to each type of UDF. Everything common to both was migrated to `utils.py` and generalized as much as possible.
- A `templates.py` module was created to hold all the templates and initializers needed to concatenate the kernel we need, and a new template specific to series lambdas was added.
- The caching machinery was abstracted out into `compile_or_get`. This function now expects a Python function object it can call to produce the right kernel. `DataFrame` and `Series` decide which one to use at the top-level API.
- Moved `_apply` from `Frame` to `IndexedFrame`.

Authors:
  - https://github.com/brandon-b-miller

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Michael Wang (https://github.com/isVoid)

URL: #9982