Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support args= in Series.apply #9982

Merged

Conversation

brandon-b-miller
Copy link
Contributor

@brandon-b-miller brandon-b-miller commented Jan 6, 2022

Closes #9598

A lot of code was moved around but also slightly tweaked, making the diff a little harder to parse. Here's a summary of the changes:

  • Series.apply used to simply turn the incoming scalar lambda function into a row UDF and then turn itself into a dataframe and run the code as normal. Now, it does its own separate unique processing and pipes through Frame._apply instead.
  • pipeline.py was separated out into row_function.py and lambda_function.py which contain whatever is specific to each type of UDF, whereas everything that was common to both was migrated to utils.py and generalized as much as possible.
  • a templates.py area was created to hold all the templates and initializers needed to cat together the kernel that we need and a new template specific to series lambdas was created.
  • The caching machinery was abstracted out into compile_or_get and this function now expects a python function object it can call that will produce the right kernel. DataFrame and Series decide which one to use at the top level API.
  • Moved _apply from Frame to IndexedFrame

@github-actions github-actions bot added the Python Affects Python cuDF API. label Jan 6, 2022
@vyasr
Copy link
Contributor

vyasr commented Jan 6, 2022

@brandon-b-miller I've been looking at this code a bit recently and we discussed some of this refactoring so feel free to explicitly request me whenever you convert this from a draft.

@brandon-b-miller brandon-b-miller marked this pull request as ready for review January 7, 2022 21:51
@brandon-b-miller brandon-b-miller requested a review from a team as a code owner January 7, 2022 21:51
@brandon-b-miller brandon-b-miller added feature request New feature or request non-breaking Non-breaking change numba Numba issue 2 - In Progress Currently a work in progress labels Jan 7, 2022
@brandon-b-miller
Copy link
Contributor Author

I think this is getting there - cc @vyasr

@brandon-b-miller brandon-b-miller removed the 2 - In Progress Currently a work in progress label Jan 11, 2022
@brandon-b-miller
Copy link
Contributor Author

rerun tests

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question (out of scope for this PR). Is it possible to enable users passing in kernels that they've already compiled, or is that too difficult? And if it is possible, should we at least throw a more friendly error if a user tries that? I remember testing this once and being surprised by the behavior.

python/cudf/cudf/core/indexed_frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/series.py Show resolved Hide resolved
python/cudf/cudf/core/udf/row_function.py Show resolved Hide resolved
python/cudf/cudf/core/udf/row_function.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/udf/row_function.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/udf/scalar_function.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/udf/scalar_function.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/udf/templates.py Show resolved Hide resolved
python/cudf/cudf/core/udf/utils.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/udf/utils.py Outdated Show resolved Hide resolved
@shwina shwina changed the base branch from branch-22.02 to branch-22.04 January 20, 2022 21:23
@codecov
Copy link

codecov bot commented Jan 24, 2022

Codecov Report

Merging #9982 (0af638f) into branch-22.04 (e24fa8f) will increase coverage by 0.03%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.04    #9982      +/-   ##
================================================
+ Coverage         10.37%   10.41%   +0.03%     
================================================
  Files               119      122       +3     
  Lines             20149    20629     +480     
================================================
+ Hits               2091     2148      +57     
- Misses            18058    18481     +423     
Impacted Files Coverage Δ
python/cudf/cudf/errors.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
... and 64 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3e2474d...0af638f. Read the comment docs.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple of small questions, but I think this looks mostly good from my end. Happy to move this forward once other reviewers are happy.

python/cudf/cudf/core/indexed_frame.py Outdated Show resolved Hide resolved
offsets.append(col.offset)
launch_args += offsets
launch_args += list(args)
kernel.forall(len(self))(*launch_args)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we always generate a kernel with len(self) tasks, do we really need to pass len(self) as one of the launch_args? AFAICT that's just used to avoid out of bounds accesses, but it looks like we always launch a grid with one thread per row right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had taken the numba docs for forall to mean that inserting this check was a requirement for kernels that expect to be configured this way. Indeed, taking it out leads to numerous tests failing due to nulls being in the wrong places everywhere. My assumption was that something was happening inside forall that caused unpredictable behavior if this guard was not included. cc @gmarkall for more insight.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess maybe numba must be doing something where it generates only a limited set of kernels (maybe templates?) based on the block size and then dispatches to the closest size possible based on the argument to forall? I would be curious to learn more about how this works from @gmarkall. It sounds like you should ignore my suggestion to change anything here though.

Copy link
Contributor

@isVoid isVoid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stellar work. I feel like reading a prose with current naming of functions.

python/cudf/cudf/core/udf/utils.py Show resolved Hide resolved
@brandon-b-miller
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 896564a into rapidsai:branch-22.04 Jan 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change numba Numba issue Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support args= in cudf.Series.apply
3 participants