
Implement groupby apply with JIT #11452

Merged (157 commits), Jan 27, 2023
Commits (157)
8db918f
Groupby Apply with JIT (First Commit)
bwyogatama Aug 3, 2022
2d6b4c9
Fix error in Pytest
bwyogatama Aug 4, 2022
fd8680e
JIT Caching Support
bwyogatama Aug 5, 2022
9220658
Add IdxMax and IdxMin
bwyogatama Aug 5, 2022
f4bc7c4
Add IdxMax and IdxMin
bwyogatama Aug 5, 2022
8659149
Dynamic Launch Parameter
bwyogatama Aug 14, 2022
b7ede43
Code cleanup #1
bwyogatama Aug 19, 2022
1dbbb77
Merge remote-tracking branch 'origin/branch-22.10' into fea-groupby-a…
vyasr Sep 23, 2022
f98fc63
Add support for building the JIT functions with the rest of the build.
vyasr Sep 23, 2022
11edd37
Make engine name consistent with tests
vyasr Sep 23, 2022
1e12416
Generalize compiled PTX selection for CUDA arch.
vyasr Sep 23, 2022
d348fb8
Cleanup of strings_udf PTX detection
vyasr Sep 23, 2022
795e580
Fix tests with some hacks so that we can start validating.
vyasr Sep 23, 2022
0ce0a90
Standardize the engine argument handling so that we get clear errors.
vyasr Sep 23, 2022
3493d49
Update style.
vyasr Sep 24, 2022
7f9ea1f
Refactoring C++ function
bwyogatama Oct 26, 2022
3d76a44
Fix bug in C++ and Python Cleanup
Nov 2, 2022
ad878ac
merge latest, resolve conflicts, pass tests
brandon-b-miller Nov 21, 2022
d876ad7
pass style, cleanup
brandon-b-miller Nov 21, 2022
f600196
start to move files
brandon-b-miller Nov 21, 2022
6cbdaf8
starting to refactor
brandon-b-miller Nov 22, 2022
3a11fe1
continue to refactor typing
brandon-b-miller Nov 28, 2022
8161548
move lowering to its own file
brandon-b-miller Nov 28, 2022
52656ab
continue refactoring idxmin and idxmax functions
brandon-b-miller Nov 28, 2022
b9096f3
add tests for idxmin and idxmax, not currently passing
brandon-b-miller Nov 28, 2022
d21a099
normalize call_cuda_functions keys
brandon-b-miller Nov 28, 2022
62aad1e
continued refactoring
brandon-b-miller Nov 28, 2022
9ff058a
refactoring lowering
brandon-b-miller Nov 28, 2022
5f07ca2
continued refactoring
brandon-b-miller Nov 29, 2022
38c3560
Merge branch 'branch-23.02' into groupby-apply-updates
brandon-b-miller Dec 5, 2022
c12a9e3
CMake changes
bwyogatama Jan 2, 2023
e650c21
C++ changes
bwyogatama Jan 2, 2023
0fd6e22
Merge branch 'branch-23.02' into fea-groupby-apply-jit
brandon-b-miller Jan 5, 2023
aabdc5f
Merge branch 'branch-23.02' into fea-groupby-apply-jit
brandon-b-miller Jan 6, 2023
301eea1
style
brandon-b-miller Jan 6, 2023
e50f4a6
found the bug
brandon-b-miller Jan 8, 2023
df1485d
minor refactoring
brandon-b-miller Jan 8, 2023
353078c
Merge branch 'branch-23.02' into fea-groupby-apply-jit
brandon-b-miller Jan 8, 2023
14fe3cb
update/add tests
brandon-b-miller Jan 9, 2023
f7791b4
continue refactoring tests
brandon-b-miller Jan 9, 2023
7f63c90
add docs, switch to partials
brandon-b-miller Jan 12, 2023
902223a
continue addressing reviews
brandon-b-miller Jan 17, 2023
78f8b6f
Update python/cudf/cudf/core/udf/groupby_typing.py
brandon-b-miller Jan 17, 2023
611b864
Merge branch 'fea-groupby-apply-jit' of github.com:bwyogatama/cudf in…
brandon-b-miller Jan 17, 2023
2849680
merge remote
brandon-b-miller Jan 17, 2023
33109f5
address more reviews
brandon-b-miller Jan 17, 2023
07444eb
move utilities around
brandon-b-miller Jan 17, 2023
39eb8f9
template throughout c++
brandon-b-miller Jan 18, 2023
f30cd8b
Merge branch 'branch-23.02' into fea-groupby-apply-jit
brandon-b-miller Jan 18, 2023
6158cb7
cpp code formatting
PointKernel Jan 18, 2023
93df707
Merge branch 'fea-groupby-apply-jit' of github.com:bwyogatama/cudf in…
brandon-b-miller Jan 18, 2023
5ae896a
style
brandon-b-miller Jan 18, 2023
a42d307
Use cudf size_type
PointKernel Jan 18, 2023
0e0b750
Use std limits instead of macros
PointKernel Jan 18, 2023
e891e5f
merge latest, resolve conflicts, don't unset warnings
brandon-b-miller Jan 19, 2023
8188508
Merge branch 'fea-groupby-apply-jit' of github.com:bwyogatama/cudf in…
brandon-b-miller Jan 19, 2023
95fa402
remove redundant comments
brandon-b-miller Jan 19, 2023
3de3add
style fixes
brandon-b-miller Jan 19, 2023
865bb5d
debug statements
brandon-b-miller Jan 20, 2023
7788944
patch numba linker based off groupby ptx file
brandon-b-miller Jan 20, 2023
bdea84c
Fix idxmin/max bug
PointKernel Jan 20, 2023
0110075
Use static_cast to avoid raw casting
PointKernel Jan 20, 2023
b039ce7
Cleanups: reinterpret_cast + remove redundant sync
PointKernel Jan 20, 2023
a8c3a75
Merge branch 'branch-23.02' into fea-groupby-apply-jit
brandon-b-miller Jan 20, 2023
21792c6
Merge branch 'fea-groupby-apply-jit' of github.com:bwyogatama/cudf in…
brandon-b-miller Jan 20, 2023
14dc674
Simplify block min/max logic
PointKernel Jan 20, 2023
3a48a96
Merge branch 'fea-groupby-apply-jit' of github.com:bwyogatama/cudf in…
PointKernel Jan 20, 2023
9766233
Merge remote-tracking branch 'upstream/branch-23.02' into groupby
PointKernel Jan 20, 2023
99af3f2
Replace custom atomic add with cuda atomic_ref
PointKernel Jan 20, 2023
321fdab
Simplify block sum logic
PointKernel Jan 20, 2023
c91a589
Simplify block var logic
PointKernel Jan 20, 2023
aa47763
Merge branch 'branch-23.02' into fea-groupby-apply-jit
brandon-b-miller Jan 20, 2023
cbc13e6
Refactor block idxmin/idxmax
PointKernel Jan 20, 2023
ab20731
Fix a minor bug
PointKernel Jan 20, 2023
f721f9c
Merge branch 'fea-groupby-apply-jit' of github.com:bwyogatama/cudf in…
PointKernel Jan 20, 2023
a24f09e
Fix the floating point min value bug
PointKernel Jan 20, 2023
d908621
Refactor with CUDA CG
PointKernel Jan 20, 2023
595746a
C++ changes: (1) Addressing more reviewer's comment, (2) Replacing cu…
bwyogatama Jan 22, 2023
9a93af7
Merge branch 'branch-23.02' into fea-groupby-apply-jit
brandon-b-miller Jan 23, 2023
9d3c431
Merge branch 'fea-groupby-apply-jit' of github.com:bwyogatama/cudf in…
brandon-b-miller Jan 23, 2023
bb8b7c3
style
brandon-b-miller Jan 23, 2023
1f475f0
Use proper cuda thread scope
PointKernel Jan 23, 2023
0bb68bc
Merge remote-tracking branch 'upstream/branch-23.02' into groupby
PointKernel Jan 23, 2023
209188f
Request cpp review for udf_cpp
PointKernel Jan 23, 2023
826ed25
Remove unsupported data types
PointKernel Jan 23, 2023
1cf91ea
Use exclusively thread 0 to write var output + minor cleanups
PointKernel Jan 23, 2023
381dd00
Addressing more reviewer's commment and fix mean bug
bwyogatama Jan 23, 2023
ee87548
error if nulls are present, dont not patch numba linker in strings_udf
brandon-b-miller Jan 23, 2023
1d4edc8
Style
brandon-b-miller Jan 23, 2023
4ff80f4
Merge branch 'fea-groupby-apply-jit' of github.com:bwyogatama/cudf in…
PointKernel Jan 23, 2023
4fe21fb
Replace int with int32_t
PointKernel Jan 23, 2023
3fbe3ff
Remove unused template specializations
PointKernel Jan 23, 2023
b59d31e
Merge remote-tracking branch 'upstream/branch-23.02' into groupby
PointKernel Jan 23, 2023
9af3670
update utility functions to no longer be strings_udf specific
brandon-b-miller Jan 24, 2023
afd0949
tweak thread guard logic in groupby template
brandon-b-miller Jan 24, 2023
c05e889
Merge branch 'branch-23.02' into fea-groupby-apply-jit
brandon-b-miller Jan 24, 2023
1828ef7
Apply suggestions from code review
brandon-b-miller Jan 24, 2023
6489950
fix small bug
brandon-b-miller Jan 24, 2023
9b83d78
refactor group constructor lowering
brandon-b-miller Jan 24, 2023
73a2ba1
partially address reviews
brandon-b-miller Jan 24, 2023
4dfb790
continue addressing reviews
brandon-b-miller Jan 24, 2023
43be944
extraneous copyright
brandon-b-miller Jan 24, 2023
b5f8f63
fix small comment error in cmake
brandon-b-miller Jan 24, 2023
8bbd725
inline _is_jit_supported_type
brandon-b-miller Jan 24, 2023
2df3216
adjust logic in maybe_patch_numba_linker
brandon-b-miller Jan 24, 2023
e8137e3
Apply suggestions from code review
PointKernel Jan 24, 2023
c5e7445
Correct numerical limits
PointKernel Jan 24, 2023
761261c
Apply suggestions from code review
brandon-b-miller Jan 24, 2023
9884897
fix small syntax error
brandon-b-miller Jan 24, 2023
41b42c7
add an updater to update-version.sh
brandon-b-miller Jan 24, 2023
5855f5c
refactor groupby.apply top level impl into separate methods
brandon-b-miller Jan 24, 2023
d6a3ef2
GroupType.size_type -> GroupType.group_size_type
brandon-b-miller Jan 24, 2023
9b60a62
introduce group_size_type as a global
brandon-b-miller Jan 24, 2023
f0a9af8
use index_default_type in idxmax/idxmin lowering
brandon-b-miller Jan 24, 2023
6708655
rename some utility functions and add docs
brandon-b-miller Jan 24, 2023
3e5149d
tweak previous function
brandon-b-miller Jan 24, 2023
c253b8f
Addressing reviewers' comments
bwyogatama Jan 24, 2023
51d7ec7
Merge branch 'fea-groupby-apply-jit' of https://github.com/bwyogatama…
bwyogatama Jan 24, 2023
bae845d
unused import
brandon-b-miller Jan 24, 2023
e91b641
Replace std numeric limits with cudf device operators
PointKernel Jan 24, 2023
0d35554
Merge branch 'fea-groupby-apply-jit' of github.com:bwyogatama/cudf in…
PointKernel Jan 24, 2023
73892e1
add tests for special values
brandon-b-miller Jan 24, 2023
08cbcb7
Merge branch 'fea-groupby-apply-jit' of github.com:bwyogatama/cudf in…
brandon-b-miller Jan 24, 2023
81bfeb1
Apply suggestions from code review
PointKernel Jan 25, 2023
3d76481
Code formatting
PointKernel Jan 25, 2023
97490af
Add more special value tests
PointKernel Jan 25, 2023
43694f7
Fix bugs when all values are nans
PointKernel Jan 25, 2023
928d404
Fix a result init bug in are_all_nans
PointKernel Jan 25, 2023
8079047
separate out idmax and idxmin tests with special values
brandon-b-miller Jan 25, 2023
11c0eb6
remove redundant tests
brandon-b-miller Jan 25, 2023
3a5afa6
answer is the offsets not just an array of zeroes
brandon-b-miller Jan 25, 2023
4d719b5
dynamically register reductions
brandon-b-miller Jan 25, 2023
5c5e37c
Add corner cases handling to idxmin/idxmax
PointKernel Jan 25, 2023
fac8d70
Apply suggestions from code review
PointKernel Jan 25, 2023
2f9cc76
Remove unroll pragma
PointKernel Jan 25, 2023
961b3b9
Merge branch 'fea-groupby-apply-jit' of github.com:bwyogatama/cudf in…
PointKernel Jan 25, 2023
5e6d4aa
Merge remote-tracking branch 'upstream/branch-23.02' into groupby
PointKernel Jan 25, 2023
6665ef9
address remaining reviews
brandon-b-miller Jan 26, 2023
62b5a99
Merge branch 'fea-groupby-apply-jit' of github.com:bwyogatama/cudf in…
brandon-b-miller Jan 26, 2023
62a8928
fix tests
brandon-b-miller Jan 26, 2023
de6b54c
Code formatting
PointKernel Jan 26, 2023
0c3d5a0
go back to index_default_type = types.int64 explicitly
brandon-b-miller Jan 26, 2023
b0e8c29
Merge branch 'fea-groupby-apply-jit' of github.com:bwyogatama/cudf in…
brandon-b-miller Jan 26, 2023
5a5c0fb
merge resolve conflicts
brandon-b-miller Jan 26, 2023
5db0b6c
style
brandon-b-miller Jan 26, 2023
7e2ca13
Update python/cudf/udf_cpp/groupby/CMakeLists.txt
brandon-b-miller Jan 26, 2023
40b8ce9
Cast mean results to double
PointKernel Jan 26, 2023
bb8e0c3
Merge branch 'fea-groupby-apply-jit' of github.com:bwyogatama/cudf in…
PointKernel Jan 26, 2023
0952784
Merge remote-tracking branch 'upstream/branch-23.02' into groupby
PointKernel Jan 26, 2023
e0d0230
Merge branch 'branch-23.02' into fea-groupby-apply-jit
PointKernel Jan 26, 2023
81860c5
address reviews
brandon-b-miller Jan 27, 2023
2f352bc
minor edits
brandon-b-miller Jan 27, 2023
0b407c8
Compute blockstd via blockvar
PointKernel Jan 27, 2023
568ab97
Merge branch 'fea-groupby-apply-jit' of github.com:bwyogatama/cudf in…
PointKernel Jan 27, 2023
83f8d88
Merge remote-tracking branch 'upstream/branch-23.02' into groupby
PointKernel Jan 27, 2023
dbd5eeb
Use atomic operations to avoid concurrent writes
PointKernel Jan 27, 2023
eaa8ff7
Use int64_t atomic ref
PointKernel Jan 27, 2023
3 changes: 2 additions & 1 deletion .github/CODEOWNERS
@@ -1,5 +1,6 @@
 #cpp code owners
-cpp/ @rapidsai/cudf-cpp-codeowners
+cpp/ @rapidsai/cudf-cpp-codeowners
+python/cudf/udf_cpp/ @rapidsai/cudf-cpp-codeowners
 
 #python code owners
 python/ @rapidsai/cudf-python-codeowners
1 change: 1 addition & 0 deletions .gitignore
@@ -31,6 +31,7 @@ python/cudf/*/_cuda/*.cpp
 python/cudf/*.ipynb
 python/cudf/.ipynb_checkpoints
 python/*/record.txt
+python/cudf/cudf/core/udf/*.ptx
 python/cudf_kafka/*/_lib/**/*.cpp
 python/cudf_kafka/*/_lib/**/*.h
 python/custreamz/*/_lib/**/*.cpp
11 changes: 9 additions & 2 deletions python/cudf/CMakeLists.txt
@@ -1,5 +1,5 @@
 # =============================================================================
-# Copyright (c) 2022, NVIDIA CORPORATION.
+# Copyright (c) 2022-2023, NVIDIA CORPORATION.
 #
 # Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
 # in compliance with the License. You may obtain a copy of the License at
@@ -17,6 +17,8 @@ cmake_minimum_required(VERSION 3.23.1 FATAL_ERROR)
 set(cudf_version 23.02.00)
 
 include(../../fetch_rapids.cmake)
+include(rapids-cuda)
+rapids_cuda_init_architectures(cudf-python)
 
 project(
   cudf-python
@@ -25,7 +27,11 @@ project(
   # language to be enabled here. The test project that is built in scikit-build to verify
   # various linking options for the python library is hardcoded to build with C, so until
   # that is fixed we need to keep C.
-  C CXX
+  C
+  CXX
+  # Temporarily enabling for groupby UDFs compilation until we come up with a better
Review comment (Contributor): Can we do two things?

  1. Open an issue tracking this temporary problem.
  2. Mark it with a TODO, i.e. change the comment to
     "# TODO: Temporarily enabling for groupby UDFs compilation until we come up with a better"

So that this doesn't disappear into the mists of time.

Reply (Contributor): Is this even a problem anymore? It's a hard requirement now and needs to be enabled to build the python package, so I'd assume it belongs here indefinitely at this point unless this is hacky for some reason.
+  # solution.
+  CUDA
 )

option(FIND_CUDF_CPP "Search for existing CUDF C++ installations before defaulting to local files"
@@ -117,6 +123,7 @@ endif()
 rapids_cython_init()
 
 add_subdirectory(cudf/_lib)
+add_subdirectory(udf_cpp/groupby)
 
 include(cmake/Modules/ProtobufHelpers.cmake)
 codegen_protoc(cudf/utils/metadata/orc_column_statistics.proto)
11 changes: 9 additions & 2 deletions python/cudf/cudf/__init__.py
@@ -1,8 +1,9 @@
-# Copyright (c) 2018-2022, NVIDIA CORPORATION.
+# Copyright (c) 2018-2023, NVIDIA CORPORATION.
 
 from cudf.utils.gpu_utils import validate_setup
 
 validate_setup()
+import os
Review comment (Contributor): Unused import. (I think flake8 might be disabled on this file... it should have caught that.) Suggested change: remove the `import os` line.

 import cupy
 from numba import config as numba_config, cuda
@@ -88,7 +89,13 @@
     pass
 else:
     # Patch Numba to support CUDA enhanced compatibility.
-    patch_numba_linker_if_needed()
+    # cuDF requires a stronger set of conditions than what is
+    # checked by patch_numba_linker_if_needed due to the PTX
+    # files needed for JIT Groupby Apply and string UDFs
+    from cudf.core.udf.utils import _setup_numba_linker
+
+    _setup_numba_linker(os.path.dirname(__file__) + "/core/udf/", "function_")
 
     del patch_numba_linker_if_needed
 
 cuda.set_memory_manager(rmm.RMMNumbaManager)
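The patched setup above hands the linker a directory derived from the package's `__file__`. A small pure-Python sketch of that path construction (the install path below is a made-up example, and `os.path.join` stands in for the string concatenation used in the diff):

```python
import os

# Illustrative sketch: derive the PTX search directory that is handed to
# _setup_numba_linker from the package's __init__.py location.
# The package_file path is hypothetical, not a real install location.
package_file = "/opt/env/lib/python3.10/site-packages/cudf/__init__.py"
ptx_dir = os.path.join(os.path.dirname(package_file), "core/udf/")
print(ptx_dir)  # /opt/env/lib/python3.10/site-packages/cudf/core/udf/
```

This mirrors `os.path.dirname(__file__) + "/core/udf/"` in the diff; the second argument, `"function_"`, is the prefix the linker setup looks for in the shipped PTX files.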
57 changes: 40 additions & 17 deletions python/cudf/cudf/core/groupby/groupby.py
@@ -23,6 +23,7 @@
 from cudf.core.column_accessor import ColumnAccessor
 from cudf.core.mixins import Reducible, Scannable
 from cudf.core.multiindex import MultiIndex
+from cudf.core.udf.groupby_utils import jit_groupby_apply
 from cudf.utils.utils import GetAttrGetItemMixin, _cudf_nvtx_annotate
@@ -786,14 +787,19 @@ def pipe(self, func, *args, **kwargs):
"""
return cudf.core.common.pipe(self, func, *args, **kwargs)

def apply(self, function, *args):
def apply(self, function, *args, engine="cudf"):
brandon-b-miller marked this conversation as resolved.
Show resolved Hide resolved
bdice marked this conversation as resolved.
Show resolved Hide resolved
"""Apply a python transformation function over the grouped chunk.

Parameters
----------
func : function
brandon-b-miller marked this conversation as resolved.
Show resolved Hide resolved
The python transformation function that will be applied
on the grouped chunk.
engine: {'cudf', 'jit'}, default 'cudf'
Selects the GroupBy.apply implementation. Use `jit` to
select the numba JIT pipeline.
For more information, see the `cuDF guide to user defined functions
Review comment (Contributor): It sounds like this link should explain the Numba JIT pipeline, but it doesn't. Will we update that notebook in this PR (or before the 23.02 release)? If we don't have time to update that notebook, maybe we can explain the choice of 'cudf' or 'jit' a little more in this docstring. Currently I don't think users will know enough to decide which is appropriate.

Reply (Contributor): There will definitely be a notebook update asap! It's in progress. I'll expound a bit here regardless though.
+            <https://docs.rapids.ai/api/cudf/stable/user_guide/guide-to-udfs.html>`__.
 
         Examples
         --------
@@ -855,25 +861,40 @@ def mult(df):
             raise TypeError(f"type {type(function)} is not callable")
         group_names, offsets, group_keys, grouped_values = self._grouped()
 
-        ngroups = len(offsets) - 1
-        if ngroups > self._MAX_GROUPS_BEFORE_WARN:
-            warnings.warn(
-                f"GroupBy.apply() performance scales poorly with "
-                f"number of groups. Got {ngroups} groups."
-            )
-
-        chunks = [
-            grouped_values[s:e] for s, e in zip(offsets[:-1], offsets[1:])
-        ]
-        chunk_results = [function(chk, *args) for chk in chunks]
-        if not len(chunk_results):
-            return self.obj.head(0)
-
-        if cudf.api.types.is_scalar(chunk_results[0]):
-            result = cudf.Series(chunk_results, index=group_names)
-            result.index.names = self.grouping.names
-        else:
-            if isinstance(chunk_results[0], cudf.Series) and isinstance(
+        if engine == "jit":
+            # Nulls are not yet supported
+            for colname in self.grouping.values._data.keys():
+                if self.obj._data[colname].has_nulls():
+                    raise ValueError(
+                        "Nulls not yet supported with groupby JIT engine"
+                    )
+
+            chunk_results = jit_groupby_apply(
+                offsets, grouped_values, function, *args
+            )
+            result = cudf.Series(chunk_results, index=group_names)
+            result.index.names = self.grouping.names
+            result = result.reset_index()
+            result[None] = result.pop(0)
+        elif engine == "cudf":
+            ngroups = len(offsets) - 1
+            if ngroups > self._MAX_GROUPS_BEFORE_WARN:
+                warnings.warn(
+                    f"GroupBy.apply() performance scales poorly with "
+                    f"number of groups. Got {ngroups} groups."
Review comment (Contributor): Should we update this message with a suggestion to try using engine="jit"?

Reply (Contributor): Added a note about this.
+                )
+
+            chunks = [
+                grouped_values[s:e] for s, e in zip(offsets[:-1], offsets[1:])
+            ]
+            chunk_results = [function(chk, *args) for chk in chunks]
Review comment (Contributor): We know this is slow because it involves a lot of Python invocations of GPU kernels when there are many groups. However, have we tried using some thread parallelism or other means to reduce/conceal that overhead? Out of scope for this PR but I would like to know that we've made an attempt at improving this code path since we can't always use "jit".

Reply (Contributor): We've tried streams and a number of other approaches. @shwina led the charge on this a while back I believe. I think the result was that the python overhead was dominating the runtime regardless of being able to parallelize over kernel launches somewhat. @wence- had some more advanced ideas we brainstormed earlier this year for a more holistic groupby.apply combining a few elements from across our learnings I believe. Maybe time to restart that conversation. Ultimately though the lack of traction with any previous approach is what led to this PR in the first place.

Reply (Contributor): As Brandon says, the fact that our previous attempts didn't really improve much is the reason we ended up moving towards JIT. @isVoid put together the streams prototype IIRC. To really improve the performance we'll need to reduce Python overhead all the way down the call stack.
+            if not len(chunk_results):
+                return self.obj.head(0)
+
+            if cudf.api.types.is_scalar(chunk_results[0]):
+                result = cudf.Series(chunk_results, index=group_names)
+                result.index.names = self.grouping.names
+            elif isinstance(chunk_results[0], cudf.Series) and isinstance(
                 self.obj, cudf.DataFrame
             ):
                 result = cudf.concat(chunk_results, axis=1).T
@@ -884,6 +905,8 @@ def mult(df):
                 index_data = group_keys._data.copy(deep=True)
                 index_data[None] = grouped_values.index._column
                 result.index = cudf.MultiIndex._from_data(index_data)
+        else:
+            raise ValueError(f"Unsupported engine '{engine}'")
 
         if self._sort:
             result = result.sort_index()
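The "cudf" engine branch above slices the grouped values into per-group chunks using the group offsets, then calls the UDF once per chunk. A pure-Python sketch of that slicing, with a plain list standing in for the grouped cudf frame and made-up illustrative values:

```python
# Pure-Python sketch of the offsets-based chunking in the "cudf" engine
# path; grouped_values and offsets here are hypothetical example data.
grouped_values = [10, 11, 20, 21, 22, 30]
offsets = [0, 2, 5, 6]  # group i spans grouped_values[offsets[i]:offsets[i+1]]

# mirrors: chunks = [grouped_values[s:e] for s, e in zip(offsets[:-1], offsets[1:])]
chunks = [grouped_values[s:e] for s, e in zip(offsets[:-1], offsets[1:])]
print(chunks)  # [[10, 11], [20, 21, 22], [30]]

# mirrors: chunk_results = [function(chk, *args) for chk in chunks]
# with max() as a stand-in UDF
chunk_results = [max(chunk) for chunk in chunks]
print(chunk_results)  # [11, 22, 30]
```

The per-chunk Python call is exactly the overhead the "jit" engine avoids: `jit_groupby_apply` hands the same offsets and values to a single compiled kernel instead of looping in Python.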
4 changes: 2 additions & 2 deletions python/cudf/cudf/core/udf/__init__.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2022, NVIDIA CORPORATION.
+# Copyright (c) 2022-2023, NVIDIA CORPORATION.
 
 from functools import lru_cache
 
@@ -9,7 +9,7 @@
 from cudf.core.udf import api, row_function, utils
 from cudf.utils.dtypes import STRING_TYPES
 
-from . import masked_lowering, masked_typing
+from . import groupby_lowering, groupby_typing, masked_lowering, masked_typing
 
 _units = ["ns", "ms", "us", "s"]
 _datetime_cases = {types.NPDatetime(u) for u in _units}
169 changes: 169 additions & 0 deletions python/cudf/cudf/core/udf/groupby_lowering.py
@@ -0,0 +1,169 @@
# Copyright (c) 2022-2023, NVIDIA CORPORATION.

from functools import partial

from numba import types
from numba.core import cgutils
from numba.core.extending import lower_builtin
from numba.core.typing import signature as nb_signature
from numba.cuda.cudaimpl import lower as cuda_lower

from cudf.core.udf.groupby_typing import (
    SUPPORTED_GROUPBY_NUMBA_TYPES,
    Group,
    GroupType,
    call_cuda_functions,
    index_default_type,
)


def lowering_function(context, builder, sig, args, function):
    """
    Instruction boilerplate used for calling a groupby reduction
    __device__ function. Centers around a forward declaration of
    this function and adds the pre/post processing instructions
    necessary for calling it.
    """
    # return type
    retty = sig.return_type

    # a variable logically corresponding to the calling `Group`
    grp = cgutils.create_struct_proxy(sig.args[0])(
        context, builder, value=args[0]
    )

    # what specific (numba) GroupType
    grp_type = sig.args[0]
    group_dataty = grp_type.group_data_type

    # logically take the address of the group's data pointer
    group_data_ptr = builder.alloca(grp.group_data.type)
    builder.store(grp.group_data, group_data_ptr)

    # obtain the correct forward declaration from registry
    type_key = (sig.return_type, grp_type.group_scalar_type)
    func = call_cuda_functions[function][type_key]

    # insert the forward declaration and return its result
    # pass it the data pointer and the group's size
    return context.compile_internal(
        builder,
        func,
        nb_signature(retty, group_dataty, grp_type.size_type),
        (builder.load(group_data_ptr), grp.size),
    )
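The registry lookup in `lowering_function` is a two-level dictionary: reduction name first, then a `(return_type, group_scalar_type)` key. A pure-Python sketch of that pattern, with string type names and trivial callables standing in for real numba types and device-function declarations:

```python
# Sketch of the call_cuda_functions registry lookup; keys and
# implementations are illustrative stand-ins, not the real cudf tables.
call_cuda_functions = {
    "max": {("float64", "float64"): lambda data, size: "BlockMax_float64"},
    "sum": {("int64", "int64"): lambda data, size: "BlockSum_int64"},
}

def lookup(function, return_type, scalar_type):
    # mirrors: func = call_cuda_functions[function][type_key]
    return call_cuda_functions[function][(return_type, scalar_type)]

impl = lookup("max", "float64", "float64")
print(impl(None, 0))  # BlockMax_float64
```

Keying on the type pair is what lets one generic lowering body dispatch to a separately compiled device function per dtype.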


@lower_builtin(Group, types.Array, types.int64, types.Array)
def group_constructor(context, builder, sig, args):
    """
    Instruction boilerplate used for instantiating a Group
    struct from a data pointer, an index pointer, and a size
    """

    group_data, size, index = args

    # a variable logically corresponding to the calling `Group`
    grp = cgutils.create_struct_proxy(sig.return_type)(context, builder)

    # the group data array and its pointer
    arr_group_data = cgutils.create_struct_proxy(sig.args[0])(
        context, builder, value=group_data
    )
    group_data_ptr = arr_group_data.data

    # the group index array and its pointer
    arr_index = cgutils.create_struct_proxy(sig.args[2])(
        context, builder, value=index
    )
    index_ptr = arr_index.data

    # fill the struct explicitly
    grp.group_data = group_data_ptr
    grp.index = index_ptr
    grp.size = size

    # return the struct by value
    return grp._getvalue()

Review comment (Contributor): Maybe just my stylistic preference, but you may as well assign to the struct members directly rather than storing in intermediates (including replacing size with grp.size in the tuple unpacking of args, which will require a slight reordering to create the group before that unpacking). IMO the extra redirection only reduces readability in this case.

Reply (Contributor): I refactored this as you suggested, it's definitely better now!


Review comment (Contributor): Why is Group capitalized? This seems unusual for a function name. Same for variables below.

Reply (Contributor): I think this was changed by the time this comment landed. Let me know if the new names look good to you. Happy to iterate still :)

Review comment (Contributor): Would it be possible for this function to call lowering_function, or for both of these to use a common helper function? It would be nice to reduce duplication. This function name should also mirror lowering_function so that it's obvious that this is a specialized version of that. Right now the two seem like completely independent functions based on the names.

Reply (Contributor): I didn't manage to come up with a way of fusing these that really reduced the lines of code. I found that the addition of the extra index argument and what's needed to form it makes it hard to excise much common code in a way that seems helpful. Open to suggestions here but leaving as-is for now.

def cuda_Group_idx_max_or_min(context, builder, sig, args, function):
"""
Instruction boilerplate used for calling a groupby reduction
__device__ function in the case where the function is either
`idxmax` or `idxmin`. See `lowering_function` for details. This
lowering differs from other reductions due to the presence of
the index. This results in the forward declaration expecting
an extra arg.
"""
retty = sig.return_type

grp = cgutils.create_struct_proxy(sig.args[0])(
context, builder, value=args[0]
)
grp_type = sig.args[0]

if grp_type.index_type != index_default_type:
brandon-b-miller marked this conversation as resolved.
Show resolved Hide resolved
raise TypeError(
f"Only inputs with default index dtype {index_default_type} "
"are supported."
)

group_dataty = grp_type.group_data_type
group_data_ptr = builder.alloca(grp.group_data.type)
builder.store(grp.group_data, group_data_ptr)

index_dataty = grp_type.group_index_type
index_ptr = builder.alloca(grp.index.type)
builder.store(grp.index, index_ptr)
type_key = (types.int64, grp_type.group_scalar_type)
func = call_cuda_functions[function][type_key]

return context.compile_internal(
builder,
func,
nb_signature(retty, group_dataty, index_dataty, grp_type.size_type),
(builder.load(group_data_ptr), builder.load(index_ptr), grp.size),
)


cuda_Group_max = partial(lowering_function, function="max")
cuda_Group_min = partial(lowering_function, function="min")
cuda_Group_sum = partial(lowering_function, function="sum")
cuda_Group_mean = partial(lowering_function, function="mean")
cuda_Group_std = partial(lowering_function, function="std")
cuda_Group_var = partial(lowering_function, function="var")

cuda_Group_idxmax = partial(cuda_Group_idx_max_or_min, function="idxmax")
cuda_Group_idxmin = partial(cuda_Group_idx_max_or_min, function="idxmin")


def cuda_Group_size(context, builder, sig, args):
    grp = cgutils.create_struct_proxy(sig.args[0])(
        context, builder, value=args[0]
    )
    return grp.size


def cuda_Group_count(context, builder, sig, args):
    grp = cgutils.create_struct_proxy(sig.args[0])(
        context, builder, value=args[0]
    )
    return grp.size


for ty in SUPPORTED_GROUPBY_NUMBA_TYPES:
    cuda_lower("GroupType.max", GroupType(ty))(cuda_Group_max)
    cuda_lower("GroupType.min", GroupType(ty))(cuda_Group_min)
    cuda_lower("GroupType.sum", GroupType(ty))(cuda_Group_sum)
    cuda_lower("GroupType.count", GroupType(ty))(cuda_Group_count)
    cuda_lower("GroupType.size", GroupType(ty))(cuda_Group_size)
    cuda_lower("GroupType.mean", GroupType(ty))(cuda_Group_mean)
    cuda_lower("GroupType.std", GroupType(ty))(cuda_Group_std)
    cuda_lower("GroupType.var", GroupType(ty))(cuda_Group_var)
    cuda_lower("GroupType.idxmax", GroupType(ty, types.int64))(
        cuda_Group_idxmax
    )
    cuda_lower("GroupType.idxmin", GroupType(ty, types.int64))(
        cuda_Group_idxmin
    )
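The registration loop uses numba's `cuda_lower(...)(impl)` idiom: calling the registrar returns a decorator that is immediately applied to the implementation, recording it under a (name, type signature) key. A pure-Python sketch of that call-then-apply shape, with a plain dict standing in for numba's internal lowering tables:

```python
# Sketch of the registrar-returns-a-decorator pattern behind
# cuda_lower(...)(impl); registry and the type strings are
# illustrative stand-ins for numba's internal machinery.
registry = {}

def lower(name, ty):
    def decorator(impl):
        registry[(name, ty)] = impl
        return impl
    return decorator

def group_max_impl(context, builder, sig, args):
    return "max lowered"

# register one implementation for several element types
for ty in ("int64", "float64"):
    lower("GroupType.max", ty)(group_max_impl)

print(registry[("GroupType.max", "float64")] is group_max_impl)  # True
```

Looping over `SUPPORTED_GROUPBY_NUMBA_TYPES` this way registers every reduction for every supported dtype without writing one decorated function per combination.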