Rework some code logic to reduce iterator and comparator inlining to improve compile time #12900

davidwendt · 2023-03-07T19:49:47Z

Description

Disabling inlining of the device code logic for the row operators for nested column types did not work as hoped.
Some files took longer to compile and some functions ran 20% slower for large rows.

Reworking individual source files to break up the code logic into multiple kernels seems to work well for compile time while having a smaller effect on performance. The goal is to only rework the nested column code paths.
Here are some source files that have compile time issues and are improved in this PR.

source file	compile time (current)	compile time (this PR)
stream_compaction/unique_count.cu	18 min	13 min
groupby/sort/group_nunique.cu	16 min	2 min
stream_compaction/unique.cu	16 min	5 min
groupby/sort/sort_helper.cu	10 min	6.5 min
search/contains_scalar.cu	12 min	4.7 min
sort/is_sorted.cu	9 min	7 min
groupby/sort/group_std.cu	7 min	1.2 min
groupby/sort/group_m2.cu	6 min	1.2 min

Available benchmarks showed minimal impact on performance.
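As a rough illustration of "breaking the code logic into multiple kernels", here is a host-only C++ sketch of the transform-then-count pattern that the commit messages mention for unique_count. It is not the code from this PR: the std:: algorithms stand in for their Thrust counterparts (on the GPU each call would be its own kernel), and `row_equal` and `unique_count_two_pass` are hypothetical names invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for an expensive nested-row equality comparator.
struct row_equal {
  std::vector<std::vector<int>> const* rows;
  bool operator()(std::size_t lhs, std::size_t rhs) const
  {
    return (*rows)[lhs] == (*rows)[rhs];
  }
};

// Counts runs of consecutive equal rows (the unique rows of sorted input).
std::size_t unique_count_two_pass(std::vector<std::vector<int>> const& rows)
{
  if (rows.empty()) { return 0; }
  row_equal const comp{&rows};

  std::vector<std::size_t> indices(rows.size());
  std::iota(indices.begin(), indices.end(), std::size_t{0});

  // Pass 1: the heavy comparator appears only in this simple transform.
  // is_new[i] is 1 when row i starts a new run of equal rows.
  std::vector<std::uint8_t> is_new(rows.size());
  std::transform(indices.begin(), indices.end(), is_new.begin(),
                 [comp](std::size_t i) -> std::uint8_t {
                   return (i == 0 || !comp(i, i - 1)) ? 1 : 0;
                 });

  // Pass 2: the counting step only ever sees plain bytes.
  auto const count = static_cast<std::size_t>(
    std::count(is_new.begin(), is_new.end(), std::uint8_t{1}));
  assert(count >= 1);  // non-empty input always has at least one run
  return count;
}
```

The trade-off is an extra intermediate buffer and a second pass over the data, which is consistent with the description's note that the rework has a small effect on performance.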

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

cpp/include/cudf/table/experimental/row_operators.cuh

PointKernel · 2023-03-21T20:07:16Z

Is it fair to say that we should avoid the below functions to reduce build time?

cudf::detail::make_counting_transform_iterator
thrust::make_transform_iterator
thrust::unique_copy
thrust::count_if
thrust::is_sorted

davidwendt · 2023-03-21T20:23:00Z

Not in general. The transform iterators are certainly not the issue. The copy_if and is_sorted calls both end up calling some version of a CUB reduce. If the iterators or comparators they are given contain more than a few lines of logic, the code bloat can be extreme because those inputs are inlined many times throughout the generated kernel. unique_copy has similar issues with inlining the comparator.
I will try to work on a more formal document outlining what to avoid and how to get around this problem.
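One way around the problem described above, sketched for is_sorted: evaluate the comparator once per adjacent pair in a simple transform, then reduce over the resulting flags instead of handing the comparator to the reducing algorithm. This is a host-only C++ analogy using std:: algorithms in place of Thrust, not the PR's implementation; `row_less` and `is_sorted_two_pass` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a heavyweight lexicographic row comparator.
struct row_less {
  std::vector<std::vector<int>> const* rows;
  bool operator()(std::size_t lhs, std::size_t rhs) const
  {
    return (*rows)[lhs] < (*rows)[rhs];
  }
};

bool is_sorted_two_pass(std::vector<std::vector<int>> const& rows)
{
  if (rows.size() < 2) { return true; }
  row_less const comp{&rows};

  // One flag per adjacent pair (i, i + 1).
  std::vector<std::size_t> indices(rows.size() - 1);
  std::iota(indices.begin(), indices.end(), std::size_t{0});
  assert(indices.size() + 1 == rows.size());

  // Pass 1: the only place where the comparator is instantiated.
  // out_of_order[i] is 1 when row i + 1 sorts strictly before row i.
  std::vector<std::uint8_t> out_of_order(indices.size());
  std::transform(indices.begin(), indices.end(), out_of_order.begin(),
                 [comp](std::size_t i) -> std::uint8_t {
                   return comp(i + 1, i) ? 1 : 0;
                 });

  // Pass 2: a reduction over plain bytes, with no comparator involved.
  return std::count(out_of_order.begin(), out_of_order.end(),
                    std::uint8_t{1}) == 0;
}
```

The single-call alternative would pass `comp` straight to the sorted check, which is the shape the comment above identifies as prone to code bloat when the comparator is large.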

PointKernel

LGTM

Looking forward to seeing the build time guideline doc.

cpp/src/groupby/sort/sort_helper.cu

cpp/src/groupby/sort/group_m2.cu

cpp/src/sort/is_sorted.cu

bdice

This is fantastic. These optimizations look correct and seem sensible, when considering the kernel complexity that we are now able to split up. I'd like a few small comments like the ones I suggested here, to indicate that our choices of algorithms are informed by compile time. That will clear things up for the reader, and help prevent "clever" refactors down the line.

bdice · 2023-03-22T05:30:12Z

cpp/src/stream_compaction/unique.cu

-                                  keep,
-                                  stream);
+  size_type const unique_size = [&] {
+    if (cudf::detail::has_nested_columns(keys_view)) {


If this compiles faster for nested types, does it also compile faster for non-nested types? If it's possible to unify these and have a single implementation of the algorithms, I would prefer that (rather than one transform + copy_if for nested types and one unique_copy for non-nested types).

If there are considerations like runtime, memory usage, etc. that warrant two separate implementations, then let's inform the reader with some comments explaining this decision.

davidwendt · 2023-03-23

It does compile faster for non-nested types, but the performance impact was too large (20-50% increase) for this path.
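The two code paths discussed in this thread can be pictured with the following host-only C++ sketch. It is an analogy under stated assumptions rather than the code in unique.cu: std:: algorithms replace Thrust, the `has_nested` flag is a hypothetical parameter standing in for the `cudf::detail::has_nested_columns(keys_view)` check, only a keep-first policy is shown, and `row_equal` and `unique_indices` are invented names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the row equality comparator.
struct row_equal {
  std::vector<std::vector<int>> const* rows;
  bool operator()(std::size_t lhs, std::size_t rhs) const
  {
    return (*rows)[lhs] == (*rows)[rhs];
  }
};

// Returns the index of the first row of each run of equal rows.
std::vector<std::size_t> unique_indices(
  std::vector<std::vector<int>> const& rows, bool has_nested)
{
  row_equal const comp{&rows};
  std::vector<std::size_t> indices(rows.size());
  std::iota(indices.begin(), indices.end(), std::size_t{0});
  std::vector<std::size_t> out;

  if (has_nested) {
    // Nested path: comparator in a plain transform, then copy_if on flags.
    // Cheaper to compile, at the cost of an extra buffer and pass.
    std::vector<std::uint8_t> keep(rows.size());
    std::transform(indices.begin(), indices.end(), keep.begin(),
                   [comp](std::size_t i) -> std::uint8_t {
                     return (i == 0 || !comp(i, i - 1)) ? 1 : 0;
                   });
    std::copy_if(indices.begin(), indices.end(), std::back_inserter(out),
                 [&keep](std::size_t i) { return keep[i] != 0; });
  } else {
    // Non-nested path: a single unique_copy call given the comparator.
    std::unique_copy(indices.begin(), indices.end(),
                     std::back_inserter(out), comp);
  }
  assert(out.size() <= rows.size());
  return out;
}
```

Both branches produce the same result; the split exists only because, per the reply above, the two-step path was too slow at run time to use for non-nested types.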

cpp/src/stream_compaction/unique_count.cu

cpp/src/sort/is_sorted.cu

bdice

Love the comments. Thanks for that -- it helps a lot for future readers, and makes us more aware of the process of making compile time improvements.

ttnghia

I love this.

davidwendt · 2023-03-27T12:47:32Z

/merge

davidwendt · 2023-03-27T13:51:25Z

/merge

Disable inline of row operators for nested column types

bd63a2b

davidwendt added the 2 - In Progress, libcudf, improvement, and non-breaking labels Mar 7, 2023

davidwendt self-assigned this Mar 7, 2023

bdice reviewed Mar 7, 2023

cpp/include/cudf/table/experimental/row_operators.cuh (outdated, resolved)

davidwendt added 17 commits March 7, 2023 18:54

change attribute(noinline) to noinline

1446e08

Merge branch 'branch-23.04' into row-ops-no-inline

8da8470

Merge branch 'branch-23.04' into row-ops-no-inline

bcc2a14

use transform for row-ops intermediate result

2453e7d

Merge branch 'branch-23.04' into row-ops-no-inline

edd2583

Merge branch 'branch-23.04' into row-ops-no-inline

993c36d

fix bool logic from count() return

47f1332

Merge branch 'branch-23.04' into row-ops-no-inline

fb50223

add transform/count to unique_count

8cdf813

Merge branch 'branch-23.04' into row-ops-no-inline

29601dd

Merge branch 'branch-23.04' into row-ops-no-inline

b842763

Merge branch 'branch-23.04' into row-ops-no-inline

2a7f2de

Merge branch 'branch-23.04' into row-ops-no-inline

b77db9d

undo no-inline declaration

f8fe035

Merge branch 'branch-23.04' into row-ops-no-inline

76445b7

Merge branch 'branch-23.04' into row-ops-no-inline

9c5ee0c

Merge branch 'branch-23.04' into row-ops-no-inline

4cca497

davidwendt changed the title ~~Disable inline of row operators for nested column types~~ Rework row operator usage for nested column types to improve compile time Mar 16, 2023

davidwendt added 5 commits March 16, 2023 08:42

rework group-nunique to use intermediate buffer

3ccd0a4

Merge branch 'branch-23.04' into row-ops-no-inline

f994e53

use temp buffer for reduce-by-key calls

321c4d4

Merge branch 'branch-23.04' into row-ops-no-inline

7f8ed2c

cleanup comments

0370e24

davidwendt requested review from hyperbolic2346 and PointKernel March 21, 2023 18:00

Merge branch 'branch-23.04' into row-ops-no-inline

e4b7c8e

PointKernel approved these changes Mar 21, 2023

cpp/src/groupby/sort/sort_helper.cu (resolved)

cpp/src/groupby/sort/group_m2.cu (outdated, resolved)

cpp/src/sort/is_sorted.cu (outdated, resolved)

davidwendt added 2 commits March 21, 2023 19:26

prefer using counting-iterator over factory call

6784431

Merge branch 'branch-23.04' into row-ops-no-inline

25912ca

bdice reviewed Mar 22, 2023

davidwendt added 3 commits March 22, 2023 10:41

Merge branch 'branch-23.04' into row-ops-no-inline

7868b0e

add benchmarks for unique_count

69da72a

add comments for new code patterns

2a4f7de

github-actions bot added the CMake label Mar 22, 2023

fix style violation

c5870e9

davidwendt requested a review from bdice March 23, 2023 11:30

davidwendt added 5 commits March 23, 2023 07:30

Merge branch 'branch-23.04' into row-ops-no-inline

03e8835

test cmake change

c142598

revert temp cmake change

00f5130

Merge branch 'branch-23.04' into row-ops-no-inline

52f8c29

Merge branch 'branch-23.04' into row-ops-no-inline

80803a1

hyperbolic2346 approved these changes Mar 24, 2023

bdice approved these changes Mar 24, 2023

Merge branch 'branch-23.04' into row-ops-no-inline

5089cfe

ttnghia approved these changes Mar 24, 2023

Merge branch 'branch-23.04' into row-ops-no-inline

bfdf963

rapids-bot bot merged commit 12dc130 into rapidsai:branch-23.04 Mar 27, 2023

davidwendt deleted the row-ops-no-inline branch March 27, 2023 13:51

ttnghia mentioned this pull request Mar 28, 2023

Update join to use experimental row hasher and comparator #12787

Merged

3 tasks
