Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Move has_nulls functor template argument to member variable in row comparators and row hashers #6952

Closed
davidwendt opened this issue Dec 9, 2020 · 2 comments · Fixed by #9623
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@davidwendt
Copy link
Contributor

davidwendt commented Dec 9, 2020

Currently, many libcudf column APIs are compiled into two paths -- one handles nulls and the other expects no nulls. The no-nulls path is generally faster because it removes the cost of accessing the validity bitmask. To create these two paths, we generally use a template argument on the functor or function mostly consistently named has_nulls which is usually determined early in the API code logic.

Using a template argument means we can create one source code that includes handling the null logic surrounded by an if(has_nulls) statement. The compiler will then create the two paths for us from the single source. We only need to invoke them individually as appropriate. What's more, the compiler optimizer will actually remove the entire if(has_nulls) statement in the has_nulls==false compiled kernel path.

Example of `has_nulls` templated functor invocation

Here is an example of invoking a has_nulls templated functor in cpp/src/stream_compaction/drop_duplicates.cu:

if (cudf::has_nulls(keys)) {
auto comp = row_equality_comparator<true>(
*device_input_table, *device_input_table, nulls_equal == null_equality::EQUAL);
auto result_end = unique_copy(sorted_indices->view().begin<cudf::size_type>(),
sorted_indices->view().end<cudf::size_type>(),
unique_indices.begin<cudf::size_type>(),
comp,
keep,
stream);
return cudf::detail::slice(
column_view(unique_indices),
0,
thrust::distance(unique_indices.begin<cudf::size_type>(), result_end));
} else {
auto comp = row_equality_comparator<false>(
*device_input_table, *device_input_table, nulls_equal == null_equality::EQUAL);
auto result_end = unique_copy(sorted_indices->view().begin<cudf::size_type>(),
sorted_indices->view().end<cudf::size_type>(),
unique_indices.begin<cudf::size_type>(),
comp,
keep,
stream);
return cudf::detail::slice(
column_view(unique_indices),
0,
thrust::distance(unique_indices.begin<cudf::size_type>(), result_end));
}

Note the only difference between the if and else clauses is the instantiation of the row_equality_comparator with the template parameter.

The row_equality_comparator is defined in the cpp/include/cudf/table/row_operators.cuh:

template <bool has_nulls = true>
class row_equality_comparator {
public:

The has_nulls template argument is propagated the element_equality_comparator which has the if(has_nulls) statement.

Some of these invocations can be large and somewhat error prone.
https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/src/sort/sort_impl.cuh#L68-L97
https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/src/hash/hashing.cu#L138-L174
https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/src/search/search.cu#L110-L142
https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/src/groupby/sort/group_nunique.cu#L49-L90
Granted, some of these are only a few lines of code and clang formatting perhaps makes them look larger.

All of this means we can create a fast-path for non-null column cases but at the cost of compile time and size since much of the code is duplicated for us by the compiler. In some places, the null handling is a minimal set of instructions compared to the overall kernel code size. Also, in APIs that operate on tables (e.g. sort), a single null in any column in the table will cause the API to execute through the null-handling path for all tables/columns.

The dual compile path effects some of our biggest compile time offenders.

Here are the current top 30 source files ordered by compile time
1265968 CMakeFiles/cudf_base.dir/src/sort/sort.cu.o
1259518 CMakeFiles/cudf_base.dir/src/sort/stable_sort.cu.o
 991023 CMakeFiles/cudf_reductions.dir/src/reductions/all.cu.o
 983376 CMakeFiles/cudf_reductions.dir/src/reductions/any.cu.o
 728204 CMakeFiles/cudf_reductions.dir/src/reductions/sum_of_squares.cu.o
 727802 CMakeFiles/cudf_reductions.dir/src/reductions/product.cu.o
 718036 CMakeFiles/cudf_reductions.dir/src/reductions/sum.cu.o
 516079 CMakeFiles/cudf_rolling.dir/src/rolling/rolling.cu.o
 486964 CMakeFiles/cudf_reductions.dir/src/reductions/mean.cu.o
 444853 CMakeFiles/cudf_reductions.dir/src/reductions/scan.cu.o
 442416 CMakeFiles/cudf_join.dir/src/join/semi_join.cu.o
 414210 CMakeFiles/cudf_base.dir/src/groupby/hash/groupby.cu.o
 396110 CMakeFiles/cudf_hash.dir/src/hash/hashing.cu.o
 383931 CMakeFiles/cudf_base.dir/src/strings/sorting/sorting.cu.o
 304577 CMakeFiles/cudf_reductions.dir/src/reductions/std.cu.o
 300738 CMakeFiles/cudf_base.dir/src/stream_compaction/distinct_count.cu.o
 298303 CMakeFiles/cudf_reductions.dir/src/reductions/var.cu.o
 268265 CMakeFiles/cudf_base.dir/src/search/search.cu.o
 256580 CMakeFiles/cudf_base.dir/src/unary/math_ops.cu.o
 247038 CMakeFiles/cudf_base.dir/src/groupby/sort/group_nunique.cu.o
 222540 CMakeFiles/cudf_base.dir/src/groupby/sort/sort_helper.cu.o
 219502 CMakeFiles/cudf_base.dir/src/quantiles/quantile.cu.o
 204364 CMakeFiles/cudf_base.dir/src/sort/rank.cu.o
 203926 CMakeFiles/cudf_base.dir/src/stream_compaction/drop_duplicates.cu.o
 188213 CMakeFiles/cudf_base.dir/src/sort/is_sorted.cu.o
 185831 CMakeFiles/cudf_ast.dir/src/ast/transform.cu.o
 182763 CMakeFiles/cudf_base.dir/src/copying/copy.cu.o
 166152 CMakeFiles/cudf_base.dir/src/copying/gather.cu.o
 157837 CMakeFiles/cudf_partitioning.dir/src/partitioning/partitioning.cu.o
 131314 tests/CMakeFiles/cudftestutil.dir/utilities/column_utilities.cu.o
 ...

The first column is time in milliseconds from a ninja trace on my desktop using g++7 and CUDA 11.0.

has_nulls_baseline

Ignoring the current re-ballooning of the reduction APIs, the sort and hashing source files are still very slow. These use the templated comparators and hashers defined in the cpp/include/cudf/table/row_comparators.cuh

I propose to move some of the instances of the has_nulls template parameter to a functor member variable or function parameter as appropriate. Overall this means that has_nulls would be checked at runtime instead of compile time but the runtime if-statement only introduces a extra kernel instruction in the has_nulls path. Since the has_nulls value is set early in the logic before the kernel is launched, the kernel should therefore incur no divergence with other threads.

I prototyped moving has_nulls in the row comparators and row hashers on my local machine which required updating about 20 source files. I compared the outputs for gbenchmarks tests for sort, hash, merge, search, and join and found no signficant change in performance. Most of the benchmarks do not include nulls so they were a good measure of the impact of the extra instruction. (I can attach the benchmark results if necessary).

Results

Top 30 compile times with has_nulls converted to a runtime parameter.

 994075 CMakeFiles/cudf_reductions.dir/src/reductions/any.cu.o
 970520 CMakeFiles/cudf_reductions.dir/src/reductions/all.cu.o
 949222 CMakeFiles/cudf_base.dir/src/sort/sort.cu.o
 944946 CMakeFiles/cudf_base.dir/src/sort/stable_sort.cu.o
 728829 CMakeFiles/cudf_reductions.dir/src/reductions/sum_of_squares.cu.o
 728487 CMakeFiles/cudf_reductions.dir/src/reductions/sum.cu.o
 727735 CMakeFiles/cudf_reductions.dir/src/reductions/product.cu.o
 500829 CMakeFiles/cudf_rolling.dir/src/rolling/rolling.cu.o
 487375 CMakeFiles/cudf_join.dir/src/join/semi_join.cu.o
 473684 CMakeFiles/cudf_reductions.dir/src/reductions/mean.cu.o
 452647 CMakeFiles/cudf_reductions.dir/src/reductions/scan.cu.o
 391825 CMakeFiles/cudf_hash.dir/src/hash/hashing.cu.o
 391768 CMakeFiles/cudf_base.dir/src/strings/sorting/sorting.cu.o
 311602 CMakeFiles/cudf_reductions.dir/src/reductions/var.cu.o
 290547 CMakeFiles/cudf_reductions.dir/src/reductions/std.cu.o
 259011 CMakeFiles/cudf_base.dir/src/unary/math_ops.cu.o
 257266 CMakeFiles/cudf_base.dir/src/groupby/hash/groupby.cu.o
 249444 CMakeFiles/cudf_base.dir/src/search/search.cu.o
 219493 CMakeFiles/cudf_base.dir/src/stream_compaction/distinct_count.cu.o
 214205 CMakeFiles/cudf_base.dir/src/quantiles/quantile.cu.o
 185999 CMakeFiles/cudf_ast.dir/src/ast/transform.cu.o
 180346 CMakeFiles/cudf_base.dir/src/stream_compaction/drop_duplicates.cu.o
 179347 CMakeFiles/cudf_base.dir/src/copying/copy.cu.o
 161075 CMakeFiles/cudf_base.dir/src/copying/gather.cu.o
 152886 CMakeFiles/cudf_base.dir/src/groupby/sort/group_nunique.cu.o
 149715 CMakeFiles/cudf_base.dir/src/sort/rank.cu.o
 144347 CMakeFiles/cudf_base.dir/src/groupby/sort/sort_helper.cu.o
 143281 CMakeFiles/cudf_partitioning.dir/src/partitioning/partitioning.cu.o
 133652 CMakeFiles/cudf_base.dir/src/sort/is_sorted.cu.o
 130779 tests/CMakeFiles/cudftestutil.dir/utilities/column_utilities.cu.o

has_nulls_removed

The overall improvement in compile time is about 7.5 minutes (30 min -> 22.5 min).
The compile size of libcudf_base.so is reduced by about 33MB (287MB -> 254MB).

Only cudf::merge showed any significant performance drop but since merge implements its own comparator functors, re-instating it's has_nulls template implementation was not difficult. And merge.cu is not a significant compile time concern (ranked at 46). This means, globally removing this kind of parameter may not be necessary. My proposal is to remove them only from the row comparators and row hashers as appropriate for now.

This is a non-breaking change since it only effects internal functors and kernels functions.

@davidwendt davidwendt added feature request New feature or request Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. labels Dec 9, 2020
@kkraus14 kkraus14 removed the Needs Triage Need team to review and classify label Dec 10, 2020
rapids-bot bot pushed a commit that referenced this issue Feb 8, 2021
Here are the current top 10 compile time offenders
```
2107683 CMakeFiles/cudf_reductions.dir/src/reductions/any.cu.o
2106547 CMakeFiles/cudf_reductions.dir/src/reductions/all.cu.o
1538794 CMakeFiles/cudf_reductions.dir/src/reductions/sum_of_squares.cu.o
1522533 CMakeFiles/cudf_reductions.dir/src/reductions/product.cu.o
1519147 CMakeFiles/cudf_reductions.dir/src/reductions/sum.cu.o
1188127 CMakeFiles/cudf_base.dir/src/groupby/sort/group_sum.cu.o
1006601 CMakeFiles/cudf_base.dir/src/groupby/hash/groupby.cu.o
 789776 CMakeFiles/cudf_reductions.dir/src/reductions/mean.cu.o
 651817 CMakeFiles/cudf_join.dir/src/join/semi_join.cu.o
 539513 CMakeFiles/cudf_hash.dir/src/hash/hashing.cu.o
...
```
Times are in milliseconds so `any.cu` and `all.cu` take 35 minutes each to build on my machine with CUDA 10.1.

The times have increased with the addition of dictionary and fixed-point types support. The large times are directly related to some aggressive inlining of the iterators in the `cub::DeviceReduce::Reduce` used by all the reduction aggregations. For small iterators, this is not an issue. The dictionary iterator is more complex since it must type-dispatch the indices and then access the keys data. The code is very fast but causes large compile times when used by CUB Reduce.

This PR creates new specialization logic for dictionary columns to call `thrust::all_of` for `all()` and and `thrust::any_of` for `any()` instead of CUB Reduce. This reduces the compile time significantly with little effect on the runtime. In fact, the thrust algorithms appear to have an _early-out_ feature which can be faster than a generic reduce depending on the data.

The compile time for `any.cu` and `all.cu` is now around 3 minutes each.

Also in this PR, I've changed the `dictionary_pair_iterator` to convert the `has_nulls` template parameter to runtime parameter. This adds very little overhead to the iterator but improves the compile time for all the other reductions source files. A more general process for applying this to other iterators and operators is mentioned in #6952 The compile time for the other reductions source files is now about half their original time.

Finally, this PR includes gbenchmarks for dictionary columns in reduction operations. These were necessary to measure how changes impacted the runtime.

Authors:
  - David (@davidwendt)

Approvers:
  - Jake Hemstad (@jrhemstad)
  - @nvdbaranec

URL: #7242
@github-actions
Copy link

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@github-actions github-actions bot added the stale label Feb 16, 2021
robertmaynard added a commit to robertmaynard/cudf that referenced this issue Apr 8, 2021
This is a first step in fixing issues brought up in rapidsai#6952 and rapidsai#7573.

The iterator produces `thrust::optional<T>` to better represent nullable
column elements or scalars.

`make_optional_iterator` supports three different `contains_null` modes:

    - `YES` means that the column supports nulls and has null values, therefore
     the optional might not contain a value

    - `NO` means that the column has no null values, therefore the optional will
     always have a value

    - `DYNAMIC` defers the assumption of nullability to runtime with the users stating
     on construction of the iterator if column has nulls.
rapids-bot bot pushed a commit that referenced this issue Apr 20, 2021
…7772)

Introduces `make_optional_iterator` for nullable column and scalars, as the first step in fixing issues brought up in #6952 and #7573.

The iterator produces `thrust::optional<T>` to better represent nullable column elements and scalars.

`make_optional_iterator` supports three different `contains_null` modes:

    - `YES` means that the column supports nulls and has null values, therefore
     the optional might not contain a value

    - `NO` means that the column has no null values, therefore the optional will
     always have a value

    - `DYNAMIC` defers the assumption of nullability to runtime with the users stating
     on construction of the iterator if column has nulls.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Paul Taylor (https://github.com/trxcllnt)
  - David Wendt (https://github.com/davidwendt)

URL: #7772
@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@davidwendt davidwendt self-assigned this Nov 2, 2021
@rapids-bot rapids-bot bot closed this as completed in #9623 Dec 6, 2021
rapids-bot bot pushed a commit that referenced this issue Dec 6, 2021
Closes #6952 

This PR allows the `has_nulls` template parameter for row operators to be used a runtime parameter in places where the null-handling logic has little to no affect on runtime performance. 
This can improve compile time as described in #6952.

This will also close #9152 and #9580

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Nghia Truong (https://github.com/ttnghia)
  - Conor Hoekstra (https://github.com/codereport)

URL: #9623
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants