[FEA] Move has_nulls functor template argument to member variable in row comparators and row hashers #6952

davidwendt · 2020-12-09T15:09:38Z

Currently, many libcudf column APIs are compiled into two paths -- one handles nulls and the other expects no nulls. The no-nulls path is generally faster because it removes the cost of accessing the validity bitmask. To create these two paths, we generally use a template argument on the functor or function mostly consistently named has_nulls which is usually determined early in the API code logic.

Using a template argument means we can create one source code that includes handling the null logic surrounded by an if(has_nulls) statement. The compiler will then create the two paths for us from the single source. We only need to invoke them individually as appropriate. What's more, the compiler optimizer will actually remove the entire if(has_nulls) statement in the has_nulls==false compiled kernel path.

Example of `has_nulls` templated functor invocation

Here is an example of invoking a has_nulls templated functor in cpp/src/stream_compaction/drop_duplicates.cu:

cudf/cpp/src/stream_compaction/drop_duplicates.cu

Lines 145 to 173 in b45fd4d

    
           if (cudf::has_nulls(keys)) { 
        
             auto comp = row_equality_comparator<true>( 
        
               *device_input_table, *device_input_table, nulls_equal == null_equality::EQUAL); 
        
             auto result_end = unique_copy(sorted_indices->view().begin<cudf::size_type>(), 
        
                                           sorted_indices->view().end<cudf::size_type>(), 
        
                                           unique_indices.begin<cudf::size_type>(), 
        
                                           comp, 
        
                                           keep, 
        
                                           stream); 
        
             return cudf::detail::slice( 
        
               column_view(unique_indices), 
        
               0, 
        
               thrust::distance(unique_indices.begin<cudf::size_type>(), result_end)); 
        
           } else { 
        
             auto comp = row_equality_comparator<false>( 
        
               *device_input_table, *device_input_table, nulls_equal == null_equality::EQUAL); 
        
             auto result_end = unique_copy(sorted_indices->view().begin<cudf::size_type>(), 
        
                                           sorted_indices->view().end<cudf::size_type>(), 
        
                                           unique_indices.begin<cudf::size_type>(), 
        
                                           comp, 
        
                                           keep, 
        
                                           stream); 
        
             return cudf::detail::slice( 
        
               column_view(unique_indices), 
        
               0, 
        
               thrust::distance(unique_indices.begin<cudf::size_type>(), result_end)); 
        
           }

Note the only difference between the if and else clauses is the instantiation of the row_equality_comparator with the template parameter.

The row_equality_comparator is defined in the cpp/include/cudf/table/row_operators.cuh:

cudf/cpp/include/cudf/table/row_operators.cuh

Lines 204 to 206 in b45fd4d

    
           template <bool has_nulls = true> 
        
           class row_equality_comparator { 
        
            public:

The has_nulls template argument is propagated the element_equality_comparator which has the if(has_nulls) statement.

Some of these invocations can be large and somewhat error prone.
https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/src/sort/sort_impl.cuh#L68-L97
https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/src/hash/hashing.cu#L138-L174
https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/src/search/search.cu#L110-L142
https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/src/groupby/sort/group_nunique.cu#L49-L90
Granted, some of these are only a few lines of code and clang formatting perhaps makes them look larger.

All of this means we can create a fast-path for non-null column cases but at the cost of compile time and size since much of the code is duplicated for us by the compiler. In some places, the null handling is a minimal set of instructions compared to the overall kernel code size. Also, in APIs that operate on tables (e.g. sort), a single null in any column in the table will cause the API to execute through the null-handling path for all tables/columns.

The dual compile path effects some of our biggest compile time offenders.

Here are the current top 30 source files ordered by compile time

1265968 CMakeFiles/cudf_base.dir/src/sort/sort.cu.o
1259518 CMakeFiles/cudf_base.dir/src/sort/stable_sort.cu.o
 991023 CMakeFiles/cudf_reductions.dir/src/reductions/all.cu.o
 983376 CMakeFiles/cudf_reductions.dir/src/reductions/any.cu.o
 728204 CMakeFiles/cudf_reductions.dir/src/reductions/sum_of_squares.cu.o
 727802 CMakeFiles/cudf_reductions.dir/src/reductions/product.cu.o
 718036 CMakeFiles/cudf_reductions.dir/src/reductions/sum.cu.o
 516079 CMakeFiles/cudf_rolling.dir/src/rolling/rolling.cu.o
 486964 CMakeFiles/cudf_reductions.dir/src/reductions/mean.cu.o
 444853 CMakeFiles/cudf_reductions.dir/src/reductions/scan.cu.o
 442416 CMakeFiles/cudf_join.dir/src/join/semi_join.cu.o
 414210 CMakeFiles/cudf_base.dir/src/groupby/hash/groupby.cu.o
 396110 CMakeFiles/cudf_hash.dir/src/hash/hashing.cu.o
 383931 CMakeFiles/cudf_base.dir/src/strings/sorting/sorting.cu.o
 304577 CMakeFiles/cudf_reductions.dir/src/reductions/std.cu.o
 300738 CMakeFiles/cudf_base.dir/src/stream_compaction/distinct_count.cu.o
 298303 CMakeFiles/cudf_reductions.dir/src/reductions/var.cu.o
 268265 CMakeFiles/cudf_base.dir/src/search/search.cu.o
 256580 CMakeFiles/cudf_base.dir/src/unary/math_ops.cu.o
 247038 CMakeFiles/cudf_base.dir/src/groupby/sort/group_nunique.cu.o
 222540 CMakeFiles/cudf_base.dir/src/groupby/sort/sort_helper.cu.o
 219502 CMakeFiles/cudf_base.dir/src/quantiles/quantile.cu.o
 204364 CMakeFiles/cudf_base.dir/src/sort/rank.cu.o
 203926 CMakeFiles/cudf_base.dir/src/stream_compaction/drop_duplicates.cu.o
 188213 CMakeFiles/cudf_base.dir/src/sort/is_sorted.cu.o
 185831 CMakeFiles/cudf_ast.dir/src/ast/transform.cu.o
 182763 CMakeFiles/cudf_base.dir/src/copying/copy.cu.o
 166152 CMakeFiles/cudf_base.dir/src/copying/gather.cu.o
 157837 CMakeFiles/cudf_partitioning.dir/src/partitioning/partitioning.cu.o
 131314 tests/CMakeFiles/cudftestutil.dir/utilities/column_utilities.cu.o
 ...

The first column is time in milliseconds from a ninja trace on my desktop using g++7 and CUDA 11.0.

Ignoring the current re-ballooning of the reduction APIs, the sort and hashing source files are still very slow. These use the templated comparators and hashers defined in the cpp/include/cudf/table/row_comparators.cuh

I propose to move some of the instances of the has_nulls template parameter to a functor member variable or function parameter as appropriate. Overall this means that has_nulls would be checked at runtime instead of compile time but the runtime if-statement only introduces a extra kernel instruction in the has_nulls path. Since the has_nulls value is set early in the logic before the kernel is launched, the kernel should therefore incur no divergence with other threads.

I prototyped moving has_nulls in the row comparators and row hashers on my local machine which required updating about 20 source files. I compared the outputs for gbenchmarks tests for sort, hash, merge, search, and join and found no signficant change in performance. Most of the benchmarks do not include nulls so they were a good measure of the impact of the extra instruction. (I can attach the benchmark results if necessary).

Results

Top 30 compile times with has_nulls converted to a runtime parameter.

 994075 CMakeFiles/cudf_reductions.dir/src/reductions/any.cu.o
 970520 CMakeFiles/cudf_reductions.dir/src/reductions/all.cu.o
 949222 CMakeFiles/cudf_base.dir/src/sort/sort.cu.o
 944946 CMakeFiles/cudf_base.dir/src/sort/stable_sort.cu.o
 728829 CMakeFiles/cudf_reductions.dir/src/reductions/sum_of_squares.cu.o
 728487 CMakeFiles/cudf_reductions.dir/src/reductions/sum.cu.o
 727735 CMakeFiles/cudf_reductions.dir/src/reductions/product.cu.o
 500829 CMakeFiles/cudf_rolling.dir/src/rolling/rolling.cu.o
 487375 CMakeFiles/cudf_join.dir/src/join/semi_join.cu.o
 473684 CMakeFiles/cudf_reductions.dir/src/reductions/mean.cu.o
 452647 CMakeFiles/cudf_reductions.dir/src/reductions/scan.cu.o
 391825 CMakeFiles/cudf_hash.dir/src/hash/hashing.cu.o
 391768 CMakeFiles/cudf_base.dir/src/strings/sorting/sorting.cu.o
 311602 CMakeFiles/cudf_reductions.dir/src/reductions/var.cu.o
 290547 CMakeFiles/cudf_reductions.dir/src/reductions/std.cu.o
 259011 CMakeFiles/cudf_base.dir/src/unary/math_ops.cu.o
 257266 CMakeFiles/cudf_base.dir/src/groupby/hash/groupby.cu.o
 249444 CMakeFiles/cudf_base.dir/src/search/search.cu.o
 219493 CMakeFiles/cudf_base.dir/src/stream_compaction/distinct_count.cu.o
 214205 CMakeFiles/cudf_base.dir/src/quantiles/quantile.cu.o
 185999 CMakeFiles/cudf_ast.dir/src/ast/transform.cu.o
 180346 CMakeFiles/cudf_base.dir/src/stream_compaction/drop_duplicates.cu.o
 179347 CMakeFiles/cudf_base.dir/src/copying/copy.cu.o
 161075 CMakeFiles/cudf_base.dir/src/copying/gather.cu.o
 152886 CMakeFiles/cudf_base.dir/src/groupby/sort/group_nunique.cu.o
 149715 CMakeFiles/cudf_base.dir/src/sort/rank.cu.o
 144347 CMakeFiles/cudf_base.dir/src/groupby/sort/sort_helper.cu.o
 143281 CMakeFiles/cudf_partitioning.dir/src/partitioning/partitioning.cu.o
 133652 CMakeFiles/cudf_base.dir/src/sort/is_sorted.cu.o
 130779 tests/CMakeFiles/cudftestutil.dir/utilities/column_utilities.cu.o

The overall improvement in compile time is about 7.5 minutes (30 min -> 22.5 min).
The compile size of libcudf_base.so is reduced by about 33MB (287MB -> 254MB).

Only cudf::merge showed any significant performance drop but since merge implements its own comparator functors, re-instating it's has_nulls template implementation was not difficult. And merge.cu is not a significant compile time concern (ranked at 46). This means, globally removing this kind of parameter may not be necessary. My proposal is to remove them only from the row comparators and row hashers as appropriate for now.

This is a non-breaking change since it only effects internal functors and kernels functions.

The text was updated successfully, but these errors were encountered:

@davidwendt

Here are the current top 10 compile time offenders ``` 2107683 CMakeFiles/cudf_reductions.dir/src/reductions/any.cu.o 2106547 CMakeFiles/cudf_reductions.dir/src/reductions/all.cu.o 1538794 CMakeFiles/cudf_reductions.dir/src/reductions/sum_of_squares.cu.o 1522533 CMakeFiles/cudf_reductions.dir/src/reductions/product.cu.o 1519147 CMakeFiles/cudf_reductions.dir/src/reductions/sum.cu.o 1188127 CMakeFiles/cudf_base.dir/src/groupby/sort/group_sum.cu.o 1006601 CMakeFiles/cudf_base.dir/src/groupby/hash/groupby.cu.o 789776 CMakeFiles/cudf_reductions.dir/src/reductions/mean.cu.o 651817 CMakeFiles/cudf_join.dir/src/join/semi_join.cu.o 539513 CMakeFiles/cudf_hash.dir/src/hash/hashing.cu.o ... ``` Times are in milliseconds so `any.cu` and `all.cu` take 35 minutes each to build on my machine with CUDA 10.1. The times have increased with the addition of dictionary and fixed-point types support. The large times are directly related to some aggressive inlining of the iterators in the `cub::DeviceReduce::Reduce` used by all the reduction aggregations. For small iterators, this is not an issue. The dictionary iterator is more complex since it must type-dispatch the indices and then access the keys data. The code is very fast but causes large compile times when used by CUB Reduce. This PR creates new specialization logic for dictionary columns to call `thrust::all_of` for `all()` and and `thrust::any_of` for `any()` instead of CUB Reduce. This reduces the compile time significantly with little effect on the runtime. In fact, the thrust algorithms appear to have an _early-out_ feature which can be faster than a generic reduce depending on the data. The compile time for `any.cu` and `all.cu` is now around 3 minutes each. Also in this PR, I've changed the `dictionary_pair_iterator` to convert the `has_nulls` template parameter to runtime parameter. This adds very little overhead to the iterator but improves the compile time for all the other reductions source files. A more general process for applying this to other iterators and operators is mentioned in #6952 The compile time for the other reductions source files is now about half their original time. Finally, this PR includes gbenchmarks for dictionary columns in reduction operations. These were necessary to measure how changes impacted the runtime. Authors: - David (@davidwendt) Approvers: - Jake Hemstad (@jrhemstad) - @nvdbaranec URL: #7242

github-actions · 2021-02-16T20:20:13Z

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

This is a first step in fixing issues brought up in rapidsai#6952 and rapidsai#7573. The iterator produces `thrust::optional<T>` to better represent nullable column elements or scalars. `make_optional_iterator` supports three different `contains_null` modes: - `YES` means that the column supports nulls and has null values, therefore the optional might not contain a value - `NO` means that the column has no null values, therefore the optional will always have a value - `DYNAMIC` defers the assumption of nullability to runtime with the users stating on construction of the iterator if column has nulls.

…7772) Introduces `make_optional_iterator` for nullable column and scalars, as the first step in fixing issues brought up in #6952 and #7573. The iterator produces `thrust::optional<T>` to better represent nullable column elements and scalars. `make_optional_iterator` supports three different `contains_null` modes: - `YES` means that the column supports nulls and has null values, therefore the optional might not contain a value - `NO` means that the column has no null values, therefore the optional will always have a value - `DYNAMIC` defers the assumption of nullability to runtime with the users stating on construction of the iterator if column has nulls. Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Jake Hemstad (https://github.com/jrhemstad) - Paul Taylor (https://github.com/trxcllnt) - David Wendt (https://github.com/davidwendt) URL: #7772

github-actions · 2021-05-17T21:04:49Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

Closes #6952 This PR allows the `has_nulls` template parameter for row operators to be used a runtime parameter in places where the null-handling logic has little to no affect on runtime performance. This can improve compile time as described in #6952. This will also close #9152 and #9580 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Jake Hemstad (https://github.com/jrhemstad) - Nghia Truong (https://github.com/ttnghia) - Conor Hoekstra (https://github.com/codereport) URL: #9623

davidwendt added feature request New feature or request Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. labels Dec 9, 2020

kkraus14 removed the Needs Triage Need team to review and classify label Dec 10, 2020

davidwendt mentioned this issue Jan 28, 2021

Refactor dictionary support for reductions any/all #7242

Merged

github-actions bot added the stale label Feb 16, 2021

davidwendt mentioned this issue Mar 11, 2021

remove has_nulls template param in pair_begin, make_pair_iterator #7573

Closed

robertmaynard mentioned this issue Mar 31, 2021

[REVIEW] Introduces make_optional_iterator for nullable column and scalars #7772

Merged

github-actions bot added the inactive-90d label May 17, 2021

davidwendt mentioned this issue Nov 2, 2021

[FEA] Enable dynamic "has_nulls" for row operators #9580

Closed

davidwendt self-assigned this Nov 2, 2021

davidwendt mentioned this issue Nov 8, 2021

Allow runtime has_nulls parameter for row operators #9623

Merged

rapids-bot bot closed this as completed in #9623 Dec 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Move has_nulls functor template argument to member variable in row comparators and row hashers #6952

[FEA] Move has_nulls functor template argument to member variable in row comparators and row hashers #6952

davidwendt commented Dec 9, 2020 •

edited

Loading

github-actions bot commented Feb 16, 2021

github-actions bot commented May 17, 2021

[FEA] Move has_nulls functor template argument to member variable in row comparators and row hashers #6952

[FEA] Move has_nulls functor template argument to member variable in row comparators and row hashers #6952

Comments

davidwendt commented Dec 9, 2020 • edited Loading

github-actions bot commented Feb 16, 2021

github-actions bot commented May 17, 2021

davidwendt commented Dec 9, 2020 •

edited

Loading