[FEA] Move has_nulls functor template argument to member variable in row comparators and row hashers #6952
Here are the current top 10 compile time offenders

```
2107683 CMakeFiles/cudf_reductions.dir/src/reductions/any.cu.o
2106547 CMakeFiles/cudf_reductions.dir/src/reductions/all.cu.o
1538794 CMakeFiles/cudf_reductions.dir/src/reductions/sum_of_squares.cu.o
1522533 CMakeFiles/cudf_reductions.dir/src/reductions/product.cu.o
1519147 CMakeFiles/cudf_reductions.dir/src/reductions/sum.cu.o
1188127 CMakeFiles/cudf_base.dir/src/groupby/sort/group_sum.cu.o
1006601 CMakeFiles/cudf_base.dir/src/groupby/hash/groupby.cu.o
789776 CMakeFiles/cudf_reductions.dir/src/reductions/mean.cu.o
651817 CMakeFiles/cudf_join.dir/src/join/semi_join.cu.o
539513 CMakeFiles/cudf_hash.dir/src/hash/hashing.cu.o
...
```

Times are in milliseconds, so `any.cu` and `all.cu` take 35 minutes each to build on my machine with CUDA 10.1. The times have increased with the addition of dictionary and fixed-point type support.

The large times are directly related to some aggressive inlining of the iterators in the `cub::DeviceReduce::Reduce` used by all the reduction aggregations. For small iterators, this is not an issue. The dictionary iterator is more complex since it must type-dispatch the indices and then access the keys data. The code is very fast but causes large compile times when used by CUB Reduce.

This PR creates new specialization logic for dictionary columns to call `thrust::all_of` for `all()` and `thrust::any_of` for `any()` instead of CUB Reduce. This reduces the compile time significantly with little effect on the runtime. In fact, the thrust algorithms appear to have an _early-out_ feature which can be faster than a generic reduce depending on the data. The compile time for `any.cu` and `all.cu` is now around 3 minutes each.

Also in this PR, I've changed the `dictionary_pair_iterator` to convert the `has_nulls` template parameter to a runtime parameter. This adds very little overhead to the iterator but improves the compile time for all the other reductions source files. A more general process for applying this to other iterators and operators is mentioned in #6952. The compile time for the other reductions source files is now about half their original time.

Finally, this PR includes gbenchmarks for dictionary columns in reduction operations. These were necessary to measure how changes impacted the runtime.

Authors:
- David (@davidwendt)

Approvers:
- Jake Hemstad (@jrhemstad)
- @nvdbaranec

URL: #7242
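The _early-out_ behaviour mentioned in the comment above can be illustrated on the host with the standard library. This is only a sketch: `std::all_of` and `std::accumulate` stand in for `thrust::all_of` and a generic reduce, and the helper names `visits_with_reduce` and `visits_with_all_of` are hypothetical. It shows nothing about how Thrust or CUB are actually implemented; it only counts how many elements each approach inspects.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: logical-AND over all elements using a generic reduction.
// The reduction has no way to stop early, so the lambda runs once per element.
std::size_t visits_with_reduce(std::vector<int> const& data, bool& result)
{
  std::size_t visited = 0;
  result = std::accumulate(data.begin(), data.end(), true, [&](bool acc, int v) {
    ++visited;
    return acc && (v != 0);
  });
  return visited;
}

// Hypothetical helper: the same question asked through std::all_of, which may
// return as soon as one element fails the predicate.
std::size_t visits_with_all_of(std::vector<int> const& data, bool& result)
{
  std::size_t visited = 0;
  result = std::all_of(data.begin(), data.end(), [&](int v) {
    ++visited;
    return v != 0;
  });
  return visited;
}
```

For an input whose first false element appears early, the reduction still touches every element while `std::all_of` in common standard library implementations stops at the first failure, which is the data-dependent advantage the comment describes.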
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
This is a first step in fixing issues brought up in rapidsai#6952 and rapidsai#7573. The iterator produces `thrust::optional<T>` to better represent nullable column elements or scalars.

`make_optional_iterator` supports three different `contains_null` modes:
- `YES` means that the column supports nulls and has null values, therefore the optional might not contain a value
- `NO` means that the column has no null values, therefore the optional will always have a value
- `DYNAMIC` defers the assumption of nullability to runtime, with the user stating at construction of the iterator whether the column has nulls.
…7772)

Introduces `make_optional_iterator` for nullable columns and scalars, as the first step in fixing issues brought up in #6952 and #7573. The iterator produces `thrust::optional<T>` to better represent nullable column elements and scalars.

`make_optional_iterator` supports three different `contains_null` modes:
- `YES` means that the column supports nulls and has null values, therefore the optional might not contain a value
- `NO` means that the column has no null values, therefore the optional will always have a value
- `DYNAMIC` defers the assumption of nullability to runtime, with the user stating at construction of the iterator whether the column has nulls.

Authors:
- Robert Maynard (https://github.com/robertmaynard)

Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- Paul Taylor (https://github.com/trxcllnt)
- David Wendt (https://github.com/davidwendt)

URL: #7772
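The three modes listed above can be pictured with a small host-side sketch. Everything here is an assumption made for illustration: `toy_column`, the tag types `contains_nulls_yes`, `contains_nulls_no`, `contains_nulls_dynamic`, and the `element` overloads are hypothetical, and `std::optional` stands in for `thrust::optional`. It is not the libcudf implementation of `make_optional_iterator`.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical tags mirroring the YES / NO / DYNAMIC modes described above.
struct contains_nulls_yes {};
struct contains_nulls_no {};
struct contains_nulls_dynamic {
  bool has_nulls;  // supplied by the caller at construction time
};

// Hypothetical toy column: values plus one validity flag per element.
struct toy_column {
  std::vector<int> values;
  std::vector<bool> valid;
};

// YES: validity is always consulted, so the optional may be empty.
std::optional<int> element(toy_column const& col, std::size_t i, contains_nulls_yes)
{
  if (!col.valid[i]) { return std::nullopt; }
  return col.values[i];
}

// NO: validity is never read, so the optional always holds a value.
std::optional<int> element(toy_column const& col, std::size_t i, contains_nulls_no)
{
  return col.values[i];
}

// DYNAMIC: one code path; the caller's runtime flag decides whether to read validity.
std::optional<int> element(toy_column const& col, std::size_t i, contains_nulls_dynamic mode)
{
  if (mode.has_nulls && !col.valid[i]) { return std::nullopt; }
  return col.values[i];
}
```

In this sketch the `YES` and `NO` overloads correspond to two separately compiled paths, while `DYNAMIC` trades one runtime check for a single compiled path.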
Closes #6952

This PR allows the `has_nulls` template parameter for row operators to be used as a runtime parameter in places where the null-handling logic has little to no effect on runtime performance. This can improve compile time as described in #6952. This will also close #9152 and #9580.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- Nghia Truong (https://github.com/ttnghia)
- Conor Hoekstra (https://github.com/codereport)

URL: #9623
Currently, many libcudf column APIs are compiled into two paths: one handles nulls and the other expects no nulls. The no-nulls path is generally faster because it removes the cost of accessing the validity bitmask. To create these two paths, we generally use a template argument on the functor or function, mostly consistently named `has_nulls`, which is usually determined early in the API code logic.

Using a template argument means we can create one source code that includes handling the null logic surrounded by an `if(has_nulls)` statement. The compiler will then create the two paths for us from the single source. We only need to invoke them individually as appropriate. What's more, the compiler optimizer will actually remove the entire `if(has_nulls)` statement in the `has_nulls==false` compiled kernel path.

Example of `has_nulls` templated functor invocation
Here is an example of invoking a `has_nulls` templated functor in `cpp/src/stream_compaction/drop_duplicates.cu`:

cudf/cpp/src/stream_compaction/drop_duplicates.cu, lines 145 to 173 in b45fd4d
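The referenced lines are not reproduced here. As a rough picture of the pattern only, the following host-side sketch uses hypothetical names throughout (`toy_column`, `toy_row_equality`, `count_equal_rows`, `count_equal`) and assumes that two nulls compare equal; it is not the code from `drop_duplicates.cu`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy column: values plus one validity flag per element.
struct toy_column {
  std::vector<int> values;
  std::vector<bool> valid;
};

// Hypothetical comparator with has_nulls as a template parameter.
// In the has_nulls == false instantiation the optimizer can drop the whole
// null-handling branch because the condition is a compile-time constant.
template <bool has_nulls>
struct toy_row_equality {
  toy_column const& lhs;
  toy_column const& rhs;

  bool operator()(std::size_t i, std::size_t j) const
  {
    if (has_nulls) {
      bool const lhs_valid = lhs.valid[i];
      bool const rhs_valid = rhs.valid[j];
      // Assumption for this sketch: null == null, null != non-null.
      if (!lhs_valid || !rhs_valid) { return lhs_valid == rhs_valid; }
    }
    return lhs.values[i] == rhs.values[j];
  }
};

// Stand-in for the algorithm or kernel that consumes the comparator.
template <typename Comparator>
std::size_t count_equal_rows(std::size_t num_rows, Comparator comp)
{
  std::size_t count = 0;
  for (std::size_t i = 0; i < num_rows; ++i) {
    if (comp(i, i)) { ++count; }
  }
  return count;
}

// Call site: the same call is written twice, once per template instantiation.
std::size_t count_equal(toy_column const& a, toy_column const& b, bool nullable)
{
  if (nullable) {
    return count_equal_rows(a.values.size(), toy_row_equality<true>{a, b});
  } else {
    return count_equal_rows(a.values.size(), toy_row_equality<false>{a, b});
  }
}
```

Both branches of `count_equal` instantiate `count_equal_rows` separately, which is where the duplicated compiled code comes from.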
Note the only difference between the `if` and `else` clauses is the instantiation of the `row_equality_comparator` with the template parameter.

The `row_equality_comparator` is defined in `cpp/include/cudf/table/row_operators.cuh`:

cudf/cpp/include/cudf/table/row_operators.cuh, lines 204 to 206 in b45fd4d

The `has_nulls` template argument is propagated to the `element_equality_comparator`, which has the `if(has_nulls)` statement.

Some of these invocations can be large and somewhat error prone.
https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/src/sort/sort_impl.cuh#L68-L97
https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/src/hash/hashing.cu#L138-L174
https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/src/search/search.cu#L110-L142
https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/src/groupby/sort/group_nunique.cu#L49-L90
Granted, some of these are only a few lines of code and clang formatting perhaps makes them look larger.
All of this means we can create a fast-path for non-null column cases but at the cost of compile time and size since much of the code is duplicated for us by the compiler. In some places, the null handling is a minimal set of instructions compared to the overall kernel code size. Also, in APIs that operate on tables (e.g. sort), a single null in any column in the table will cause the API to execute through the null-handling path for all tables/columns.
The dual compile path affects some of our biggest compile time offenders.
Here are the current top 30 source files ordered by compile time
The first column is time in milliseconds from a ninja trace on my desktop using g++7 and CUDA 11.0.
Ignoring the current re-ballooning of the reduction APIs, the sort and hashing source files are still very slow. These use the templated comparators and hashers defined in `cpp/include/cudf/table/row_operators.cuh`.
I propose to move some of the instances of the `has_nulls` template parameter to a functor member variable or function parameter as appropriate. Overall this means that `has_nulls` would be checked at runtime instead of compile time, but the runtime if-statement only introduces an extra kernel instruction in the `has_nulls` path. Since the `has_nulls` value is set early in the logic before the kernel is launched, the kernel should therefore incur no divergence with other threads.

I prototyped moving `has_nulls` in the row comparators and row hashers on my local machine, which required updating about 20 source files. I compared the outputs of the gbenchmarks for sort, hash, merge, search, and join and found no significant change in performance. Most of the benchmarks do not include nulls so they were a good measure of the impact of the extra instruction. (I can attach the benchmark results if necessary.)

Results
Top 30 compile times with `has_nulls` converted to a runtime parameter.

The overall improvement in compile time is about 7.5 minutes (30 min -> 22.5 min).
The compile size of `libcudf_base.so` is reduced by about 33MB (287MB -> 254MB).

Only `cudf::merge` showed any significant performance drop, but since merge implements its own comparator functors, reinstating its `has_nulls` template implementation was not difficult. And `merge.cu` is not a significant compile time concern (ranked at 46). This means globally removing this kind of parameter may not be necessary. My proposal is to remove them only from the row comparators and row hashers as appropriate for now.

This is a non-breaking change since it only affects internal functors and kernel functions.