Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster struct row comparator #10164

Merged
merged 58 commits into from
Mar 22, 2022
Merged

Conversation

devavret
Copy link
Contributor

@devavret devavret commented Jan 28, 2022

The existing row_lexicographical_comparator cannot compare struct columns, so the current solution is to flatten a struct column with pre-order traversal. This involves creating a bool column for each struct level. e.g. for a struct of the following shape

Struct(1)<int, Struct(2)<float, string>>

we would generate columns like this:

[bool(Struct(1)), int, bool(Struct(2)), float, string]

The reason this is done is because struct traversal in row comparator would require recursion, which is prohibitively expensive on the GPU because stack size cannot be determined at compile time. An alternative was also explored as part of my current effort.[1]

The proposed solution is to "verticalize" (please suggest a better name) the struct columns. This means the struct columns are converted into a format that does not require a stack storage and traversing it will require a state with fixed storage. For the above example struct, the conversion would yield 3 columns:

[Struct(1)<int>, Struct(1)<Struct(2)<float>>, Struct(1)<Struct(2)<string>>]

Using this with row comparator required adding a loop that traverses down the hierarchy and only checks for nulls at the struct level. Since the hierarchy is guaranteed to have only one child, there is no stack required to keep track of the location in the hierarchy.

Further, it can be shown that the Parents that have appeared once in the transformed columns need not appear again because in a lexicographical comparison, they'd already have been compared. Thus the final transformed columns can look like this:

[Struct(1)<int>, Struct(2)<float>, string]

This approach has 2 benefits:

  1. The new transformation does not require the use of extra memory. The new views can be constructed from data and nullmask pointers from old views.
  2. Due to reading less data from device memory, sorting is faster by at least 34% but gets better with struct depth. Benchmark arguments: num_rows {1<<24, 1<<26}, depth {1, 8}
Comparing benchmarks/COMPARE_BENCH to benchmarks/COMPARE_BENCH_new
Benchmark                                                     Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------------
Sort<false>/unstable/16777216/1/manual_time                -0.3417         -0.3408            60            39            60            39
Sort<false>/unstable/67108864/1/manual_time                -0.3471         -0.3471           243           159           243           159
Sort<false>/unstable/16777216/8/manual_time                -0.6201         -0.6201           444           169           444           169
Sort<false>/unstable/67108864/8/manual_time                -0.6290         -0.6290          1776           659          1776           659

[1] The alternative was to convert recursion to iteration by constructing a manually controlled call stack with stack memory backed storage. This would be limited by the stack memory and was found to be more expensive than the current approach. The code for this is in row_operators2.cuh

API changes

This PR adds an owning type self_comparator that takes a table_view and preprocesses it as mentioned and stores the necessary device objects needed for comparison. The owning type then provides a functor for use on the device.

Another owning type is added called preprocessed_table which can also be constructed from table_view and does the same preprocessing. self_comparator can also be constructed from a preprocessed_table. It is useful when trying to use the same preprocessed table in different comparators.

@github-actions github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels Jan 28, 2022
@devavret devavret requested a review from jrhemstad January 28, 2022 23:09
@jrhemstad
Copy link
Contributor

The transform still requires new memory for the purpose of holding the result of superimpose_parent_nulls()

Why do we need superimpose_parent_nulls in verticalization? I recall from when you first showed me this that it wouldn't require allocating new memory. Is this because of stripping off parents that have already appeared so you need to push the nulls down?

@devavret
Copy link
Contributor Author

Is this because of stripping off parents that have already appeared so you need to push the nulls down?

Yes.

@jrhemstad
Copy link
Contributor

Is this because of stripping off parents that have already appeared so you need to push the nulls down?

Yes.

What's the overhead of just keeping those parents around?

@devavret
Copy link
Contributor Author

Is this because of stripping off parents that have already appeared so you need to push the nulls down?

Yes.

What's the overhead of just keeping those parents around?

Cannot generalize but for a Struct of depth 8, (S<S<S<S<S<S<S<S<int, int>>>>>>>>) the runtime increases from 770ms to 1040ms. Compared to 2223ms for existing flattening method (I reduced the value distribution to create more cases where checking the second column would be required)

@jrhemstad
Copy link
Contributor

Cannot generalize but for a Struct of depth 8, (S<S<S<S<S<S<S<S<int, int>>>>>>>>) the runtime increases from 770ms to 1040ms.

Does that include the superimpose part of the process? And what's the memory overhead?

@codecov
Copy link

codecov bot commented Jan 29, 2022

Codecov Report

Merging #10164 (be37199) into branch-22.04 (0be0b00) will increase coverage by 0.03%.
The diff coverage is 96.06%.

❗ Current head be37199 differs from pull request most recent head 9bfd08e. Consider uploading reports for the commit 9bfd08e to get more accurate results

@@               Coverage Diff                @@
##           branch-22.04   #10164      +/-   ##
================================================
+ Coverage         86.13%   86.17%   +0.03%     
================================================
  Files               139      141       +2     
  Lines             22438    22501      +63     
================================================
+ Hits              19328    19391      +63     
  Misses             3110     3110              
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/tests/test_accessor.py 98.41% <ø> (ø)
python/cudf/cudf/core/_base_index.py 85.92% <50.00%> (-0.51%) ⬇️
python/cudf/cudf/core/column/decimal.py 91.30% <73.68%> (-1.01%) ⬇️
python/cudf/cudf/core/column/categorical.py 89.63% <84.61%> (-0.29%) ⬇️
python/dask_cudf/dask_cudf/io/text.py 85.29% <85.29%> (ø)
python/cudf/cudf/core/column/string.py 88.91% <94.44%> (+0.64%) ⬆️
python/cudf/cudf/core/column/numerical.py 95.62% <95.83%> (+0.64%) ⬆️
python/cudf/cudf/core/frame.py 91.84% <96.42%> (+0.12%) ⬆️
python/cudf/cudf/_typing.py 94.11% <100.00%> (+0.78%) ⬆️
python/cudf/cudf/api/types.py 89.79% <100.00%> (ø)
... and 32 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 61772d8...9bfd08e. Read the comment docs.

@devavret
Copy link
Contributor Author

Does that include the superimpose part of the process?

All prep work (superimpose / nullmask to bool of old method etc.) are a very small part of the total time taken in is sorting use case. 5 ms out of the total 1776 ms in case of old method (superimpose + nullmask to bool) and 1 ms out of 659 ms for new method (superimpose only). This may vary in lighter, non-sort use cases.

And what's the memory overhead?

That is just same as the byte size of the struct's nullmask. For the same 8 level deep struct column of length 1<<26, whose size in bytes is 320 MB (256 MB data + 8x8MB) we allocate an additional 64 MB. Compared to 512 MB for the additional nulls to bools. But this depends heavily on the struct's shape. All the null masks are going to be duplicated in byte size basically.

I'm thinking about a method that can avoid superimpose nulls entirely but it would also require a device allocation for a small num_columns sized array.

cpp/benchmarks/sort/sort_structs.cpp Outdated Show resolved Hide resolved
cpp/benchmarks/sort/sort_structs.cpp Outdated Show resolved Hide resolved
cpp/benchmarks/sort/sort_structs.cpp Outdated Show resolved Hide resolved
cpp/include/cudf/table/experimental/row_operators.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/table/experimental/row_operators.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/table/experimental/row_operators.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/table/experimental/row_operators.cuh Outdated Show resolved Hide resolved
@devavret devavret requested a review from vyasr March 21, 2022 08:12
Comment on lines +280 to +283
static std::shared_ptr<preprocessed_table> create(table_view const& table,
host_span<order const> column_order,
host_span<null_order const> null_precedence,
rmm::cuda_stream_view stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to force people to always construct as a shared_ptr?

I suppose if someone is going through the trouble of constructing a preprocessed_table themselves, then that means they are intending to use it in more than one comparator, in which case it will already need to be a shared_ptr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly. Why not return a shared_ptr if everything that can accept it needs a shared_ptr.

Comment on lines +186 to +187
return std::shared_ptr<preprocessed_table>(new preprocessed_table(
std::move(d_t), std::move(d_column_order), std::move(d_null_precedence), std::move(d_depths)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make_shared

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't do it for the same reason table_device_view couldn't. make_shared needs a public ctor.

* @tparam Nullate A cudf::nullate type describing how to check for nulls.
*/
template <typename Nullate>
class device_row_comparator {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding element_comparator as a nested type made me realize that device_row_comparator isn't publicly constructible either. That makes me think we should nest it inside of self_comparator. Especially since we'll need to add a different version for the non-self comparator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I purposefully left it out because then it can be used by both the one table comparator and the two table comparator. The one table comparator will use it directly while the two table comparator can wrap it in another index side aware class. That index side aware wrapper class can live inside two table comparator.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of minor suggestions, otherwise this LGTM now! Since it's still experimental and verticalization is an internal implementation detail I am fine merging it as is and revisiting naming in order to get ahead of burndown.

cpp/include/cudf/table/experimental/row_operators.cuh Outdated Show resolved Hide resolved
Comment on lines +162 to +175
CUDF_EXPECTS(c.type().id() != type_id::LIST,
"Cannot lexicographic compare a table with a LIST column");
if (not is_nested(c.type())) {
CUDF_EXPECTS(is_relationally_comparable(c.type()),
"Cannot lexicographic compare a table with a column of type " +
jit::get_type_name(c.type()));
}
for (auto child = c.child_begin(); child < c.child_end(); ++child) {
check_column(*child);
}
};
for (column_view const& c : input) {
check_column(c);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: It might be cleaner to implement check_column as a function returning a bool (maybe renamed to is_lex_compatible or so) instead of one that raises. Then you could use std::any (both for the main invocation and the recursive invocations inside the function) and call CUDF_EXPECTS if the function returns false. Not a blocker for merge, might be nice for future work though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider it. But I wanted to have descriptive exceptions for exactly why this table is incompatible.

@devavret devavret requested a review from jrhemstad March 21, 2022 19:37
@devavret
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit e7dba35 into rapidsai:branch-22.04 Mar 22, 2022
rapids-bot bot pushed a commit that referenced this pull request Mar 22, 2022
…0483)

Includes `<cstddef>` for `ptrdiff_t` in `parquet/compact_protocol_reader.hpp`. Compilation fails on GCC 11 without this include. Targeting 22.04 since this was broken yesterday in #10063.

Error output:
```
cudf/cpp/src/io/parquet/compact_protocol_reader.hpp:51:17: error: 'ptrdiff_t' does not name a type
   51 |   [[nodiscard]] ptrdiff_t bytecount() const noexcept { return m_cur - m_base; }
      |
cudf/cpp/src/io/parquet/compact_protocol_reader.hpp:22:1: note: 'ptrdiff_t' is defined in header '<cstddef>'; did you forget to '#include <cstddef>'?
```

Also includes `<optional>` in `cpp/include/cudf/table/experimental/row_operators.cuh`, which was broken by #10164.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Conor Hoekstra (https://github.com/codereport)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #10483
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants