Faster struct row comparator #10164
Conversation
Many iterations already happened. I just realized late that I should commit
Why do we need
Yes.
What's the overhead of just keeping those parents around?
Cannot generalize but for a Struct of depth 8, (
Does that include the superimpose part of the process? And what's the memory overhead?
Codecov Report
@@ Coverage Diff @@
## branch-22.04 #10164 +/- ##
================================================
+ Coverage 86.13% 86.17% +0.03%
================================================
Files 139 141 +2
Lines 22438 22501 +63
================================================
+ Hits 19328 19391 +63
Misses 3110 3110
All prep work (superimpose / null mask to bool in the old method, etc.) is a very small part of the total time taken in this sorting use case: 5 ms out of the total 1776 ms for the old method (superimpose + null mask to bool) and 1 ms out of 659 ms for the new method (superimpose only). This may vary in lighter, non-sort use cases.

That is just the same as the byte size of the struct's null masks. For the same 8-level-deep struct column of length 1<<26, whose size in bytes is 320 MB (256 MB data + 8x8 MB of null masks), we allocate an additional 64 MB, compared to 512 MB for the additional null-to-bool columns. But this depends heavily on the struct's shape: basically all the null masks get duplicated in byte size. I'm thinking about a method that can avoid superimposing nulls entirely, but it would also require a device allocation for a small num_columns-sized array.
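A quick back-of-the-envelope check of those numbers (a sketch assuming the stated 1<<26 rows, an 8-level-deep struct, and 256 MB of leaf data; not code from this PR):

```cpp
#include <cstdio>

int main()
{
  long const num_rows        = 1L << 26;
  long const mask_bytes      = num_rows / 8;     // one validity bit per row: 8 MB per level
  long const all_masks_bytes = 8 * mask_bytes;   // 8 struct levels: 64 MB of null masks
  long const data_bytes      = 256L << 20;       // leaf data size as stated: 256 MB
  long const old_extra_bytes = 8 * num_rows;     // one bool (1 byte) column per level: 512 MB
  long const new_extra_bytes = all_masks_bytes;  // duplicated null masks only: 64 MB

  std::printf("total column size : %ld MB\n", (data_bytes + all_masks_bytes) >> 20);  // 320
  std::printf("old method extra  : %ld MB\n", old_extra_bytes >> 20);                 // 512
  std::printf("new method extra  : %ld MB\n", new_extra_bytes >> 20);                 // 64
}
```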
sliced no longer works
- Add a create method to preprocessed_table that takes care of checks and copies
static std::shared_ptr<preprocessed_table> create(table_view const& table,
                                                  host_span<order const> column_order,
                                                  host_span<null_order const> null_precedence,
                                                  rmm::cuda_stream_view stream);
Do we really want to force people to always construct as a `shared_ptr`? I suppose if someone is going through the trouble of constructing a `preprocessed_table` themselves, then that means they are intending to use it in more than one comparator, in which case it will already need to be a `shared_ptr`.
Exactly. Why not return a `shared_ptr` if everything that can accept it needs a `shared_ptr`?
return std::shared_ptr<preprocessed_table>(new preprocessed_table(
  std::move(d_t), std::move(d_column_order), std::move(d_null_precedence), std::move(d_depths)));
`make_shared`
Can't do it for the same reason `table_device_view` couldn't: `make_shared` needs a public ctor.
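A minimal sketch of the pattern under discussion (the class and member names are illustrative stand-ins, not the actual cudf types): `std::make_shared` has to invoke the constructor itself, so it cannot be used when the constructor is private; a static factory can still return a `shared_ptr` by calling `new` directly.

```cpp
#include <memory>

class preprocessed_widget {  // hypothetical stand-in for preprocessed_table
 public:
  // The factory is the only way to construct the object. It returns a
  // shared_ptr because every consumer needs shared ownership anyway.
  static std::shared_ptr<preprocessed_widget> create(int value)
  {
    // std::make_shared<preprocessed_widget>(value) would not compile here,
    // because make_shared needs access to the (private) constructor.
    return std::shared_ptr<preprocessed_widget>(new preprocessed_widget(value));
  }

 private:
  explicit preprocessed_widget(int value) : _value{value} {}
  int _value;
};
```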
 * @tparam Nullate A cudf::nullate type describing how to check for nulls.
 */
template <typename Nullate>
class device_row_comparator {
Adding `element_comparator` as a nested type made me realize that `device_row_comparator` isn't publicly constructible either. That makes me think we should nest it inside of `self_comparator`, especially since we'll need to add a different version for the non-self comparator.
I purposefully left it out because then it can be used by both the one-table comparator and the two-table comparator. The one-table comparator will use it directly, while the two-table comparator can wrap it in another index-side-aware class. That index-side-aware wrapper class can live inside the two-table comparator.
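A hypothetical sketch of what such an index-side-aware wrapper might look like (the strong index types, the class name, and the constructor are assumptions for illustration, not the API in this PR):

```cpp
// Strong index types so callers cannot mix up which table a row index refers to.
struct lhs_index_type { cudf::size_type value; };
struct rhs_index_type { cudf::size_type value; };

// Wrapper living inside the two-table comparator; it forwards to the shared
// device_row_comparator, which is assumed to have been built over the two
// preprocessed tables.
template <typename Nullate>
class strong_index_comparator {
 public:
  explicit strong_index_comparator(device_row_comparator<Nullate> comparator)
    : _comparator{comparator}
  {
  }

  __device__ cudf::weak_ordering operator()(lhs_index_type lhs, rhs_index_type rhs) const noexcept
  {
    return _comparator(lhs.value, rhs.value);
  }

 private:
  device_row_comparator<Nullate> _comparator;
};
```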
A couple of minor suggestions, otherwise this LGTM now! Since it's still experimental and verticalization is an internal implementation detail, I am fine merging it as is and revisiting naming in order to get ahead of burndown.
  CUDF_EXPECTS(c.type().id() != type_id::LIST,
               "Cannot lexicographic compare a table with a LIST column");
  if (not is_nested(c.type())) {
    CUDF_EXPECTS(is_relationally_comparable(c.type()),
                 "Cannot lexicographic compare a table with a column of type " +
                   jit::get_type_name(c.type()));
  }
  for (auto child = c.child_begin(); child < c.child_end(); ++child) {
    check_column(*child);
  }
};
for (column_view const& c : input) {
  check_column(c);
}
Minor nit: It might be cleaner to implement `check_column` as a function returning a bool (maybe renamed to `is_lex_compatible` or so) instead of one that raises. Then you could use `std::all_of` (both for the main invocation and the recursive invocations inside the function) and call `CUDF_EXPECTS` if the function returns false. Not a blocker for merge, might be nice for future work though.
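A minimal sketch of that alternative (the name `is_lex_compatible` comes from the suggestion above; treat the exact calls as illustrative rather than code in this PR):

```cpp
#include <algorithm>

#include <cudf/column/column_view.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/utilities/error.hpp>
#include <cudf/utilities/traits.hpp>

// Returns true if every column (and nested child) can be lexicographically compared.
bool is_lex_compatible(cudf::column_view const& c)
{
  if (c.type().id() == cudf::type_id::LIST) { return false; }
  if (not cudf::is_nested(c.type()) and not cudf::is_relationally_comparable(c.type())) {
    return false;
  }
  return std::all_of(c.child_begin(), c.child_end(), [](cudf::column_view const& child) {
    return is_lex_compatible(child);
  });
}

// Single check at the call site, at the cost of a less specific error message.
void check_lex_compatibility(cudf::table_view const& input)
{
  CUDF_EXPECTS(std::all_of(input.begin(), input.end(),
                           [](cudf::column_view const& c) { return is_lex_compatible(c); }),
               "Cannot lexicographically compare this table");
}
```

As the reply below notes, the trade-off is that a single boolean check loses the per-column, per-type error messages.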
I did consider it, but I wanted to have descriptive exceptions for exactly why this table is incompatible.
Co-authored-by: Vyas Ramasubramani <[email protected]>
@gpucibot merge
…0483) Includes `<cstddef>` for `ptrdiff_t` in `parquet/compact_protocol_reader.hpp`. Compilation fails on GCC 11 without this include. Targeting 22.04 since this was broken yesterday in #10063. Error output:

```
cudf/cpp/src/io/parquet/compact_protocol_reader.hpp:51:17: error: 'ptrdiff_t' does not name a type
   51 | [[nodiscard]] ptrdiff_t bytecount() const noexcept { return m_cur - m_base; }
cudf/cpp/src/io/parquet/compact_protocol_reader.hpp:22:1: note: 'ptrdiff_t' is defined in header '<cstddef>'; did you forget to '#include <cstddef>'?
```

Also includes `<optional>` in `cpp/include/cudf/table/experimental/row_operators.cuh`, which was broken by #10164.

Authors:
- Bradley Dice (https://github.com/bdice)

Approvers:
- Conor Hoekstra (https://github.com/codereport)
- Yunsong Wang (https://github.com/PointKernel)

URL: #10483
The existing `row_lexicographical_comparator` cannot compare struct columns, so the current solution is to `flatten` a struct column with pre-order traversal. This involves creating a bool column for each struct level. E.g. for a struct of the following shape (a `Struct(1)` holding an `int` child and a nested `Struct(2)` holding `float` and `string` children), we would generate columns like this:

```
[bool(Struct(1)), int, bool(Struct(2)), float, string]
```

The reason this is done is that struct traversal in the row comparator would require recursion, which is prohibitively expensive on the GPU because stack size cannot be determined at compile time. An alternative was also explored as part of my current effort. [1]
The proposed solution is to "verticalize" (please suggest a better name) the struct columns. This means the struct columns are converted into a format that does not require stack storage, and traversing it requires only a state with fixed storage. For the above example struct, the conversion would yield 3 columns:

```
[Struct(1)<int>, Struct(1)<Struct(2)<float>>, Struct(1)<Struct(2)<string>>]
```

Using this with the row comparator required adding a loop that traverses down the hierarchy and only checks for nulls at the struct levels. Since each hierarchy is guaranteed to have only one child, no stack is required to keep track of the location in the hierarchy.
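A minimal sketch of that traversal loop (simplified and illustrative; not the exact code added by this PR, and the leaf-level value comparison is elided):

```cpp
#include <cudf/column/column_device_view.cuh>
#include <cudf/table/row_operators.cuh>  // cudf::weak_ordering
#include <cudf/types.hpp>

// Walk a pair of verticalized columns down `depth` struct levels with a plain
// loop. Every level has exactly one child, so no stack is needed.
__device__ cudf::weak_ordering compare_verticalized(cudf::column_device_view lcol,
                                                    cudf::column_device_view rcol,
                                                    cudf::size_type lhs_index,
                                                    cudf::size_type rhs_index,
                                                    int depth)
{
  for (int level = 0; level < depth; ++level) {
    // Nulls are checked only at the struct levels; child nulls are assumed to
    // have been superimposed during preprocessing.
    bool const lhs_is_null = lcol.is_null(lhs_index);
    bool const rhs_is_null = rcol.is_null(rhs_index);
    if (lhs_is_null or rhs_is_null) {
      // With null_order::BEFORE semantics: nulls sort before all other values.
      if (lhs_is_null and rhs_is_null) { return cudf::weak_ordering::EQUIVALENT; }
      return lhs_is_null ? cudf::weak_ordering::LESS : cudf::weak_ordering::GREATER;
    }
    // Descend to the single child of this level.
    lcol = lcol.child(0);
    rcol = rcol.child(0);
  }
  // Leaf-level (non-struct) value comparison elided in this sketch.
  return cudf::weak_ordering::EQUIVALENT;
}
```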
Further, it can be shown that parents that have already appeared once in the transformed columns need not appear again, because in a lexicographical comparison they would already have been compared. Thus the final transformed columns can look like this:

```
[Struct(1)<int>, Struct(2)<float>, string]
```

This approach has 2 benefits:
Benchmarked with num_rows {1<<24, 1<<26} and depth {1, 8}.
[1] The alternative was to convert recursion to iteration by constructing a manually controlled call stack with stack-memory-backed storage. This would be limited by the stack memory and was found to be more expensive than the current approach. The code for this is in `row_operators2.cuh`.
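For reference, a generic illustration of that rejected idea (not the actual `row_operators2.cuh` code): recursive pre-order traversal rewritten as a loop over a manually managed, fixed-capacity stack, the way one would have to do it in device code.

```cpp
#include <array>
#include <cstdio>

struct node {
  int value;
  node const* children[2];  // up to two children; unused slots stay null
  int num_children;
};

constexpr int max_stack = 16;  // assumption: a small bound is enough for the example

void preorder_iterative(node const& root)
{
  std::array<node const*, max_stack> stack{};
  int top      = 0;
  stack[top++] = &root;
  while (top > 0) {
    node const* current = stack[--top];
    std::printf("%d\n", current->value);
    // Push children in reverse so the leftmost child is visited first.
    for (int i = current->num_children - 1; i >= 0; --i) {
      stack[top++] = current->children[i];
    }
  }
}
```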
API changes

This PR adds an owning type `self_comparator` that takes a `table_view`, preprocesses it as described above, and stores the necessary device objects needed for comparison. The owning type then provides a functor for use on the device.

Another owning type called `preprocessed_table` is added, which can also be constructed from a `table_view` and does the same preprocessing. `self_comparator` can also be constructed from a `preprocessed_table`. This is useful when trying to use the same preprocessed table in different comparators.
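A rough usage sketch of the two construction paths described above (the namespace and the exact constructor signatures are assumptions for illustration, not necessarily the precise API added by this PR):

```cpp
#include <cudf/table/experimental/row_operators.cuh>
#include <cudf/table/table_view.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/span.hpp>

#include <rmm/cuda_stream_view.hpp>

void example(cudf::table_view const& input,
             cudf::host_span<cudf::order const> column_order,
             cudf::host_span<cudf::null_order const> null_precedence,
             rmm::cuda_stream_view stream)
{
  namespace lex = cudf::experimental::row::lexicographic;  // assumed namespace

  // Path 1: construct directly from a table_view; preprocessing happens internally.
  auto comp_a = lex::self_comparator{input, column_order, null_precedence, stream};

  // Path 2: preprocess once, then share the preprocessed table across comparators.
  auto preprocessed = lex::preprocessed_table::create(input, column_order, null_precedence, stream);
  auto comp_b       = lex::self_comparator{preprocessed};
  auto comp_c       = lex::self_comparator{preprocessed};  // reuses the same device data
}
```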