-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Struct hashing support for SerialMurmur3 and SparkMurmur3 #7714
Changes from all commits
d471524
12d7197
32ffab1
87687c9
1fdcdf5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,5 @@ | ||
/* | ||
* Copyright (c) 2019-2020, NVIDIA CORPORATION. | ||
* Copyright (c) 2019-2021, NVIDIA CORPORATION. | ||
* | ||
* Licensed under the Apache License, Version 2.0 (the "License"); | ||
* you may not use this file except in compliance with the License. | ||
|
@@ -29,6 +29,8 @@ | |
|
||
#include <rmm/cuda_stream_view.hpp> | ||
|
||
#include <algorithm> | ||
|
||
namespace cudf { | ||
namespace { | ||
|
||
|
@@ -38,6 +40,22 @@ bool md5_type_check(data_type dt) | |
return !is_chrono(dt) && (is_fixed_width(dt) || (dt.id() == type_id::STRING)); | ||
} | ||
|
||
template <typename IterType> | ||
std::vector<column_view> to_leaf_columns(IterType iter_begin, IterType iter_end) | ||
{ | ||
std::vector<column_view> leaf_columns; | ||
std::for_each(iter_begin, iter_end, [&leaf_columns](column_view const& col) { | ||
if (is_nested(col.type())) { | ||
CUDF_EXPECTS(col.type().id() == type_id::STRUCT, "unsupported nested type"); | ||
auto child_columns = to_leaf_columns(col.child_begin(), col.child_end()); | ||
leaf_columns.insert(leaf_columns.end(), child_columns.begin(), child_columns.end()); | ||
} else { | ||
leaf_columns.emplace_back(col); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should I be worried that the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK. I was hoping that somehow the two branches could be made more symmetric to avoid confusion. But I looked at how to do it and I couldn't find a way. |
||
} | ||
}); | ||
return leaf_columns; | ||
} | ||
|
||
} // namespace | ||
|
||
namespace detail { | ||
|
@@ -133,10 +151,11 @@ std::unique_ptr<column> serial_murmur_hash3_32(table_view const& input, | |
|
||
if (input.num_columns() == 0 || input.num_rows() == 0) { return output; } | ||
|
||
auto const device_input = table_device_view::create(input, stream); | ||
table_view const leaf_table(to_leaf_columns(input.begin(), input.end())); | ||
auto const device_input = table_device_view::create(leaf_table, stream); | ||
auto output_view = output->mutable_view(); | ||
|
||
if (has_nulls(input)) { | ||
if (has_nulls(leaf_table)) { | ||
thrust::tabulate(rmm::exec_policy(stream), | ||
output_view.begin<int32_t>(), | ||
output_view.end<int32_t>(), | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this just be a static_assert?
cudf_assert
will only fail in a debug build, and only at runtime. This path should result in a compile-time error, shouldn't it?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, static_assert would be ideal. I originally tried a static assert, but I believe the type dispatcher causes it to be referenced in the switch statement and fails the build. I think that's why
cudf_assert
is used in some other hash functions in this header file for unsupported types.That's unfortunate. Is there a separate assertion utility that should be used here that will fire in a release build as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CUDF_FAIL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is device code.
CUDF_FAIL
won't work.release_assert
used to do that, but then we found out it massively increased our binary size #7583There's no good way to do error checking from device code like this that doesn't have a significant impact on performance. We just need to make sure to catch these kinds of things on the host beforehand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about a macro that simply asserts on the device in a release build without any lengthy string stuff such as function, file and line number? Not super useful for debugging the problem, but it would prevent the code from just running and returning some bogus result without an error in case the host-side checks aren't there or working properly. Does that still bloat the binary or harm performance in some way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the mere presence of an
assert
, even unexecuted, impacts performance.With
do_asserts = false
:without -DNDEBUG:
repetitions:100 median: 3.02846ms min: 2.95219ms p20: 2.99827ms p80: 3.05152ms max: 5.3207ms
with -DNDEBUG:
repetitions:100 median: 2.816ms min: 2.7904ms p20: 2.80781ms p80: 2.82726ms max: 2.84467ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have debug builds getting tested under CI at least on a nightly/weekly basis in 0.20 which will help flag any code going down a path it shouldn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what's the action here @jlowe @jrhemstad ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand it, there's nothing more to do. The host code is already checking the types before attempting to dispatch, and these specializations using
cudf_assert
are only in place just in case somehow they get invoked errantly in practice. The performance hit from using an actual device assert in release code is too great, socudf_assert
is as close as we can get, IIUC.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I thought Jake was saying just having a
cudf_assert
was expensive. Misunderstood.