Add minhash support for MurmurHash3_x64_128 #13796

Merged 27 commits, Aug 21, 2023
Changes from 9 commits
Commits
6004d35
Add minhash support for MurmurHash3_x64_128
davidwendt Aug 1, 2023
f51287d
Merge branch 'branch-23.10' into fea-minhash64
davidwendt Aug 1, 2023
acec8a8
Merge branch 'branch-23.10' into fea-minhash64
davidwendt Aug 3, 2023
9775001
Merge branch 'branch-23.10' into fea-minhash64
davidwendt Aug 6, 2023
6ef09ba
Merge branch 'branch-23.10' into fea-minhash64
davidwendt Aug 7, 2023
ef22254
Merge branch 'branch-23.10' into fea-minhash64
davidwendt Aug 8, 2023
74a11ea
Merge branch 'branch-23.10' into fea-minhash64
davidwendt Aug 10, 2023
35ce135
add minhash64 to benchmarks
davidwendt Aug 10, 2023
287e3d9
Merge branch 'branch-23.10' into fea-minhash64
davidwendt Aug 10, 2023
72b66c2
add const decl
davidwendt Aug 11, 2023
2389901
Merge branch 'branch-23.10' into fea-minhash64
davidwendt Aug 11, 2023
0ecc9c6
add comment about using only the first value from murmurhash3_x64_128
davidwendt Aug 11, 2023
180c619
Remove errors about hash_function.
bdice Aug 11, 2023
706b1e4
rework multi-seed to always return list
davidwendt Aug 11, 2023
fd2f505
Merge branch 'branch-23.10' into fea-minhash64
davidwendt Aug 11, 2023
b335f1f
Merge branch 'fea-minhash64' of github.com:davidwendt/cudf into fea-m…
davidwendt Aug 11, 2023
07483a5
Merge branch 'branch-23.10' into fea-minhash64
davidwendt Aug 14, 2023
6a5226f
Merge branch 'branch-23.10' into fea-minhash64
davidwendt Aug 14, 2023
e2fd975
Merge branch 'branch-23.10' into fea-minhash64
davidwendt Aug 15, 2023
df38760
fix incorrect comments
davidwendt Aug 16, 2023
3780d96
Merge branch 'branch-23.10' into fea-minhash64
davidwendt Aug 16, 2023
9052e6f
Merge branch 'branch-23.10' into fea-minhash64
davidwendt Aug 17, 2023
23d88ca
Merge branch 'branch-23.10' into fea-minhash64
davidwendt Aug 17, 2023
6690e27
Merge branch 'branch-23.10' into fea-minhash64
davidwendt Aug 17, 2023
b5fde2e
Merge branch 'branch-23.10' into fea-minhash64
davidwendt Aug 18, 2023
638a65d
Merge branch 'branch-23.10' into fea-minhash64
davidwendt Aug 18, 2023
b347eca
fix some castings
davidwendt Aug 18, 2023
17 changes: 10 additions & 7 deletions cpp/benchmarks/text/minhash.cpp
@@ -30,6 +30,7 @@ static void bench_minhash(nvbench::state& state)
auto const row_width = static_cast<cudf::size_type>(state.get_int64("row_width"));
auto const hash_width = static_cast<cudf::size_type>(state.get_int64("hash_width"));
auto const seed_count = static_cast<cudf::size_type>(state.get_int64("seed_count"));
+ auto const b64 = state.get_int64("htype") == 64;

if (static_cast<std::size_t>(num_rows) * static_cast<std::size_t>(row_width) >=
static_cast<std::size_t>(std::numeric_limits<cudf::size_type>::max())) {
@@ -44,9 +45,9 @@ static void bench_minhash(nvbench::state& state)

data_profile const seeds_profile = data_profile_builder().null_probability(0).distribution(
cudf::type_to_id<cudf::hash_value_type>(), distribution_id::NORMAL, 0, row_width);
- auto const seeds_table = create_random_table(
- {cudf::type_to_id<cudf::hash_value_type>()}, row_count{seed_count}, seeds_profile);
- auto seeds = seeds_table->get_column(0);
+ auto const seed_type = b64 ? cudf::type_id::UINT64 : cudf::type_id::UINT32;
+ auto const seeds_table = create_random_table({seed_type}, row_count{seed_count}, seeds_profile);
+ auto seeds = seeds_table->get_column(0);
seeds.set_null_mask(rmm::device_buffer{}, 0);

state.set_cuda_stream(nvbench::make_cuda_stream_view(cudf::get_default_stream().value()));
@@ -56,13 +57,15 @@ static void bench_minhash(nvbench::state& state)
state.add_global_memory_writes<nvbench::int32_t>(num_rows); // output are hashes

state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
- auto result = nvtext::minhash(input, seeds.view(), hash_width);
+ auto result = b64 ? nvtext::minhash64(input, seeds.view(), hash_width)
+ : nvtext::minhash(input, seeds.view(), hash_width);
});
}

NVBENCH_BENCH(bench_minhash)
.set_name("minhash")
- .add_int64_axis("num_rows", {1024, 4096, 8192, 16364, 32768, 262144})
+ .add_int64_axis("num_rows", {1024, 8192, 16364, 131072})
.add_int64_axis("row_width", {128, 512, 2048})
- .add_int64_axis("hash_width", {5, 10, 25})
- .add_int64_axis("seed_count", {2, 26});
+ .add_int64_axis("hash_width", {5, 10})
+ .add_int64_axis("seed_count", {2, 26})
+ .add_int64_axis("htype", {32, 64});
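
Note on the size guard near the top of bench_minhash: both factors are cast to std::size_t before the multiplication, so the product of row count and row width is computed in a wider type than cudf::size_type and only then compared against its maximum. A minimal standalone sketch of that pattern follows. It assumes a 64-bit std::size_t and uses a local 32-bit `size_type` alias as a stand-in for cudf::size_type; `exceeds_size_limit` is a hypothetical helper name, not something defined in the benchmark.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Stand-in for cudf::size_type (a 32-bit signed integer).
using size_type = std::int32_t;

// Hypothetical helper: true when num_rows * row_width reaches or passes the
// largest value size_type can hold. Both factors are widened to std::size_t
// first, so the product itself cannot wrap (assuming a 64-bit std::size_t).
bool exceeds_size_limit(size_type num_rows, size_type row_width)
{
  return static_cast<std::size_t>(num_rows) * static_cast<std::size_t>(row_width) >=
         static_cast<std::size_t>(std::numeric_limits<size_type>::max());
}
```

With the axes in this diff, the largest combination is 131072 rows of width 2048, which is 2^28 characters and stays under the limit.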
82 changes: 69 additions & 13 deletions cpp/include/nvtext/minhash.hpp
@@ -36,24 +36,23 @@ namespace nvtext {
*
* Any null row entries result in corresponding null output rows.
*
+ * This function uses MurmurHash3_x86_32 for the hash algorithm.
+ *
* @throw std::invalid_argument if the width < 2
* @throw std::invalid_argument if hash_function is not HASH_MURMUR3
*
* @param input Strings column to compute minhash
- * @param seed Seed value used for the MurmurHash3_x86_32 algorithm
+ * @param seed Seed value used for the hash algorithm
* @param width The character width used for apply substrings;
* Default is 4 characters.
- * @param hash_function Hash algorithm to use;
- * Only HASH_MURMUR3 is currently supported.
* @param mr Device memory resource used to allocate the returned column's device memory
* @return Minhash values for each string in input
*/
std::unique_ptr<cudf::column> minhash(
cudf::strings_column_view const& input,
- cudf::numeric_scalar<cudf::hash_value_type> seed = cudf::numeric_scalar(cudf::DEFAULT_HASH_SEED),
- cudf::size_type width = 4,
- cudf::hash_id hash_function = cudf::hash_id::HASH_MURMUR3,
- rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
+ cudf::numeric_scalar<uint32_t> seed = 0,

Review comment (Contributor): Why switch to hardcoding the value here instead of using the constant?

Reply (Contributor Author): Because the type may be different. I'd rather be clear that the default seed is actually 0 and would not want to change that if the rest of libcudf decided on a different default. Hopefully that is ok.

+ cudf::size_type width = 4,
+ rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

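Note: the doc comment above describes the single-seed computation as hashing substrings of each string and keeping the minimum hash value. A minimal host-side sketch of that idea for one string is shown here. Everything in it is an assumption made for illustration: `toy_hash32` is a seeded FNV-1a stand-in and not MurmurHash3_x86_32, `minhash_one` is a hypothetical name, the width is counted in bytes rather than characters, and strings shorter than `width` are hashed as a single substring.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string_view>

// Stand-in 32-bit hash (seeded FNV-1a). This is NOT MurmurHash3_x86_32;
// it only keeps the sketch short and self-contained.
std::uint32_t toy_hash32(std::string_view s, std::uint32_t seed)
{
  std::uint32_t h = 2166136261u ^ seed;
  for (unsigned char c : s) {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

// Hypothetical host-side minhash for one string: hash every `width`-byte
// substring and keep the smallest hash value.
std::uint32_t minhash_one(std::string_view str, std::uint32_t seed = 0, std::size_t width = 4)
{
  auto result = std::numeric_limits<std::uint32_t>::max();
  // Assumption: a string shorter than `width` is hashed as one substring.
  std::size_t const last = str.size() > width ? str.size() - width : 0;
  for (std::size_t pos = 0; pos <= last; ++pos) {
    result = std::min(result, toy_hash32(str.substr(pos, width), seed));
  }
  return result;
}
```

For a 5-byte input such as "abcde" with the default width of 4, the result is the smaller of the hashes of "abcd" and "bcde".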
/**
* @brief Returns the minhash values for each string per seed
@@ -64,6 +63,8 @@ std::unique_ptr<cudf::column> minhash(
* string. The order of the elements in each row match the order of
* the seeds provided in the `seeds` parameter.
*
+ * This function uses MurmurHash3_x86_32 for the hash algorithm.
+ *
* Any null row entries result in corresponding null output rows.
*
* @throw std::invalid_argument if the width < 2
@@ -72,20 +73,75 @@ std::unique_ptr<cudf::column> minhash(
* @throw std::overflow_error if `seeds * input.size()` exceeds the column size limit
*
* @param input Strings column to compute minhash
- * @param seeds Seed values used for the MurmurHash3_x86_32 algorithm
+ * @param seeds Seed values used for the hash algorithm
* @param width The character width used for apply substrings;
* Default is 4 characters.
- * @param hash_function Hash algorithm to use;
- * Only HASH_MURMUR3 is currently supported.
* @param mr Device memory resource used to allocate the returned column's device memory
* @return List column of minhash values for each string per seed
- * or a hash_value_type column if only a single seed is specified
+ * or a UINT32 type column if only a single seed is specified
*/
std::unique_ptr<cudf::column> minhash(
cudf::strings_column_view const& input,
- cudf::device_span<cudf::hash_value_type const> seeds,
+ cudf::device_span<uint32_t const> seeds,
+ cudf::size_type width = 4,
+ rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

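Note: the multi-seed doc comment above specifies one list row per input string, with elements in the same order as `seeds`, plus three documented error conditions. A rough host-side analogue is sketched here under stated assumptions: `toy_hash32` is a seeded FNV-1a stand-in rather than MurmurHash3_x86_32, `minhash_per_seed` is a hypothetical name, std::vector stands in for the list column, null rows are not modeled, and the int32 maximum stands in for the column size limit. The sketch always returns a list, in line with the later commit "rework multi-seed to always return list", rather than the single-seed special case described in the comment at this point in the diff.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Stand-in seeded hash (FNV-1a); the documented function uses MurmurHash3_x86_32.
std::uint32_t toy_hash32(std::string_view s, std::uint32_t seed)
{
  std::uint32_t h = 2166136261u ^ seed;
  for (unsigned char c : s) {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

// Hypothetical host-side analogue: one inner vector per input string,
// one element per seed, in the same order as `seeds`.
std::vector<std::vector<std::uint32_t>> minhash_per_seed(std::vector<std::string> const& input,
                                                         std::vector<std::uint32_t> const& seeds,
                                                         std::size_t width = 4)
{
  if (width < 2) { throw std::invalid_argument("width must be at least 2"); }
  if (seeds.empty()) { throw std::invalid_argument("seeds must not be empty"); }
  // Assumption: the int32 maximum stands in for the column size limit.
  constexpr auto limit = static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max());
  if (seeds.size() * input.size() > limit) { throw std::overflow_error("output too large"); }

  std::vector<std::vector<std::uint32_t>> output;
  output.reserve(input.size());
  for (auto const& str : input) {
    std::string_view const sv{str};
    // Assumption: a string shorter than `width` is hashed as one substring.
    std::size_t const last = sv.size() > width ? sv.size() - width : 0;
    std::vector<std::uint32_t> row;
    row.reserve(seeds.size());
    for (auto const seed : seeds) {
      auto mh = std::numeric_limits<std::uint32_t>::max();
      for (std::size_t pos = 0; pos <= last; ++pos) {
        mh = std::min(mh, toy_hash32(sv.substr(pos, width), seed));
      }
      row.push_back(mh);  // element i of the row belongs to seeds[i]
    }
    output.push_back(std::move(row));
  }
  return output;
}
```

Because each row is filled by iterating `seeds` in order, the value computed for a given seed is the same whether that seed is passed alone or as part of a longer list.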
+ /**
+ * @brief Returns the minhash value for each string
+ *
+ * Hash values are computed from substrings of each string and the
+ * minimum hash value is returned for each string.
+ *
+ * Any null row entries result in corresponding null output rows.
+ *
+ * This function uses MurmurHash3_x64_128 for the hash algorithm.
+ *
+ * @throw std::invalid_argument if the width < 2
+ * @throw std::invalid_argument if hash_function is not HASH_MURMUR3
+ *
+ * @param input Strings column to compute minhash
+ * @param seed Seed value used for the hash algorithm
+ * @param width The character width used for apply substrings;
+ * Default is 4 characters.
+ * @param mr Device memory resource used to allocate the returned column's device memory
+ * @return Minhash values as UINT64 for each string in input
+ */
+ std::unique_ptr<cudf::column> minhash64(
+ cudf::strings_column_view const& input,
+ cudf::numeric_scalar<uint64_t> seed = 0,

Review comment (Contributor): Same question as before, why the different seed choice? Is it because of the potential for unsafe casts depending on the type of DEFAULT_HASH_SEED (currently OK since it's uint32_t)?

Reply (Contributor Author): Yes, the type is different here and I'd rather the 2 functions be consistent over any need to use the constant def.

+ cudf::size_type width = 4,
+ rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

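Note: minhash64 takes a 64-bit seed and returns UINT64 values, while MurmurHash3_x64_128 produces 128 bits. The commit list above mentions using only the first value from murmurhash3_x64_128. A minimal host-side sketch of that reduction follows, under stated assumptions: `toy_hash128` is a stand-in built from two FNV-1a lanes and is not MurmurHash3_x64_128, `hash128` and `minhash64_one` are hypothetical names, the width is counted in bytes, and strings shorter than `width` are hashed as a single substring.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string_view>

// Stand-in for a 128-bit hash result: two 64-bit words.
using hash128 = std::array<std::uint64_t, 2>;

// Toy 128-bit hash built from two seeded FNV-1a lanes. NOT MurmurHash3_x64_128.
hash128 toy_hash128(std::string_view s, std::uint64_t seed)
{
  std::uint64_t h1 = 14695981039346656037ull ^ seed;
  std::uint64_t h2 = ~h1;
  for (unsigned char c : s) {
    h1 = (h1 ^ c) * 1099511628211ull;
    h2 = (h2 ^ c) * 1099511628211ull;
  }
  return {h1, h2};
}

// Hypothetical host-side minhash64 for one string: hash every `width`-byte
// substring and keep the smallest first 64-bit word of the 128-bit result.
std::uint64_t minhash64_one(std::string_view str, std::uint64_t seed = 0, std::size_t width = 4)
{
  auto result = std::numeric_limits<std::uint64_t>::max();
  // Assumption: a string shorter than `width` is hashed as one substring.
  std::size_t const last = str.size() > width ? str.size() - width : 0;
  for (std::size_t pos = 0; pos <= last; ++pos) {
    auto const hv = toy_hash128(str.substr(pos, width), seed);
    result = std::min(result, hv[0]);  // only the first 64-bit word is used
  }
  return result;
}
```

The 64-bit seed type also means seed values above the 32-bit range, such as `1ull << 40`, can be passed without truncation.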
+ /**
+ * @brief Returns the minhash values for each string per seed
+ *
+ * Hash values are computed from substrings of each string and the
+ * minimum hash value is returned for each string for each seed.
+ * Each row of the list column are seed results for the corresponding
+ * string. The order of the elements in each row match the order of
+ * the seeds provided in the `seeds` parameter.
+ *
+ * This function uses MurmurHash3_x64_128 for the hash algorithm.
+ *
+ * Any null row entries result in corresponding null output rows.
+ *
+ * @throw std::invalid_argument if the width < 2
+ * @throw std::invalid_argument if hash_function is not HASH_MURMUR3
+ * @throw std::invalid_argument if seeds is empty
+ * @throw std::overflow_error if `seeds * input.size()` exceeds the column size limit
+ *
+ * @param input Strings column to compute minhash
+ * @param seeds Seed values used for the hash algorithm
+ * @param width The character width used for apply substrings;
+ * Default is 4 characters.
+ * @param mr Device memory resource used to allocate the returned column's device memory
+ * @return List column of minhash values for each string per seed
+ * or a UINT64 type column if only a single seed is specified
+ */
+ std::unique_ptr<cudf::column> minhash64(
+ cudf::strings_column_view const& input,
+ cudf::device_span<uint64_t const> seeds,
cudf::size_type width = 4,
- cudf::hash_id hash_function = cudf::hash_id::HASH_MURMUR3,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of group