Add nvtext::minhash function #12961

Merged
merged 89 commits into from
Apr 21, 2023
Commits
35885c5
Add nvtext::minhash function
davidwendt Mar 16, 2023
a014367
fix missing parameter name in function declaration
davidwendt Mar 16, 2023
a6b40f4
fix typo in doxygen comment
davidwendt Mar 17, 2023
9016f86
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 17, 2023
a31e107
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 17, 2023
51d64f8
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 17, 2023
5226d7f
add cython/python interface to nvtext::minhash
davidwendt Mar 17, 2023
21af847
fix style violations
davidwendt Mar 17, 2023
3d9ad5f
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 17, 2023
8329b92
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 20, 2023
9327813
add benchmark
davidwendt Mar 20, 2023
d441b0d
fix style violation
davidwendt Mar 20, 2023
3016c59
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 20, 2023
e138476
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 20, 2023
3330209
Merge branch 'text-minhashing' of github.com:davidwendt/cudf into tex…
davidwendt Mar 20, 2023
8d206fa
add doxygen group
davidwendt Mar 20, 2023
5ce07ba
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 21, 2023
921c225
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 21, 2023
202bf12
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 21, 2023
6efdf21
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 22, 2023
d7d947a
fix long strings issue
davidwendt Mar 22, 2023
33e5f33
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 22, 2023
9e66eec
rework as warp parallel kernel
davidwendt Mar 23, 2023
3f8b39c
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 23, 2023
c701df8
add multi-seed libcudf API
davidwendt Mar 24, 2023
21db66b
fix benchmark call
davidwendt Mar 24, 2023
bcc94e0
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 24, 2023
0e3c4e3
change cython/python to use multi-seed API
davidwendt Mar 24, 2023
d8f06f3
move const itr vars outside the for
davidwendt Mar 24, 2023
9fde3d4
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 24, 2023
41c14a2
switch for-loops and use atomicMin
davidwendt Mar 24, 2023
da78c2c
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 27, 2023
28cf712
support seeds default parameter
davidwendt Mar 27, 2023
491342d
add multi-seed support to benchmark
davidwendt Mar 27, 2023
312ed6f
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 28, 2023
68660ec
add hash function parameter
davidwendt Mar 28, 2023
b0154eb
Merge branch 'branch-23.06' into text-minhashing
davidwendt Mar 29, 2023
a07a25b
fix dstr.length <= width edge case
davidwendt Mar 29, 2023
d65ad8f
add more tests
davidwendt Mar 29, 2023
b4ca094
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 3, 2023
256d7e3
move hash-id parameter to the end
davidwendt Apr 3, 2023
ff624dd
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 4, 2023
0423e82
fix race condition on initializing hash output
davidwendt Apr 4, 2023
0684ba8
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 5, 2023
e3f0aa5
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 6, 2023
bd5d660
add call to sanitize nulls
davidwendt Apr 6, 2023
013ef44
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 6, 2023
e9be32a
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 7, 2023
1297345
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 7, 2023
a00c959
use thrust::fill to init the output
davidwendt Apr 7, 2023
e192038
fix doxygen for multi-seed API
davidwendt Apr 7, 2023
019d183
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 7, 2023
49db1e3
fix some comments
davidwendt Apr 7, 2023
ea331a2
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 10, 2023
3303a88
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 11, 2023
20ead37
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 12, 2023
bc0102d
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 12, 2023
36e3a65
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 13, 2023
04b9229
use Optional[cudf.Series] declaration
davidwendt Apr 13, 2023
de0a93c
Merge branch 'text-minhashing' of github.com:davidwendt/cudf into tex…
davidwendt Apr 13, 2023
1bad8e2
add overflow check for seeds*input-rows
davidwendt Apr 13, 2023
a918b65
support std::size_t for for-each-n in warp-per-string functor
davidwendt Apr 13, 2023
b920e72
Merge branch 'text-minhashing' of github.com:davidwendt/cudf into tex…
davidwendt Apr 14, 2023
4c21c7c
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 14, 2023
f3a1d6c
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 14, 2023
6dcc042
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 14, 2023
472108b
add tests for error cases
davidwendt Apr 14, 2023
134aa70
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 14, 2023
1c564f4
Merge branch 'text-minhashing' of github.com:davidwendt/cudf into tex…
davidwendt Apr 16, 2023
94a52db
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 16, 2023
92943d6
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 16, 2023
386eea2
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 17, 2023
f49e60e
Merge branch 'text-minhashing' of github.com:davidwendt/cudf into tex…
davidwendt Apr 17, 2023
a82c074
fix style violation
davidwendt Apr 17, 2023
789dd67
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 17, 2023
6dc2430
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 18, 2023
5583c9a
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 18, 2023
f7aa11c
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 18, 2023
5975167
fix doxygen comments
davidwendt Apr 18, 2023
1a04853
Merge branch 'text-minhashing' of github.com:davidwendt/cudf into tex…
davidwendt Apr 18, 2023
5d457f0
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 18, 2023
0a06347
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 19, 2023
6fc724f
fix merge conflict
davidwendt Apr 20, 2023
43135a2
Merge branch 'text-minhashing' of github.com:davidwendt/cudf into tex…
davidwendt Apr 20, 2023
9a13a3c
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 20, 2023
4f1a1b0
remove unused cimport
davidwendt Apr 20, 2023
431897d
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 20, 2023
9685210
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 20, 2023
3a3656c
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 20, 2023
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -580,6 +580,7 @@ add_library(
src/text/detokenize.cu
src/text/edit_distance.cu
src/text/generate_ngrams.cu
src/text/minhash.cu
src/text/ngrams_tokenize.cu
src/text/normalize.cu
src/text/replace.cu
3 changes: 2 additions & 1 deletion cpp/benchmarks/CMakeLists.txt
@@ -274,6 +274,8 @@ ConfigureBench(
text/subword.cpp text/tokenize.cpp
)

ConfigureNVBench(TEXT_NVBENCH text/minhash.cpp)

# ##################################################################################################
# * strings benchmark -------------------------------------------------------------------
ConfigureBench(
@@ -290,7 +292,6 @@ ConfigureBench(
string/factory.cu
string/filter.cpp
string/find.cpp
string/like.cpp
string/repeat_strings.cpp
string/replace.cpp
string/replace_re.cpp
69 changes: 69 additions & 0 deletions cpp/benchmarks/text/minhash.cpp
@@ -0,0 +1,69 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <benchmarks/common/generate_input.hpp>
#include <benchmarks/fixture/rmm_pool_raii.hpp>

#include <cudf/strings/strings_column_view.hpp>

#include <nvtext/minhash.hpp>

#include <nvbench/nvbench.cuh>

#include <rmm/device_buffer.hpp>

static void bench_minhash(nvbench::state& state)
{
auto const num_rows = static_cast<cudf::size_type>(state.get_int64("num_rows"));
auto const row_width = static_cast<cudf::size_type>(state.get_int64("row_width"));
auto const hash_width = static_cast<cudf::size_type>(state.get_int64("hash_width"));
auto const seed_count = static_cast<cudf::size_type>(state.get_int64("seed_count"));

if (static_cast<std::size_t>(num_rows) * static_cast<std::size_t>(row_width) >=
static_cast<std::size_t>(std::numeric_limits<cudf::size_type>::max())) {
state.skip("Skip benchmarks greater than size_type limit");
return;  // skip() only records the reason; return so the oversized input is never generated
}

data_profile const strings_profile = data_profile_builder().distribution(
cudf::type_id::STRING, distribution_id::NORMAL, 0, row_width);
auto const strings_table =
create_random_table({cudf::type_id::STRING}, row_count{num_rows}, strings_profile);
cudf::strings_column_view input(strings_table->view().column(0));

data_profile const seeds_profile = data_profile_builder().null_probability(0).distribution(
cudf::type_to_id<cudf::hash_value_type>(), distribution_id::NORMAL, 0, row_width);
auto const seeds_table = create_random_table(
{cudf::type_to_id<cudf::hash_value_type>()}, row_count{seed_count}, seeds_profile);
auto seeds = seeds_table->get_column(0);
seeds.set_null_mask(rmm::device_buffer{}, 0);

state.set_cuda_stream(nvbench::make_cuda_stream_view(cudf::get_default_stream().value()));

auto chars_size = input.chars_size();
state.add_global_memory_reads<nvbench::int8_t>(chars_size);
state.add_global_memory_writes<nvbench::int32_t>(num_rows); // outputs are hashes

state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
auto result = nvtext::minhash(input, seeds.view(), hash_width);
});
}

NVBENCH_BENCH(bench_minhash)
.set_name("minhash")
.add_int64_axis("num_rows", {1024, 4096, 8192, 16384, 32768, 262144})
.add_int64_axis("row_width", {128, 512, 2048})
.add_int64_axis("hash_width", {5, 10, 25})
.add_int64_axis("seed_count", {2, 26});
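Note on the size guard in `bench_minhash`: the check widens both factors before multiplying so the product cannot overflow a 32-bit `cudf::size_type`. The following is a minimal standalone sketch of that arithmetic, not part of the PR. The helper name `exceeds_size_type_limit` and the local `size_type` alias are assumptions made for illustration, and `std::uint64_t` is used in place of `std::size_t` so the sketch behaves the same on any platform.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Stand-in for cudf::size_type, which is a 32-bit signed integer.
using size_type = std::int32_t;

// Hypothetical helper mirroring the benchmark guard: widen both factors to
// 64 bits, multiply, and compare against the largest size_type value.
inline bool exceeds_size_type_limit(size_type num_rows, size_type row_width)
{
  assert(num_rows >= 0 && row_width >= 0);
  auto const total = static_cast<std::uint64_t>(num_rows) * static_cast<std::uint64_t>(row_width);
  return total >= static_cast<std::uint64_t>(std::numeric_limits<size_type>::max());
}
```

With the axes listed above, the largest pair is 262144 x 2048 = 536,870,912 (2^29), which stays below the limit, so the guard would only trigger if larger axis values were added.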
1 change: 1 addition & 0 deletions cpp/include/doxygen_groups.h
@@ -164,6 +164,7 @@
* @defgroup nvtext_edit_distance Edit Distance
* @defgroup nvtext_tokenize Tokenizing
* @defgroup nvtext_replace Replacing
* @defgroup nvtext_minhash MinHashing
* @}
* @defgroup utility_apis Utilities
* @{
91 changes: 91 additions & 0 deletions cpp/include/nvtext/minhash.hpp
@@ -0,0 +1,91 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <cudf/column/column.hpp>
#include <cudf/hashing.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf/utilities/span.hpp>

namespace nvtext {
/**
* @addtogroup nvtext_minhash
* @{
* @file
*/

/**
* @brief Returns the minhash value for each string
*
* Hash values are computed from substrings of each string and the
* minimum hash value is returned for each string.
*
* Any null row entries result in corresponding null output rows.
*
* @throw std::invalid_argument if the width < 2
* @throw std::invalid_argument if hash_function is not HASH_MURMUR3
*
* @param input Strings column to compute minhash
* @param seed Seed value used for the MurmurHash3_32 algorithm
* @param width The character width of the substrings to hash;
* Default is 4 characters.
* @param hash_function Hash algorithm to use;
* Only HASH_MURMUR3 is currently supported.
* @param mr Device memory resource used to allocate the returned column's device memory
* @return Minhash values for each string in input
*/
std::unique_ptr<cudf::column> minhash(
cudf::strings_column_view const& input,
cudf::numeric_scalar<cudf::hash_value_type> seed = cudf::numeric_scalar(cudf::DEFAULT_HASH_SEED),
cudf::size_type width = 4,
cudf::hash_id hash_function = cudf::hash_id::HASH_MURMUR3,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns the minhash values for each string per seed
*
* Hash values are computed from substrings of each string and the
* minimum hash value is returned for each string for each seed.
* Each row of the list column contains the seed results for the
* corresponding string. The order of the elements in each row matches
* the order of the seeds provided in the `seeds` parameter.
*
* Any null row entries result in corresponding null output rows.
*
* @throw std::invalid_argument if the width < 2
* @throw std::invalid_argument if hash_function is not HASH_MURMUR3
* @throw std::invalid_argument if seeds is empty
*
* @param input Strings column to compute minhash
* @param seeds Seed values used for the MurmurHash3_32 algorithm
* @param width The character width of the substrings to hash;
* Default is 4 characters.
* @param hash_function Hash algorithm to use;
* Only HASH_MURMUR3 is currently supported.
* @param mr Device memory resource used to allocate the returned column's device memory
* @return List column of minhash values for each string per seed
* or a hash_value_type column if only a single seed is specified
*/
std::unique_ptr<cudf::column> minhash(
cudf::strings_column_view const& input,
cudf::device_span<cudf::hash_value_type const> seeds,
cudf::size_type width = 4,
cudf::hash_id hash_function = cudf::hash_id::HASH_MURMUR3,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of group
} // namespace nvtext
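Note on what the two `minhash` overloads compute: the doxygen above describes hashing every `width`-character substring and keeping the minimum per seed. The following host-only sketch illustrates that definition under stated assumptions. It is not the libcudf implementation: `toy_hash` is a hypothetical seeded FNV-1a stand-in rather than MurmurHash3_32, widths are counted in bytes rather than UTF-8 characters, and null rows are not modelled. The name `minhash_reference` is likewise made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string_view>
#include <vector>

using hash_value_type = std::uint32_t;

// Hypothetical stand-in hash (seeded FNV-1a). This is NOT MurmurHash3_32 and
// will not reproduce libcudf's hash values.
inline hash_value_type toy_hash(std::string_view s, hash_value_type seed)
{
  hash_value_type h = 2166136261u ^ seed;
  for (unsigned char c : s) {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

// Returns one minimum hash per seed, in the same order as `seeds`.
inline std::vector<hash_value_type> minhash_reference(std::string_view str,
                                                      std::vector<hash_value_type> const& seeds,
                                                      std::size_t width = 4)
{
  assert(width >= 2 && !seeds.empty());
  // An empty string yields 0 for every seed; otherwise start from the maximum.
  auto const init = str.empty() ? 0u : std::numeric_limits<hash_value_type>::max();
  std::vector<hash_value_type> result(seeds.size(), init);
  if (str.empty()) { return result; }

  // A string no longer than `width` is hashed once, as a whole.
  std::size_t const count = str.size() > width ? str.size() - width + 1 : 1;
  for (std::size_t pos = 0; pos < count; ++pos) {
    auto const sub = str.substr(pos, width);
    for (std::size_t s = 0; s < seeds.size(); ++s) {
      result[s] = std::min(result[s], toy_hash(sub, seeds[s]));
    }
  }
  return result;
}
```

For example, `minhash_reference("abcdefg", {0u, 42u}, 4)` hashes "abcd", "bcde", "cdef" and "defg" with each seed and returns two minima, which corresponds to one row of the list column returned by the multi-seed overload.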
194 changes: 194 additions & 0 deletions cpp/src/text/minhash.cu
@@ -0,0 +1,194 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <nvtext/minhash.hpp>

#include <cudf/column/column.hpp>
#include <cudf/column/column_device_view.cuh>
#include <cudf/column/column_factories.hpp>
#include <cudf/detail/copy.hpp>
#include <cudf/detail/hashing.hpp>
#include <cudf/detail/null_mask.hpp>
#include <cudf/detail/nvtx/ranges.hpp>
#include <cudf/detail/sequence.hpp>
#include <cudf/detail/utilities/cuda.cuh>
#include <cudf/detail/utilities/device_atomics.cuh>
#include <cudf/detail/utilities/hash_functions.cuh>
#include <cudf/strings/string_view.cuh>
#include <cudf/utilities/default_stream.hpp>
#include <cudf/utilities/error.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>

#include <thrust/execution_policy.h>
#include <thrust/fill.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>

#include <limits>

namespace nvtext {
namespace detail {
namespace {

/**
* @brief Compute the minhash of each string for each seed
*
* This is a warp-per-string algorithm where parallel threads within a warp
* work on substrings of a single string row.
*/
struct minhash_fn {
cudf::column_device_view d_strings;
cudf::device_span<cudf::hash_value_type const> seeds;
cudf::size_type width;
cudf::hash_value_type* d_hashes;

__device__ void operator()(std::size_t idx)
{
auto const str_idx = static_cast<cudf::size_type>(idx / cudf::detail::warp_size);
auto const lane_idx = static_cast<cudf::size_type>(idx % cudf::detail::warp_size);

if (d_strings.is_null(str_idx)) { return; }

auto const d_str = d_strings.element<cudf::string_view>(str_idx);
auto const d_output = d_hashes + (str_idx * seeds.size());

// initialize hashes output for this string
if (lane_idx == 0) {
auto const init = d_str.empty() ? 0 : std::numeric_limits<cudf::hash_value_type>::max();
thrust::fill(thrust::seq, d_output, d_output + seeds.size(), init);
}
__syncwarp();

auto const begin = d_str.begin() + lane_idx;
auto const end = [d_str, width = width] {
auto const length = d_str.length();
if (length > width) { return (d_str.end() - (width - 1)); }
return d_str.begin() + static_cast<cudf::size_type>(length > 0);
}();

// each lane hashes substrings of the given width
for (auto itr = begin; itr < end; itr += cudf::detail::warp_size) {
auto const offset = itr.byte_offset();
auto const hash_str =
cudf::string_view(d_str.data() + offset, (itr + width).byte_offset() - offset);

// hashing each seed on the same section of string is 10x faster than
// computing the substrings for each seed
for (std::size_t seed_idx = 0; seed_idx < seeds.size(); ++seed_idx) {
auto const hasher = cudf::detail::MurmurHash3_32<cudf::string_view>{seeds[seed_idx]};
auto const hvalue = hasher(hash_str);
atomicMin(d_output + seed_idx, hvalue);
}
}
}
};

} // namespace

std::unique_ptr<cudf::column> minhash(cudf::strings_column_view const& input,
cudf::device_span<cudf::hash_value_type const> seeds,
cudf::size_type width,
cudf::hash_id hash_function,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
{
CUDF_EXPECTS(!seeds.empty(), "Parameter seeds cannot be empty", std::invalid_argument);
CUDF_EXPECTS(width >= 2,
"Parameter width should be an integer value of 2 or greater",
std::invalid_argument);
CUDF_EXPECTS(hash_function == cudf::hash_id::HASH_MURMUR3,
"Only murmur3 hash algorithm supported",
std::invalid_argument);
CUDF_EXPECTS(
(static_cast<std::size_t>(input.size()) * seeds.size()) <
static_cast<std::size_t>(std::numeric_limits<cudf::size_type>::max()),
"The number of seeds times the number of input rows must not exceed maximum of size_type",
std::invalid_argument);

auto output_type = cudf::data_type{cudf::type_to_id<cudf::hash_value_type>()};
if (input.is_empty()) { return cudf::make_empty_column(output_type); }

auto const d_strings = cudf::column_device_view::create(input.parent(), stream);

auto hashes = cudf::make_numeric_column(output_type,
input.size() * static_cast<cudf::size_type>(seeds.size()),
cudf::mask_state::UNALLOCATED,
stream,
mr);
auto d_hashes = hashes->mutable_view().data<cudf::hash_value_type>();

thrust::for_each_n(
rmm::exec_policy(stream),
thrust::counting_iterator(std::size_t{0}),
static_cast<std::size_t>(input.size()) * static_cast<std::size_t>(cudf::detail::warp_size),
minhash_fn{*d_strings, seeds, width, d_hashes});

if (seeds.size() == 1) {
hashes->set_null_mask(cudf::detail::copy_bitmask(input.parent(), stream, mr),
input.null_count());
return hashes;
}

// build the offsets for the output lists column
auto offsets = cudf::detail::sequence(
input.size() + 1,
cudf::numeric_scalar<cudf::size_type>(0),
cudf::numeric_scalar<cudf::size_type>(static_cast<cudf::size_type>(seeds.size())),
stream,
mr);
hashes->set_null_mask(rmm::device_buffer{}, 0); // children have no nulls

// build the lists column from the offsets and the hashes
auto result = make_lists_column(input.size(),
std::move(offsets),
std::move(hashes),
input.null_count(),
cudf::detail::copy_bitmask(input.parent(), stream, mr),
stream,
mr);
// expect this condition to be very rare
if (input.null_count() > 0) {
result = cudf::detail::purge_nonempty_nulls(result->view(), stream, mr);
}
return result;
}

} // namespace detail

std::unique_ptr<cudf::column> minhash(cudf::strings_column_view const& input,
cudf::numeric_scalar<cudf::hash_value_type> seed,
cudf::size_type width,
cudf::hash_id hash_function,
rmm::mr::device_memory_resource* mr)
{
CUDF_FUNC_RANGE();
auto seeds = cudf::device_span<cudf::hash_value_type const>{seed.data(), 1};
return detail::minhash(input, seeds, width, hash_function, cudf::get_default_stream(), mr);
}

std::unique_ptr<cudf::column> minhash(cudf::strings_column_view const& input,
cudf::device_span<cudf::hash_value_type const> seeds,
cudf::size_type width,
cudf::hash_id hash_function,
rmm::mr::device_memory_resource* mr)
{
CUDF_FUNC_RANGE();
return detail::minhash(input, seeds, width, hash_function, cudf::get_default_stream(), mr);
}

} // namespace nvtext
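Note on the indexing used in `minhash.cu`: each lane of a warp starts at its own lane index and strides by the warp size, the per-seed minima for a row live at `str_idx * seeds.size() + seed_idx` in one flat buffer, and the lists offsets are the sequence 0, n, 2n, ... for n seeds. The host-only sketch below reproduces just that index arithmetic so it can be checked without a GPU. The helper names (`lane_positions`, `flat_index`, `list_offsets`) are hypothetical, and `warp_size` is assumed to be 32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed warp width; a stand-in for cudf::detail::warp_size.
constexpr std::size_t warp_size = 32;

// Substring start positions visited by one lane: lane, lane + 32, lane + 64, ...
inline std::vector<std::size_t> lane_positions(std::size_t lane, std::size_t num_substrings)
{
  assert(lane < warp_size);
  std::vector<std::size_t> positions;
  for (std::size_t pos = lane; pos < num_substrings; pos += warp_size) {
    positions.push_back(pos);
  }
  return positions;
}

// Position of the (row, seed) result in the flat hashes buffer.
inline std::size_t flat_index(std::size_t row, std::size_t seed_idx, std::size_t num_seeds)
{
  return row * num_seeds + seed_idx;
}

// Offsets for a lists column in which every row holds exactly num_seeds values.
inline std::vector<std::int32_t> list_offsets(std::int32_t num_rows, std::int32_t num_seeds)
{
  assert(num_rows >= 0 && num_seeds > 0);
  std::vector<std::int32_t> offsets(static_cast<std::size_t>(num_rows) + 1);
  for (std::int32_t i = 0; i <= num_rows; ++i) {
    offsets[static_cast<std::size_t>(i)] = i * num_seeds;
  }
  return offsets;
}
```

Because every lane of a warp writes into the same `num_seeds` slots of its row, the kernel has to combine results with `atomicMin`; the sketch only shows which positions each lane would contribute.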
1 change: 1 addition & 0 deletions cpp/tests/CMakeLists.txt
@@ -507,6 +507,7 @@ ConfigureTest(
TEXT_TEST
text/bpe_tests.cpp
text/edit_distance_tests.cpp
text/minhash_tests.cpp
text/ngrams_tests.cpp
text/ngrams_tokenize_tests.cpp
text/normalize_tests.cpp