Add nvtext::minhash function #12961

Merged: 89 commits, Apr 21, 2023
Changes from 9 commits
Commits
35885c5
Add nvtext::minhash function
davidwendt Mar 16, 2023
a014367
fix missing parameter name in function declaration
davidwendt Mar 16, 2023
a6b40f4
fix typo in doxygen comment
davidwendt Mar 17, 2023
9016f86
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 17, 2023
a31e107
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 17, 2023
51d64f8
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 17, 2023
5226d7f
add cython/python interface to nvtext::minhash
davidwendt Mar 17, 2023
21af847
fix style violations
davidwendt Mar 17, 2023
3d9ad5f
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 17, 2023
8329b92
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 20, 2023
9327813
add benchmark
davidwendt Mar 20, 2023
d441b0d
fix style violation
davidwendt Mar 20, 2023
3016c59
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 20, 2023
e138476
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 20, 2023
3330209
Merge branch 'text-minhashing' of github.com:davidwendt/cudf into tex…
davidwendt Mar 20, 2023
8d206fa
add doxygen group
davidwendt Mar 20, 2023
5ce07ba
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 21, 2023
921c225
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 21, 2023
202bf12
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 21, 2023
6efdf21
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 22, 2023
d7d947a
fix long strings issue
davidwendt Mar 22, 2023
33e5f33
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 22, 2023
9e66eec
rework as warp parallel kernel
davidwendt Mar 23, 2023
3f8b39c
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 23, 2023
c701df8
add multi-seed libcudf API
davidwendt Mar 24, 2023
21db66b
fix benchmark call
davidwendt Mar 24, 2023
bcc94e0
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 24, 2023
0e3c4e3
change cython/python to use multi-seed API
davidwendt Mar 24, 2023
d8f06f3
move const itr vars outside the for
davidwendt Mar 24, 2023
9fde3d4
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 24, 2023
41c14a2
switch for-loops and use atomicMin
davidwendt Mar 24, 2023
da78c2c
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 27, 2023
28cf712
support seeds default parameter
davidwendt Mar 27, 2023
491342d
add multi-seed support to benchmark
davidwendt Mar 27, 2023
312ed6f
Merge branch 'branch-23.04' into text-minhashing
davidwendt Mar 28, 2023
68660ec
add hash function parameter
davidwendt Mar 28, 2023
b0154eb
Merge branch 'branch-23.06' into text-minhashing
davidwendt Mar 29, 2023
a07a25b
fix dstr.length <= width edge case
davidwendt Mar 29, 2023
d65ad8f
add more tests
davidwendt Mar 29, 2023
b4ca094
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 3, 2023
256d7e3
move hash-id parameter to the end
davidwendt Apr 3, 2023
ff624dd
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 4, 2023
0423e82
fix race condition on initializing hash output
davidwendt Apr 4, 2023
0684ba8
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 5, 2023
e3f0aa5
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 6, 2023
bd5d660
add call to sanitize nulls
davidwendt Apr 6, 2023
013ef44
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 6, 2023
e9be32a
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 7, 2023
1297345
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 7, 2023
a00c959
use thrust::fill to init the output
davidwendt Apr 7, 2023
e192038
fix doxygen for multi-seed API
davidwendt Apr 7, 2023
019d183
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 7, 2023
49db1e3
fix some comments
davidwendt Apr 7, 2023
ea331a2
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 10, 2023
3303a88
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 11, 2023
20ead37
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 12, 2023
bc0102d
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 12, 2023
36e3a65
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 13, 2023
04b9229
use Optional[cudf.Series] declaration
davidwendt Apr 13, 2023
de0a93c
Merge branch 'text-minhashing' of github.com:davidwendt/cudf into tex…
davidwendt Apr 13, 2023
1bad8e2
add overflow check for seeds*input-rows
davidwendt Apr 13, 2023
a918b65
support std::size_t for for-each-n in warp-per-string functor
davidwendt Apr 13, 2023
b920e72
Merge branch 'text-minhashing' of github.com:davidwendt/cudf into tex…
davidwendt Apr 14, 2023
4c21c7c
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 14, 2023
f3a1d6c
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 14, 2023
6dcc042
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 14, 2023
472108b
add tests for error cases
davidwendt Apr 14, 2023
134aa70
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 14, 2023
1c564f4
Merge branch 'text-minhashing' of github.com:davidwendt/cudf into tex…
davidwendt Apr 16, 2023
94a52db
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 16, 2023
92943d6
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 16, 2023
386eea2
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 17, 2023
f49e60e
Merge branch 'text-minhashing' of github.com:davidwendt/cudf into tex…
davidwendt Apr 17, 2023
a82c074
fix style violation
davidwendt Apr 17, 2023
789dd67
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 17, 2023
6dc2430
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 18, 2023
5583c9a
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 18, 2023
f7aa11c
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 18, 2023
5975167
fix doxygen comments
davidwendt Apr 18, 2023
1a04853
Merge branch 'text-minhashing' of github.com:davidwendt/cudf into tex…
davidwendt Apr 18, 2023
5d457f0
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 18, 2023
0a06347
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 19, 2023
6fc724f
fix merge conflict
davidwendt Apr 20, 2023
43135a2
Merge branch 'text-minhashing' of github.com:davidwendt/cudf into tex…
davidwendt Apr 20, 2023
9a13a3c
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 20, 2023
4f1a1b0
remove unused cimport
davidwendt Apr 20, 2023
431897d
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 20, 2023
9685210
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 20, 2023
3a3656c
Merge branch 'branch-23.06' into text-minhashing
davidwendt Apr 20, 2023
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -578,6 +578,7 @@ add_library(
src/text/detokenize.cu
src/text/edit_distance.cu
src/text/generate_ngrams.cu
src/text/minhash.cu
src/text/ngrams_tokenize.cu
src/text/normalize.cu
src/text/replace.cu
52 changes: 52 additions & 0 deletions cpp/include/nvtext/minhash.hpp
@@ -0,0 +1,52 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <cudf/column/column.hpp>
#include <cudf/hashing.hpp>
#include <cudf/strings/strings_column_view.hpp>

namespace nvtext {
/**
* @addtogroup nvtext_minhash
* @{
* @file
*/

/**
* @brief Returns the minhash value for each string
*
* Hash values are computed from substrings of each string and the
* minimum hash value is returned for each string.
*
* Any null row entries result in corresponding null output rows.
*
* @param input Strings column to compute minhash
* @param width The character width of the substrings to hash;
* Any string shorter than this width is not hashed and produces 0.
* Default is 4 characters.
* @param seed Seed value used for the MurmurHash3_32 algorithm
* @param mr Device memory resource used to allocate the returned column's device memory
* @return Minhash values for each string in input
*/
std::unique_ptr<cudf::column> minhash(
  cudf::strings_column_view const& input,
  cudf::size_type width = 4,
  cudf::hash_value_type seed = cudf::DEFAULT_HASH_SEED,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of group
} // namespace nvtext
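
The contract documented in this header can be sketched on the host in a few lines of Python. This is only an illustration under stated assumptions: `minhash_reference` is a hypothetical name, and `zlib.crc32` stands in for MurmurHash3_32, so the values it produces will not match libcudf output. It takes the plain minimum over every `width`-character substring and returns 0 when the string is too short to hash.

```python
import zlib


def minhash_reference(text: str, width: int = 4, seed: int = 0) -> int:
    """Return the minimum hash over all `width`-character substrings of `text`.

    zlib.crc32 is a stand-in (assumption) for MurmurHash3_32, so the
    values differ from what nvtext::minhash returns.
    """
    if width < 2:
        raise ValueError("width must be 2 or greater")
    # A string shorter than `width` has no substring to hash.
    if len(text) < width:
        return 0
    # One hash per sliding window of `width` characters.
    hashes = (
        zlib.crc32(text[pos:pos + width].encode("utf-8"), seed)
        for pos in range(len(text) - width + 1)
    )
    return min(hashes)
```

For example, `minhash_reference("doc 1")` hashes the two substrings `"doc "` and `"oc 1"` and returns the smaller of the two values.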
104 changes: 104 additions & 0 deletions cpp/src/text/minhash.cu
@@ -0,0 +1,104 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <nvtext/minhash.hpp>

#include <cudf/column/column.hpp>
#include <cudf/column/column_device_view.cuh>
#include <cudf/column/column_factories.hpp>
#include <cudf/detail/hashing.hpp>
#include <cudf/detail/null_mask.hpp>
#include <cudf/detail/nvtx/ranges.hpp>
#include <cudf/detail/utilities/device_operators.cuh>
#include <cudf/detail/utilities/hash_functions.cuh>
#include <cudf/strings/string_view.cuh>
#include <cudf/utilities/default_stream.hpp>
#include <cudf/utilities/error.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>

#include <thrust/iterator/counting_iterator.h>
#include <thrust/transform.h>

namespace nvtext {
namespace detail {
namespace {

struct minhash_fn {
  cudf::column_device_view d_strings;
  cudf::size_type width;
  cudf::hash_value_type seed;

  __device__ cudf::hash_value_type operator()(cudf::size_type idx) const
  {
    if (d_strings.is_null(idx)) return 0;
    auto const d_str = d_strings.element<cudf::string_view>(idx);

    auto mh = cudf::hash_value_type{0};
    for (cudf::size_type pos = 0; pos < d_str.length() - (width - 1); ++pos) {
      auto const ss     = d_str.substr(pos, width);
      auto const hasher = cudf::detail::MurmurHash3_32<cudf::string_view>{seed};
      auto const hvalue = hasher(ss);
      // cudf::detail::hash_combine(seed, hasher(ss)); matches cudf::hash() result

      mh = mh > 0 ? cudf::detail::min(hvalue, mh) : hvalue;
    }

    return mh;
  }
};

} // namespace

std::unique_ptr<cudf::column> minhash(cudf::strings_column_view const& input,
                                      cudf::size_type width,
                                      cudf::hash_value_type seed,
                                      rmm::cuda_stream_view stream,
                                      rmm::mr::device_memory_resource* mr)
{
  CUDF_EXPECTS(width > 1, "Parameter width should be an integer value of 2 or greater");

  auto output_type = cudf::data_type{cudf::type_to_id<cudf::hash_value_type>()};
  if (input.is_empty()) { return cudf::make_empty_column(output_type); }

  auto const d_strings = cudf::column_device_view::create(input.parent(), stream);

  auto hashes =
    cudf::make_numeric_column(output_type, input.size(), cudf::mask_state::UNALLOCATED, stream, mr);
  auto d_hashes = hashes->mutable_view().data<cudf::hash_value_type>();

  auto const itr = thrust::make_counting_iterator<cudf::size_type>(0);
  auto const fn  = minhash_fn{*d_strings, width, seed};
  thrust::transform(rmm::exec_policy(stream), itr, itr + input.size(), d_hashes, fn);

  hashes->set_null_mask(cudf::detail::copy_bitmask(input.parent(), stream, mr), input.null_count());

  return hashes;
}

} // namespace detail

std::unique_ptr<cudf::column> minhash(cudf::strings_column_view const& input,
                                      cudf::size_type width,
                                      cudf::hash_value_type seed,
                                      rmm::mr::device_memory_resource* mr)
{
  CUDF_FUNC_RANGE();
  return detail::minhash(input, width, seed, cudf::get_default_stream(), mr);
}

} // namespace nvtext
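
One detail of `minhash_fn` worth illustrating is the running-minimum update `mh = mh > 0 ? min(hvalue, mh) : hvalue`, which uses 0 as an "unset" marker. The Python sketch below mirrors that update on a plain list of integers and compares it with one possible alternative that starts from the largest 32-bit value. Both function names are hypothetical, and the alternative is an assumption for illustration, not necessarily what the PR later adopted.

```python
UINT32_MAX = 0xFFFFFFFF


def min_with_zero_sentinel(hashes):
    """Mirror of the functor's update, with 0 meaning 'no value yet'."""
    mh = 0
    for hvalue in hashes:
        # Once mh becomes 0 it is treated as unset and gets overwritten.
        mh = min(hvalue, mh) if mh > 0 else hvalue
    return mh


def min_with_max_init(hashes):
    """Alternative (assumption): start at UINT32_MAX so 0 is an ordinary value."""
    mh = UINT32_MAX
    for hvalue in hashes:
        mh = min(hvalue, mh)
    return mh
```

With the input `[5, 0, 7]` the sentinel version returns 7 while the max-initialized version returns 0. A substring hash of exactly 0 should be rare for a 32-bit hash, so this is an edge case. Note that with the max-initialized form a string with no hashable substrings would yield `UINT32_MAX` rather than 0, so that case would need separate handling to keep the behavior the tests in this PR expect.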
1 change: 1 addition & 0 deletions cpp/tests/CMakeLists.txt
@@ -444,6 +444,7 @@ ConfigureTest(
TEXT_TEST
text/bpe_tests.cpp
text/edit_distance_tests.cpp
text/minhash_tests.cpp
text/ngrams_tests.cpp
text/ngrams_tokenize_tests.cpp
text/normalize_tests.cpp
56 changes: 56 additions & 0 deletions cpp/tests/text/minhash_tests.cpp
@@ -0,0 +1,56 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <cudf/column/column.hpp>
#include <cudf/strings/strings_column_view.hpp>
#include <nvtext/minhash.hpp>

#include <cudf_test/base_fixture.hpp>
#include <cudf_test/column_utilities.hpp>
#include <cudf_test/column_wrapper.hpp>

#include <vector>

struct MinHashTest : public cudf::test::BaseFixture {
};

TEST_F(MinHashTest, Basic)
{
  auto input = cudf::test::strings_column_wrapper({"doc 1", "", "this is doc 2", "", "doc 3"},
                                                  {1, 0, 1, 1, 1});

  auto view = cudf::strings_column_view(input);

  auto results = nvtext::minhash(view);

  auto expected = cudf::test::fixed_width_column_wrapper<cudf::hash_value_type>(
    {1207251914u, 0u, 21141582u, 0u, 1207251914u}, {1, 0, 1, 1, 1});
  CUDF_TEST_EXPECT_COLUMNS_EQUAL(*results, expected);
}

TEST_F(MinHashTest, EmptyTest)
{
  auto input   = cudf::make_empty_column(cudf::data_type{cudf::type_id::STRING});
  auto view    = cudf::strings_column_view(input->view());
  auto results = nvtext::minhash(view);
  EXPECT_EQ(results->size(), 0);
}

TEST_F(MinHashTest, ErrorsTest)
{
  auto input = cudf::test::strings_column_wrapper({"pup"});
  EXPECT_THROW(nvtext::minhash(cudf::strings_column_view(input), 0), cudf::logic_error);
}
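
The row-level cases these tests cover (null rows, strings shorter than `width`, an empty column, an invalid `width`) can be mimicked on the host. The sketch below is an assumption-laden illustration: `minhash_column` is a hypothetical helper, `None` models a null row, and `zlib.crc32` stands in for MurmurHash3_32, so the numeric values differ from the expected column in `Basic`.

```python
import zlib
from typing import List, Optional


def minhash_column(
    rows: List[Optional[str]], width: int = 4, seed: int = 0
) -> List[Optional[int]]:
    """Host-side model of the column behavior; crc32 is a stand-in hash."""
    if width < 2:
        raise ValueError("width must be 2 or greater")
    out: List[Optional[int]] = []
    for row in rows:
        if row is None:
            # Null input rows stay null in the output.
            out.append(None)
            continue
        # range() is empty when the row is shorter than `width`.
        hashes = [
            zlib.crc32(row[pos:pos + width].encode("utf-8"), seed)
            for pos in range(len(row) - width + 1)
        ]
        out.append(min(hashes) if hashes else 0)
    return out
```

Calling `minhash_column(["doc 1", None, "this is doc 2", "", "doc 3"])` yields a null in the second position and 0 in the fourth, matching the shape of the expected column above. In the real test, `"doc 1"` and `"doc 3"` share the value 1207251914, presumably because their common substring `"doc "` produces the smallest hash; the stand-in hash here does not guarantee the same coincidence.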
17 changes: 17 additions & 0 deletions python/cudf/cudf/_lib/cpp/nvtext/minhash.pxd
@@ -0,0 +1,17 @@
# Copyright (c) 2023, NVIDIA CORPORATION.

from libc.stdint cimport uint32_t
from libcpp.memory cimport unique_ptr

from cudf._lib.cpp.column.column cimport column
from cudf._lib.cpp.column.column_view cimport column_view
from cudf._lib.cpp.types cimport size_type


cdef extern from "nvtext/minhash.hpp" namespace "nvtext" nogil:

    cdef unique_ptr[column] minhash(
        const column_view &strings,
        size_type width,
        uint32_t seed
    ) except +
6 changes: 3 additions & 3 deletions python/cudf/cudf/_lib/nvtext/CMakeLists.txt
@@ -1,5 +1,5 @@
# =============================================================================
-# Copyright (c) 2022, NVIDIA CORPORATION.
+# Copyright (c) 2022-2023, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License at
@@ -12,8 +12,8 @@
# the License.
# =============================================================================

-set(cython_sources edit_distance.pyx generate_ngrams.pyx ngrams_tokenize.pyx normalize.pyx
-    replace.pyx stemmer.pyx subword_tokenize.pyx tokenize.pyx
+set(cython_sources edit_distance.pyx generate_ngrams.pyx minhash.pyx ngrams_tokenize.pyx
+    normalize.pyx replace.pyx stemmer.pyx subword_tokenize.pyx tokenize.pyx
)
set(linked_libraries cudf::cudf)
rapids_cython_create_modules(
33 changes: 33 additions & 0 deletions python/cudf/cudf/_lib/nvtext/minhash.pyx
@@ -0,0 +1,33 @@
# Copyright (c) 2023, NVIDIA CORPORATION.

from cudf.core.buffer import acquire_spill_lock

from libc.stdint cimport uint32_t
from libcpp.memory cimport unique_ptr
from libcpp.utility cimport move

from cudf._lib.column cimport Column
from cudf._lib.cpp.column.column cimport column
from cudf._lib.cpp.column.column_view cimport column_view
from cudf._lib.cpp.nvtext.minhash cimport minhash as cpp_minhash
from cudf._lib.cpp.types cimport size_type


@acquire_spill_lock()
def minhash(Column strings, int width, int seed=0):

    cdef column_view c_strings = strings.view()
    cdef size_type c_width = width
    cdef uint32_t c_seed = seed
    cdef unique_ptr[column] c_result

    with nogil:
        c_result = move(
            cpp_minhash(
                c_strings,
                c_width,
                c_seed
            )
        )

    return Column.from_unique_ptr(move(c_result))
3 changes: 2 additions & 1 deletion python/cudf/cudf/_lib/strings/__init__.py
@@ -1,9 +1,10 @@
-# Copyright (c) 2020-2022, NVIDIA CORPORATION.
+# Copyright (c) 2020-2023, NVIDIA CORPORATION.
from cudf._lib.nvtext.edit_distance import edit_distance, edit_distance_matrix
from cudf._lib.nvtext.generate_ngrams import (
generate_character_ngrams,
generate_ngrams,
)
+from cudf._lib.nvtext.minhash import minhash
from cudf._lib.nvtext.ngrams_tokenize import ngrams_tokenize
from cudf._lib.nvtext.normalize import normalize_characters, normalize_spaces
from cudf._lib.nvtext.replace import filter_tokens, replace_tokens
26 changes: 26 additions & 0 deletions python/cudf/cudf/core/column/string.py
@@ -5226,6 +5226,32 @@ def edit_distance_matrix(self) -> SeriesOrIndex:
            libstrings.edit_distance_matrix(self._column)
        )

    def minhash(self, n: int = 4, seed: int = 0) -> SeriesOrIndex:
        """
        Compute the minhash of a strings column.

        Parameters
        ----------
        n : int
            The width of the substring to hash.
            Default is 4 characters.
        seed : int
            The seed used for the hash algorithm.
            Default is 0.

        Examples
        --------
        >>> import cudf
        >>> str_series = cudf.Series(['this is my', 'favorite book'])
        >>> str_series.str.minhash()
        0    2012639418
        1     182731933
        dtype: uint32
        """
        return self._return_or_inplace(
            libstrings.minhash(self._column, n, seed)
        )


def _massage_string_arg(value, name, allow_col=False):
    if isinstance(value, str):