Add cudf::strings::compute_regex_state_memory API #10808

Closed
23 commits
ee00388
Add cudf::strings::compute_regex_state_memory API
davidwendt May 6, 2022
6b09fd2
reword doxygen wording words
davidwendt May 6, 2022
04f8806
fix wording again
davidwendt May 6, 2022
68dfe59
rephrase doxygen concerning output-rows
davidwendt May 9, 2022
1429d51
Merge branch 'branch-22.06' into regex-state-memory-api
davidwendt May 11, 2022
9eea981
Merge branch 'branch-22.06' into regex-state-memory-api
davidwendt May 11, 2022
edf4959
add config.hpp to meta.yaml
davidwendt May 11, 2022
ca875c5
use namespace cudf::strings statement
davidwendt May 11, 2022
60e6837
Merge branch 'branch-22.06' into regex-state-memory-api
davidwendt May 19, 2022
c9078c3
Merge branch 'branch-22.06' into regex-state-memory-api
davidwendt May 24, 2022
a1ad54f
Merge branch 'branch-22.06' into regex-state-memory-api
davidwendt May 24, 2022
2ea7c9b
Merge branch 'branch-22.08' into regex-state-memory-api
davidwendt May 25, 2022
209ab39
Merge branch 'branch-22.08' into regex-state-memory-api
davidwendt May 26, 2022
b846702
Merge branch 'branch-22.08' into regex-state-memory-api
davidwendt May 27, 2022
89aa8f9
Merge branch 'branch-22.08' into regex-state-memory-api
davidwendt May 31, 2022
395ad51
Merge branch 'branch-22.08' into regex-state-memory-api
davidwendt Jun 1, 2022
0229815
Merge branch 'branch-22.08' into regex-state-memory-api
davidwendt Jun 2, 2022
fd6f08a
Merge branch 'branch-22.08' into regex-state-memory-api
davidwendt Jun 3, 2022
42eac7d
Merge branch 'branch-22.08' into regex-state-memory-api
davidwendt Jun 7, 2022
25c5b21
Merge branch 'branch-22.08' into regex-state-memory-api
davidwendt Jun 13, 2022
50d6a96
Merge branch 'branch-22.08' into regex-state-memory-api
davidwendt Jun 15, 2022
25bc135
Merge branch 'branch-22.08' into regex-state-memory-api
davidwendt Jun 23, 2022
fa45c23
change std::string parm to std::string_view
davidwendt Jun 23, 2022
1 change: 1 addition & 0 deletions conda/recipes/libcudf/meta.yaml
@@ -232,6 +232,7 @@ outputs:
- test -f $PREFIX/include/cudf/strings/json.hpp
- test -f $PREFIX/include/cudf/strings/padding.hpp
- test -f $PREFIX/include/cudf/strings/regex/flags.hpp
- test -f $PREFIX/include/cudf/strings/regex/config.hpp
- test -f $PREFIX/include/cudf/strings/repeat_strings.hpp
- test -f $PREFIX/include/cudf/strings/replace.hpp
- test -f $PREFIX/include/cudf/strings/replace_re.hpp
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -492,6 +492,7 @@ add_library(
src/strings/filter_chars.cu
src/strings/padding.cu
src/strings/json/json_path.cu
src/strings/regex/config.cpp
src/strings/regex/regcomp.cpp
src/strings/regex/regexec.cu
src/strings/repeat_strings.cu
55 changes: 55 additions & 0 deletions cpp/include/cudf/strings/regex/config.hpp
@@ -0,0 +1,55 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <cudf/strings/regex/flags.hpp>
#include <cudf/strings/strings_column_view.hpp>

namespace cudf::strings {

/**
* @addtogroup strings_regex
* @{
*/

/**
* @brief Compute the working memory size for evaluating a regex pattern
* on a given strings column.
*
* This function returns the size in bytes of the memory needed to evaluate
* the given regex pattern in parallel over the returned output rows.
* The number of output rows will be less than or equal to the size of the
* input column.
*
* This function computes only the state data memory size required to process
* a regex pattern over the output row count.
* Specific functions that use regex may require additional working memory
* unrelated to the regex processing.
*
* @param input Strings instance
* @param pattern Regex pattern to be used
* @param flags Regex flags for interpreting special characters in the pattern
* @return Size of the state memory in bytes required for processing `pattern` on `input`
* and the number of concurrent rows this memory will support
*/
std::pair<std::size_t, size_type> compute_regex_state_memory(

Member

This API seems fine for a "just in time" check for how much memory the call will take, but currently the RAPIDS Accelerator does "up front" planning on the driver node (which does not necessarily have a GPU). At that point we don't know what the column data will be, just the pattern that will be used.

Is there a way to get a worst-case ballpark estimate of what memory will be used based solely on the pattern? Alternatively, it would be great if we could specify a maximum amount of memory we want the regex to use internally, and the concurrent rows would shrink to fit within that memory (or fail if it cannot).

Contributor Author

Right now (with this API) you could call it with 1 row and do the math to help estimate the max size.
I'm trying to avoid another API to set the size unless absolutely necessary.
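
As a rough illustration of the "call it with 1 row and do the math" idea above, the sketch below scales a measured single-row size up to a larger row count. The helper name `extrapolate_state_memory` is hypothetical and not part of cudf. Linear scaling is an assumption, not something this PR states: the `Basic` test in this PR expects 736 bytes for 6 rows, which is not a multiple of 6, so treat the result as a ballpark figure only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper (not a cudf API): extrapolates the state-memory size
// measured for one row to `rows` rows, ASSUMING the size scales linearly.
// Returns std::nullopt if the product does not fit in std::size_t.
std::optional<std::size_t> extrapolate_state_memory(std::size_t bytes_for_one_row,
                                                    std::int32_t rows)
{
  assert(rows >= 0);  // row counts are never negative
  auto const n = static_cast<std::size_t>(rows);
  if (n != 0 && bytes_for_one_row > std::numeric_limits<std::size_t>::max() / n) {
    return std::nullopt;  // multiplication would overflow
  }
  return bytes_for_one_row * n;
}
```

In practice `bytes_for_one_row` would come from a one-row call to the API under review, which, as noted later in this thread, still requires a device.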

Member

But doesn't the row require having a device? We cannot require the driver node has a device.

Contributor Author (davidwendt, May 11, 2022)

> But doesn't the row require having a device? We cannot require the driver node has a device.

This API does require a device.

Member

Currently the API requires a strings_column_view, but only to get the number of rows in it. Can we pass the number of rows directly along with the pattern, rather than a type that requires a device?

Contributor Author

Yes, that sounds reasonable.

Contributor

@jlowe is it worthwhile (or even possible) to give this API a shot in the Spark accelerator before we merge, rather than merging an API that we end up just removing a release or two later? It sounds like there are questions about its utility, so it would be nice to do our due diligence first if that is possible.

Contributor

I think it is possible to use this, but I am not sure about priorities for it right now. To use it, we would essentially have to implement the API we want (#10852) on top of this one, but in a worse way: before running the regexp, call this and see how large the memory usage would be. If it is too large, slice the input into smaller pieces and call it again to check whether we guessed the slice point correctly. If it looks good, run the regexp kernel and then concatenate the results.
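
The estimate-then-slice loop described above could look roughly like the sketch below. It is only an illustration under stated assumptions: `plan_slices` and `estimator_fn` are hypothetical names, the estimator callback stands in for a call such as `compute_regex_state_memory`, and halving the slice size is one possible guessing strategy, not something the comment prescribes. Running the regex on each slice and concatenating the results is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the memory query: maps a row count to the
// state-memory bytes needed to process that many rows.
using estimator_fn = std::function<std::size_t(std::int32_t)>;

// Returns [begin, end) row ranges whose estimated state memory each fits in
// `budget_bytes`. Returns an empty vector if there are no rows, or if even a
// single row exceeds the budget.
std::vector<std::pair<std::int32_t, std::int32_t>> plan_slices(std::int32_t total_rows,
                                                                std::size_t budget_bytes,
                                                                estimator_fn const& estimate)
{
  assert(total_rows >= 0);
  std::vector<std::pair<std::int32_t, std::int32_t>> slices;
  if (total_rows == 0) { return slices; }

  // Start with the whole input and halve (rounding up) until the estimate fits.
  std::int32_t rows_per_slice = total_rows;
  while (estimate(rows_per_slice) > budget_bytes) {
    if (rows_per_slice == 1) { return {}; }  // cannot shrink any further
    rows_per_slice -= rows_per_slice / 2;
  }

  // Cut the input into consecutive ranges of at most rows_per_slice rows.
  std::int32_t begin = 0;
  while (begin < total_rows) {
    auto const remaining = total_rows - begin;
    auto const count     = remaining < rows_per_slice ? remaining : rows_per_slice;
    slices.emplace_back(begin, begin + count);
    begin += count;
  }
  return slices;
}
```

Each probe in the first loop corresponds to one "call it again to check the guess" step, which is the repeated cost the comment objects to.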

Contributor

@revans2 so what is the best path forward here? Should we leave this PR up as is until the Spark team has a chance to decide on priorities and maybe give this a whirl?

Contributor

I really don't see us using this in the current form. We can keep the code around on this branch in case things change and we have to resurrect it. But I don't see a reason to check it in or keep the PR open.

strings_column_view const& input,
std::string_view pattern,
regex_flags const flags = regex_flags::DEFAULT);

/** @} */ // end of doxygen group

} // namespace cudf::strings
4 changes: 2 additions & 2 deletions cpp/include/cudf/strings/regex/flags.hpp
@@ -1,5 +1,5 @@
/*
- * Copyright (c) 2021, NVIDIA CORPORATION.
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -21,7 +21,7 @@ namespace cudf {
namespace strings {

/**
- * @addtogroup strings_contains
+ * @addtogroup strings_regex
* @{
*/

1 change: 1 addition & 0 deletions cpp/include/doxygen_groups.h
@@ -129,6 +129,7 @@
* @defgroup strings_replace Replacing
* @defgroup strings_split Splitting
* @defgroup strings_json JSON
* @defgroup strings_regex Regex Config
* @}
* @defgroup dictionary_apis Dictionary
* @{
46 changes: 46 additions & 0 deletions cpp/src/strings/regex/config.cpp
@@ -0,0 +1,46 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include "regex.cuh"

#include <cudf/detail/nvtx/ranges.hpp>
#include <cudf/strings/regex/config.hpp>

#include <rmm/cuda_stream_view.hpp>

namespace cudf::strings {
namespace detail {

std::pair<std::size_t, size_type> compute_regex_state_memory(strings_column_view const& input,
std::string_view pattern,
regex_flags const flags,
rmm::cuda_stream_view stream)
{
auto const d_prog = reprog_device::create(pattern, flags, stream);
return d_prog->compute_strided_working_memory(input.size());
}

} // namespace detail

std::pair<std::size_t, size_type> compute_regex_state_memory(strings_column_view const& input,
std::string_view pattern,
regex_flags const flags)
{
CUDF_FUNC_RANGE();
return detail::compute_regex_state_memory(input, pattern, flags, rmm::cuda_stream_default);
}

} // namespace cudf::strings
2 changes: 1 addition & 1 deletion cpp/src/strings/regex/regexec.cu
@@ -152,7 +152,7 @@ std::pair<std::size_t, int32_t> reprog_device::compute_strided_working_memory(
thread_count = min_rows;
buffer_size = working_memory_size(thread_count);
}
- return std::make_pair(buffer_size, thread_count);
+ return std::pair(buffer_size, thread_count);
}

void reprog_device::set_working_memory(void* buffer, int32_t thread_count, int32_t max_insts)
2 changes: 1 addition & 1 deletion cpp/src/strings/regex/utilities.cuh
@@ -140,7 +140,7 @@ auto make_strings_children(SizeAndExecuteFunction size_and_exec_fn,
size_and_exec_fn, d_prog, strings_count);
}

- return std::make_pair(std::move(offsets), std::move(chars));
+ return std::pair(std::move(offsets), std::move(chars));
}

} // namespace detail
1 change: 1 addition & 0 deletions cpp/tests/CMakeLists.txt
@@ -406,6 +406,7 @@ ConfigureTest(
strings/ipv4_tests.cpp
strings/json_tests.cpp
strings/pad_tests.cpp
strings/regex_config_tests.cpp
strings/repeat_strings_tests.cpp
strings/replace_regex_tests.cpp
strings/replace_tests.cpp
70 changes: 70 additions & 0 deletions cpp/tests/strings/regex_config_tests.cpp
@@ -0,0 +1,70 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <cudf/strings/regex/config.hpp>
#include <cudf/strings/strings_column_view.hpp>

#include <cudf_test/base_fixture.hpp>
#include <cudf_test/column_wrapper.hpp>

#include <rmm/device_uvector.hpp>

#include <string>

struct StringsRegexConfigTest : public cudf::test::BaseFixture {
};

TEST_F(StringsRegexConfigTest, Basic)
{
cudf::test::strings_column_wrapper input({"abc", "", "defghijk", "lmnop", "", "qrstuvwxyz"},
{1, 1, 1, 1, 0, 1});
auto sv = cudf::strings_column_view(input);

auto results = cudf::strings::compute_regex_state_memory(sv, "hello");
EXPECT_EQ(results.first, 736);
EXPECT_EQ(results.second, sv.size());

results = cudf::strings::compute_regex_state_memory(sv, "");
EXPECT_EQ(results.first, 160);
EXPECT_EQ(results.second, sv.size());
}

TEST_F(StringsRegexConfigTest, Large)
{
auto const d_chars = rmm::device_uvector<char>{0, rmm::cuda_stream_default};
auto const d_offsets = cudf::detail::make_zeroed_device_uvector_sync<cudf::size_type>(16000001);
auto const d_nulls = rmm::device_uvector<cudf::bitmask_type>{0, rmm::cuda_stream_default};
auto const input = cudf::make_strings_column(d_chars, d_offsets, d_nulls, 0);
auto const sv = cudf::strings_column_view(input->view());

std::string pattern =
"a very large regular expression pattern whose contents do not really matter as much as the "
"length does";

auto results = cudf::strings::compute_regex_state_memory(sv, pattern);
EXPECT_EQ(results.first, 8344000000);
EXPECT_EQ(results.second, sv.size() / 4);
}

TEST_F(StringsRegexConfigTest, Empty)
{
auto empty_col = cudf::make_empty_column(cudf::type_id::STRING);
auto sv = cudf::strings_column_view(empty_col->view());

auto results = cudf::strings::compute_regex_state_memory(sv, "a");
EXPECT_EQ(results.first, 0);
EXPECT_EQ(results.second, sv.size());
}