Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add regex rewrite kernel to find literal[a,b]{x,y} in a string #2041

Merged
merged 18 commits into from
May 21, 2024

Conversation

thirtiseven
Copy link
Collaborator

@thirtiseven thirtiseven commented May 15, 2024

Related plugin pr: NVIDIA/spark-rapids#10822

This pr adds a custom kernel to rewrite regex patterns like literal[a-b]{x,y} to get performance gain for rlike in spark-rapids.

It checks if each string contains a substring that starts with a given prefix and follows at least x characters whose codepoints are in the range of [a,b].

Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven thirtiseven marked this pull request as ready for review May 16, 2024 09:55
extern "C" {

JNIEXPORT jlong JNICALL Java_com_nvidia_spark_rapids_jni_RegexRewriteUtils_literalRangePattern(
JNIEnv* env, jclass, jlong column_view, jlong target, jint d, jint start, jint end)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not use the existing class name as variable name to avoid potential name clashing.

Suggested change
JNIEnv* env, jclass, jlong column_view, jlong target, jint d, jint start, jint end)
JNIEnv* env, jclass, jlong input, jlong target, jint d, jint start, jint end)

Comment on lines 62 to 63
template <typename BoolFunction>
std::unique_ptr<cudf::column> literal_range_pattern_fn(cudf::strings_column_view const& strings,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functions/classes used only within the current file need to be wrapped in anonymous namespace to avoid symbol conflict.

int const start,
int const end,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please start using device_async_resource_ref.

Suggested change
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource());

NativeDepsLoader.loadNativeDeps();
}

public static ColumnVector literalRangePattern(ColumnVector input, Scalar pattern, int d, int start, int end) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, please add docs for this function.

Comment on lines 19 to 30
#include <cudf/column/column_device_view.cuh>
#include <cudf/column/column_factories.hpp>
#include <cudf/detail/iterator.cuh>
#include <cudf/detail/null_mask.hpp>
#include <cudf/detail/nvtx/ranges.hpp>
#include <cudf/scalar/scalar_factories.hpp>
#include <cudf/strings/detail/utilities.hpp>
#include <cudf/strings/find.hpp>
#include <cudf/strings/string_view.cuh>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf/utilities/default_stream.hpp>
#include <cudf/utilities/error.hpp>
Copy link
Collaborator

@ttnghia ttnghia May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all of these headers are used in this file. Please remove the unused ones.

import ai.rapids.cudf.Scalar;
import org.junit.jupiter.api.Test;

import com.nvidia.spark.rapids.jni.JSONUtils;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used here?

Comment on lines 17 to 41
#include <cudf/column/column_device_view.cuh>
#include <cudf/column/column_factories.hpp>
#include <cudf/detail/iterator.cuh>
#include <cudf/detail/null_mask.hpp>
#include <cudf/detail/nvtx/ranges.hpp>
#include <cudf/detail/utilities/cuda.cuh>
#include <cudf/scalar/scalar_factories.hpp>
#include <cudf/strings/detail/utf8.hpp>
#include <cudf/strings/detail/utilities.hpp>
#include <cudf/strings/find.hpp>
#include <cudf/strings/string_view.cuh>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf/utilities/default_stream.hpp>
#include <cudf/utilities/error.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>

#include <cuda/atomic>
#include <thrust/binary_search.h>
#include <thrust/fill.h>
#include <thrust/for_each.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/transform.h>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also see a lot of unused headers here. Please remove them.

Comment on lines 55 to 56
* @param start Minimum code point value to check for in the range.
* @param end Maximum code point value to check for in the range.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is code point?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plugin will pass utf-8 codepoints to jni of start and end in regex pattern literal[start-end]{len,}

rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
{
auto strings_count = strings.size();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto strings_count = strings.size();
auto const strings_count = strings.size();

Comment on lines 77 to 79
auto d_prefix = cudf::string_view(prefix.data(), prefix.size());
auto strings_column = cudf::column_device_view::create(strings.parent(), stream);
auto d_strings = *strings_column;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

d_strings should be a reference to avoid copying.

Suggested change
auto d_prefix = cudf::string_view(prefix.data(), prefix.size());
auto strings_column = cudf::column_device_view::create(strings.parent(), stream);
auto d_strings = *strings_column;
auto const d_prefix = cudf::string_view(prefix.data(), prefix.size());
auto const strings_column = cudf::column_device_view::create(strings.parent(), stream);
auto const& d_strings = *strings_column;

auto d_prefix = cudf::string_view(prefix.data(), prefix.size());
auto strings_column = cudf::column_device_view::create(strings.parent(), stream);
auto d_strings = *strings_column;
// create output column
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is very trivial 😄

Suggested change
// create output column

Comment on lines 87 to 88
auto results_view = results->mutable_view();
auto d_results = results_view.data<bool>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto results_view = results->mutable_view();
auto d_results = results_view.data<bool>();
auto const d_results = results->mutable_view().data<bool>();

rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
{
auto pfn =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto pfn =
auto const pfn =

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, it is better to define this lamba as a separate functor in an anonymous namespace:

Suggested change
auto pfn =
namespace {
struct literal_range_pattern_fn {
__device__ bool operator()(...) const {
....
}
};
}

The template function literal_range_pattern_fn above can be renamed into something like:

namespace {
  std::unique_ptr<cudf::column> find_literal_range_pattern(....) { ... }
}

Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

@ttnghia Thanks for the review, I think all addressed now. Will check those comments like using const and removing useless headers locally first next time 👀

Copy link
Collaborator

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I have a couple of more requests.

src/main/cpp/src/regex_rewrite_utils.cu Show resolved Hide resolved
src/main/cpp/src/regex_rewrite_utils.cu Outdated Show resolved Hide resolved
Comment on lines +82 to +87
auto results = make_numeric_column(cudf::data_type{cudf::type_id::BOOL8},
strings_count,
cudf::detail::copy_bitmask(strings.parent(), stream, mr),
strings.null_count(),
stream,
mr);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that the file was not properly formated (using clang-format).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It passed clang-format in the pre-commit hook

src/main/cpp/src/regex_rewrite_utils.cu Outdated Show resolved Hide resolved
thirtiseven and others added 2 commits May 21, 2024 13:24
Signed-off-by: Haoyang Li <[email protected]>
ttnghia
ttnghia previously approved these changes May 21, 2024
@thirtiseven
Copy link
Collaborator Author

build

Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

build

@thirtiseven thirtiseven requested a review from ttnghia May 21, 2024 16:43
@thirtiseven thirtiseven merged commit 956fb53 into NVIDIA:branch-24.06 May 21, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants