Add regex rewrite kernel to find literal[a,b]{x,y} in a string #2041
Signed-off-by: Haoyang Li <[email protected]>
extern "C" {

JNIEXPORT jlong JNICALL Java_com_nvidia_spark_rapids_jni_RegexRewriteUtils_literalRangePattern(
  JNIEnv* env, jclass, jlong column_view, jlong target, jint d, jint start, jint end)
Do not use an existing class name as a variable name, to avoid potential name clashes.
Suggested change:
- JNIEnv* env, jclass, jlong column_view, jlong target, jint d, jint start, jint end)
+ JNIEnv* env, jclass, jlong input, jlong target, jint d, jint start, jint end)
template <typename BoolFunction>
std::unique_ptr<cudf::column> literal_range_pattern_fn(cudf::strings_column_view const& strings,
Functions and classes used only within the current file should be wrapped in an anonymous namespace to avoid symbol conflicts.
int const start,
int const end,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
Please start using device_async_resource_ref.
Suggested change:
- rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
+ rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource());
    NativeDepsLoader.loadNativeDeps();
  }

  public static ColumnVector literalRangePattern(ColumnVector input, Scalar pattern, int d, int start, int end) {
Similarly, please add docs for this function.
#include <cudf/column/column_device_view.cuh>
#include <cudf/column/column_factories.hpp>
#include <cudf/detail/iterator.cuh>
#include <cudf/detail/null_mask.hpp>
#include <cudf/detail/nvtx/ranges.hpp>
#include <cudf/scalar/scalar_factories.hpp>
#include <cudf/strings/detail/utilities.hpp>
#include <cudf/strings/find.hpp>
#include <cudf/strings/string_view.cuh>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf/utilities/default_stream.hpp>
#include <cudf/utilities/error.hpp>
Not all of these headers are used in this file. Please remove the unused ones.
import ai.rapids.cudf.Scalar;
import org.junit.jupiter.api.Test;

import com.nvidia.spark.rapids.jni.JSONUtils;
Is it used here?
#include <cudf/column/column_device_view.cuh>
#include <cudf/column/column_factories.hpp>
#include <cudf/detail/iterator.cuh>
#include <cudf/detail/null_mask.hpp>
#include <cudf/detail/nvtx/ranges.hpp>
#include <cudf/detail/utilities/cuda.cuh>
#include <cudf/scalar/scalar_factories.hpp>
#include <cudf/strings/detail/utf8.hpp>
#include <cudf/strings/detail/utilities.hpp>
#include <cudf/strings/find.hpp>
#include <cudf/strings/string_view.cuh>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf/utilities/default_stream.hpp>
#include <cudf/utilities/error.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>

#include <cuda/atomic>
#include <thrust/binary_search.h>
#include <thrust/fill.h>
#include <thrust/for_each.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/transform.h>
I also see a lot of unused headers here. Please remove them.
 * @param start Minimum code point value to check for in the range.
 * @param end Maximum code point value to check for in the range.
What is a code point?
The plugin will pass the UTF-8 code points of start and end from the regex pattern literal[start-end]{len,} to the JNI.
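To make the term concrete, the sketch below contrasts a character's code point with the bytes that encode it in a UTF-8 string. It assumes a hypothetical pattern abc[alpha-omega]{2,} over the Greek lowercase letters and assumes that "code point" means the Unicode scalar value; the names range_start, range_end, and alpha_utf8 are illustrative and do not come from the PR, and the PR itself may compare a different integer representation of each character.

```cpp
#include <cassert>
#include <string>

// Illustrative values only, for a hypothetical pattern abc[alpha-omega]{2,}.
// Escapes are used so the example does not depend on the source file encoding.
constexpr char32_t range_start = U'\u03B1';  // Greek small alpha, code point U+03B1
constexpr char32_t range_end   = U'\u03C9';  // Greek small omega, code point U+03C9

// The same character alpha as stored in a UTF-8 string: two bytes, 0xCE 0xB1.
inline std::string alpha_utf8() { return "\xCE\xB1"; }

void code_point_example()
{
  // The code point is a single integer per character...
  assert(static_cast<unsigned>(range_start) == 945u);
  assert(static_cast<unsigned>(range_end) == 969u);
  // ...while its UTF-8 encoding may span several bytes.
  assert(alpha_utf8().size() == 2);
}
```

Under these assumptions, start and end would be 945 and 969 for this pattern, independent of how many bytes each character occupies in the string.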
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
{
  auto strings_count = strings.size();
Suggested change:
- auto strings_count = strings.size();
+ auto const strings_count = strings.size();
auto d_prefix = cudf::string_view(prefix.data(), prefix.size());
auto strings_column = cudf::column_device_view::create(strings.parent(), stream);
auto d_strings = *strings_column;
d_strings should be a reference to avoid copying.
Suggested change:
- auto d_prefix = cudf::string_view(prefix.data(), prefix.size());
- auto strings_column = cudf::column_device_view::create(strings.parent(), stream);
- auto d_strings = *strings_column;
+ auto const d_prefix = cudf::string_view(prefix.data(), prefix.size());
+ auto const strings_column = cudf::column_device_view::create(strings.parent(), stream);
+ auto const& d_strings = *strings_column;
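The difference between binding the dereferenced pointer by value and by reference can be shown with a small host-only sketch. The type counted and the function copies_made are hypothetical and exist only to count copy constructions; the sketch says nothing about how expensive a copy of the actual device view is.

```cpp
#include <cassert>
#include <memory>

namespace {

int g_copies = 0;  // number of copy constructions observed

// Hypothetical stand-in type that records every copy made of it.
struct counted {
  counted() = default;
  counted(counted const&) { ++g_copies; }
};

// Returns how many copies were made by the two bindings below.
int copies_made()
{
  g_copies = 0;
  auto const owner = std::make_unique<counted>();
  auto by_value      = *owner;  // deduces counted: copy-constructs a new object
  auto const& by_ref = *owner;  // deduces counted const&: binds to the existing object
  (void)by_value;
  (void)by_ref;
  return g_copies;
}

}  // namespace
```

Here only the by-value binding triggers the copy constructor, so copies_made() returns 1.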
auto d_prefix = cudf::string_view(prefix.data(), prefix.size());
auto strings_column = cudf::column_device_view::create(strings.parent(), stream);
auto d_strings = *strings_column;
// create output column
It is very trivial 😄
Suggested change:
- // create output column
auto results_view = results->mutable_view();
auto d_results = results_view.data<bool>();
Suggested change:
- auto results_view = results->mutable_view();
- auto d_results = results_view.data<bool>();
+ auto const d_results = results->mutable_view().data<bool>();
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
{
  auto pfn =
Suggested change:
- auto pfn =
+ auto const pfn =
In addition, it is better to define this lambda as a separate functor in an anonymous namespace:
Suggested change:
- auto pfn =
+ namespace {
+ struct literal_range_pattern_fn {
+   __device__ bool operator()(...) const {
+     ....
+   }
+ };
+ }
The template function literal_range_pattern_fn above can then be renamed to something like:
namespace {
std::unique_ptr<cudf::column> find_literal_range_pattern(....) { ... }
}
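The shape of this refactoring can be sketched on the host, with std::transform standing in for the device-side transform. Everything below is an assumption made for illustration: starts_with_fn and find_prefix are hypothetical names, the predicate is a simple prefix test rather than the PR's matching logic, and no __device__ qualifier or cudf type is involved.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

namespace {

// Hypothetical host-side functor; a named struct replaces the inline lambda.
struct starts_with_fn {
  std::string prefix;

  // True if `s` begins with `prefix`.
  bool operator()(std::string const& s) const
  {
    return s.compare(0, prefix.size(), prefix) == 0;
  }
};

// File-local driver that applies the functor to every input string.
std::vector<bool> find_prefix(std::vector<std::string> const& input, std::string const& prefix)
{
  std::vector<bool> out(input.size());
  std::transform(input.begin(), input.end(), out.begin(), starts_with_fn{prefix});
  return out;
}

}  // namespace
```

Because both names live in the anonymous namespace, they have internal linkage and cannot collide with identically named symbols in other translation units.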
Signed-off-by: Haoyang Li <[email protected]>
@ttnghia Thanks for the review. I think all comments are addressed now. Next time I will check for issues like missing const and unused headers locally first 👀
Sorry, I have a couple more requests.
auto results = make_numeric_column(cudf::data_type{cudf::type_id::BOOL8},
strings_count,
cudf::detail::copy_bitmask(strings.parent(), stream, mr),
strings.null_count(),
stream,
mr);
I suspect that the file was not properly formatted (using clang-format).
It passed clang-format in the pre-commit hook
Co-authored-by: Nghia Truong <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
build
Signed-off-by: Haoyang Li <[email protected]>
build
Related plugin PR: NVIDIA/spark-rapids#10822
This PR adds a custom kernel to rewrite regex patterns like literal[a-b]{x,y} to get a performance gain for rlike in spark-rapids. It checks whether each string contains a substring that starts with a given prefix and is followed by at least x characters whose code points are in the range [a,b].
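The check described above can be sketched as a host-side reference in plain C++. This is not the PR's CUDA kernel: the function names next_code_point and matches_literal_range are hypothetical, the input is assumed to be well-formed UTF-8, and "code point" is assumed to mean the Unicode scalar value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes one UTF-8 sequence starting at s[pos] and advances pos past it.
// Assumes well-formed UTF-8; hypothetical helper, not part of the PR.
inline std::uint32_t next_code_point(std::string const& s, std::size_t& pos)
{
  auto const b0    = static_cast<unsigned char>(s[pos]);
  int extra        = 0;   // number of continuation bytes
  std::uint32_t cp = b0;  // single-byte (ASCII) case
  if (b0 >= 0xF0) { extra = 3; cp = b0 & 0x07; }
  else if (b0 >= 0xE0) { extra = 2; cp = b0 & 0x0F; }
  else if (b0 >= 0xC0) { extra = 1; cp = b0 & 0x1F; }
  ++pos;
  for (int i = 0; i < extra && pos < s.size(); ++i, ++pos) {
    cp = (cp << 6) | (static_cast<unsigned char>(s[pos]) & 0x3F);
  }
  return cp;
}

// True if `input` contains `literal` immediately followed by at least
// `min_len` characters whose code points fall in [start, end].
inline bool matches_literal_range(std::string const& input,
                                  std::string const& literal,
                                  int min_len,
                                  std::uint32_t start,
                                  std::uint32_t end)
{
  for (std::size_t at = input.find(literal); at != std::string::npos;
       at = input.find(literal, at + 1)) {
    std::size_t pos = at + literal.size();
    int run         = 0;  // in-range characters seen right after the literal
    while (run < min_len && pos < input.size()) {
      auto const cp = next_code_point(input, pos);
      if (cp < start || cp > end) { break; }
      ++run;
    }
    if (run >= min_len) { return true; }
  }
  return false;
}

void reference_examples()
{
  assert(matches_literal_range("abc123", "abc", 3, U'0', U'9'));
  assert(!matches_literal_range("abc12x", "abc", 3, U'0', U'9'));
  // The first occurrence of the literal fails, a later one succeeds.
  assert(matches_literal_range("xabc1 abc123", "abc", 3, U'0', U'9'));
}
```

The loop retries at every occurrence of the literal, which mirrors the "contains" semantics of the description: one failed occurrence does not rule out a later match in the same string.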