feat(function): Add Spark locate function #8863
This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions! |
aa854a2
to
391b3b3
Compare
@rui-mo, will you continue this PR?
@zhli1142015 Yes, this PR is ready overall. Would you like to review it? Thanks.
velox/functions/sparksql/String.h
const arg_type<Varchar>& subString,
const arg_type<Varchar>& string,
const arg_type<int32_t>& start) {
result = stringImpl::stringPosition<true /*isAscii*/>(
Maybe it's better to call findNthInstanceByteIndexFromStart here, which supports a start position parameter.
stringPosition calls findNthInstanceByteIndexFromStart, and it also handles edge cases such as an empty substring and an invalid start, so we can avoid duplicate code.
I see your point
Passes the 1-minute spark expression fuzzer test.
}

// The string to search for substring.
auto view = std::string_view(
Maybe we can pass the startByteIndex to findNthInstanceByteIndexFromStart instead of creating a new string_view object.
findNthInstanceByteIndexFromEnd does not accept the startByteIndex parameter, and we also use view at L255.
Ok, stringPosition also supports searching from the end.
But it's a little weird: this function searches from the end, yet the start parameter has the value 1.
I understand your point. Updated this PR so that start is treated as the end position when searching from the end. Thanks.
3ced56f
to
ae30a78
Compare
b002b11
to
327a582
Compare
Hi @mbasmanova, would you like to review this PR? Thanks!
@rui-mo Some questions.
after position ``start``. The search is from the beginning of ``string`` to the end.
The given ``start`` and return value are 1-based. The following rules are listed in order of priority:

Returns 0 if ``start`` is NULL. Returns NULL if ``substring`` or ``string`` is NULL.
Do these rules match Spark? Is this specific to a particular version, or do all versions behave the same way?
Yes, it matches Spark. We are testing against Spark 3.5.1, and the versions before Spark 3.5.1 have the same behavior.
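For readers following this thread, the priority rules quoted from the documentation can be illustrated with a minimal sketch. This is not the code in this PR: `locateSketch` is a hypothetical helper, it models NULL with `std::optional`, and it searches bytes only (ASCII input). The `start < 1` branch is an assumption based on the linked Spark implementation; other edge cases (for example an empty substring) are not modeled and may differ from Spark.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical sketch of the rules quoted above; byte-based, ASCII only.
// std::nullopt stands in for SQL NULL.
std::optional<int32_t> locateSketch(
    std::optional<std::string_view> subString,
    std::optional<std::string_view> string,
    std::optional<int32_t> start) {
  // Highest priority: a NULL start yields 0.
  if (!start.has_value()) {
    return 0;
  }
  // Next: a NULL substring or string yields NULL.
  if (!subString.has_value() || !string.has_value()) {
    return std::nullopt;
  }
  // Assumption (not quoted in this thread): a start below 1 yields 0.
  if (*start < 1) {
    return 0;
  }
  // 'start' is 1-based, std::string_view::find is 0-based.
  const auto offset = static_cast<std::size_t>(*start) - 1;
  const auto pos = string->find(*subString, offset);
  if (pos == std::string_view::npos) {
    return 0;
  }
  // Convert back to a 1-based position.
  return static_cast<int32_t>(pos) + 1;
}
```

Note that the NULL check on `start` comes first, so a NULL `start` returns 0 even when the other arguments are NULL as well.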
const T& string,
const T& subString,
int64_t instance = 0,
std::optional<int64_t> startPosition = std::nullopt) {
I wonder if it makes sense to replace the (string, startPosition) arguments with a single std::string_view argument.
Thanks for the suggestion. Updated to use a std::string_view argument.
@rui-mo Thanks.
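The idea in this thread, passing one `std::string_view` instead of a (string, startPosition) pair, might look roughly like the sketch below. Both function names are hypothetical and are not the Velox helpers; the point is only that the caller narrows the view and then adds the offset back to the result.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical search helper that takes a single view and knows nothing
// about a start position. Returns a 0-based byte index, or -1 if not found.
int64_t findFirstSketch(std::string_view haystack, std::string_view needle) {
  const auto pos = haystack.find(needle);
  return pos == std::string_view::npos ? -1 : static_cast<int64_t>(pos);
}

// Hypothetical caller: narrows the view to begin at 'startByteIndex', then
// shifts the result so it is relative to the original string again.
int64_t findFromSketch(
    std::string_view string,
    std::string_view needle,
    std::size_t startByteIndex) {
  if (startByteIndex > string.size()) {
    return -1;
  }
  const int64_t pos = findFirstSketch(string.substr(startByteIndex), needle);
  return pos < 0 ? -1 : pos + static_cast<int64_t>(startByteIndex);
}
```

Creating the sub-view is cheap because `std::string_view::substr` only adjusts a pointer and a length; no characters are copied.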
velox/functions/sparksql/String.h
// Unicode string "😋😋😋", each character occupies 4 bytes. When 'start' is
// 2, the 'startByteIndex' is 4 which specifies the start of the second
// character.
const char* pos = string->data();
I thought we had a helper function to find the location of the N-th char.
Yes, I found the helper function 'stringCore::cappedByteLengthUnicode' and changed the code to use it. Thanks for the reminder.
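The mapping from a 1-based character position to a byte index, as described in the code comment under review, can be sketched as follows. `startByteIndexSketch` is a hypothetical stand-in written for illustration; it is not `stringCore::cappedByteLengthUnicode` and does not claim to match that helper's signature or behavior. It assumes valid UTF-8 input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical sketch: returns the byte index at which the 'start'-th
// (1-based) UTF-8 character begins. If the string has fewer characters
// than 'start', returns the string's byte length. Assumes valid UTF-8.
std::size_t startByteIndexSketch(std::string_view s, int64_t start) {
  int64_t charsSeen = 0;
  for (std::size_t byteIndex = 0; byteIndex < s.size(); ++byteIndex) {
    const auto byte = static_cast<unsigned char>(s[byteIndex]);
    // Continuation bytes have the bit pattern 10xxxxxx; any other byte
    // starts a new character.
    const bool isLeadByte = (byte & 0xC0) != 0x80;
    if (isLeadByte && ++charsSeen == start) {
      return byteIndex;
    }
  }
  return s.size();
}
```

With the three-emoji string from the comment (4 bytes per character), a `start` of 2 maps to byte index 4, which agrees with the example in the reviewed code.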
83de5d7
to
2db439c
Compare
Opened #11549 to track the failure found in Presto fuzzer test.
@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@kagamiori merged this pull request in c286451.
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
A function that returns the position of the first occurrence of a substring in
the given string after the start position.
Doc: https://spark.apache.org/docs/latest/api/sql/index.html#locate
Spark implementation: https://github.com/apache/spark/blob/v3.5.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L1420