Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(function): Add Spark locate function #8863

Closed
wants to merge 1 commit into from

Conversation

rui-mo
Copy link
Collaborator

@rui-mo rui-mo commented Feb 27, 2024

A function that returns the position of the first occurrence of substring in
given string after the start position.

Doc: https://spark.apache.org/docs/latest/api/sql/index.html#locate
Spark implementation: https://github.com/apache/spark/blob/v3.5.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L1420

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 27, 2024
Copy link

netlify bot commented Feb 27, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 5f8e1c2
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/673c53c4c7a95d000853a4ed

@rui-mo rui-mo changed the title Support Spark locate function Add Spark locate function Feb 27, 2024
Copy link

stale bot commented May 27, 2024

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@stale stale bot added the stale label May 27, 2024
@stale stale bot removed the stale label May 28, 2024
@rui-mo rui-mo force-pushed the wip_locate branch 3 times, most recently from aa854a2 to 391b3b3 Compare July 5, 2024 03:01
Copy link

stale bot commented Oct 5, 2024

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@zhli1142015
Copy link
Contributor

@rui-mo , will you continue this PR?

@rui-mo
Copy link
Collaborator Author

rui-mo commented Nov 7, 2024

@zhli1142015 Yes, this PR is ready overall. Would you like to take a review? Thanks.

const arg_type<Varchar>& subString,
const arg_type<Varchar>& string,
const arg_type<int32_t>& start) {
result = stringImpl::stringPosition<true /*isAscii*/>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it's better to call findNthInstanceByteIndexFromStart here, which support start position parameter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stringPosition calls findNthInstanceByteIndexFromStart, and it also handles the edge cases like empty substr, invalid start, so we can avoid duplicate code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point

@rui-mo
Copy link
Collaborator Author

rui-mo commented Nov 8, 2024

Passes the 1-minute spark expression fuzzer test.

./velox/expression/fuzzer/spark_expression_fuzzer_test --seed ${RANDOM} --enable_variadic_signatures --lazy_vector_generation_ratio 0.2 --velox_fuzzer_enable_column_reuse --velox_fuzzer_enable_expression_reuse --max_expression_trees_per_step 2 --retry_with_try --enable_dereference --duration_sec 60 --minloglevel=1 --stderrthreshold=2 --only locate
E1108 18:29:23.069322 3621308 ExpressionFuzzerVerifier.cpp:408] Total iterations: 305566
E1108 18:29:23.069486 3621308 ExpressionFuzzerVerifier.cpp:409] Total failed: 0

velox/functions/sparksql/String.h Outdated Show resolved Hide resolved
}

// The string to search for substring.
auto view = std::string_view(
Copy link
Contributor

@zhli1142015 zhli1142015 Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can pass the startByteIndex to findNthInstanceByteIndexFromStart, instead of creating a new string_view object.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

findNthInstanceByteIndexFromEnd does not accept the startByteIndex parameter, and we also use view at L255.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, stringPosition also supports to serach from end.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's little wired, this function searches from end but with start parameter as value 1.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your point. Updated this PR to make the start the end position when searching from the end. Thanks.

@rui-mo rui-mo force-pushed the wip_locate branch 2 times, most recently from 3ced56f to ae30a78 Compare November 11, 2024 05:57
}

// The string to search for substring.
auto view = std::string_view(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, stringPosition also supports to serach from end.

}

// The string to search for substring.
auto view = std::string_view(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's little wired, this function searches from end but with start parameter as value 1.

@rui-mo rui-mo force-pushed the wip_locate branch 3 times, most recently from b002b11 to 327a582 Compare November 12, 2024 05:28
@rui-mo
Copy link
Collaborator Author

rui-mo commented Nov 13, 2024

Hi @mbasmanova, would you like to take a review? Thanks!

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo Some questions.

after position ``start``. The search is from the beginning of ``string`` to the end.
The given ``start`` and return value are 1-based. The following rules are listed in order of priority:

Returns 0 if ``start`` is NULL. Returns NULL if ``substring`` or ``string`` is NULL.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these rules match Spark? Any particular version or all versions are having the same behavior?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it matches Spark, and we are testing against Spark-3.5.1. The versions before Spark-3.5.1 have the same behavior.

https://github.com/apache/spark/blame/v3.5.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L1434-L1456

velox/functions/lib/string/StringImpl.h Outdated Show resolved Hide resolved
velox/functions/lib/string/StringImpl.h Outdated Show resolved Hide resolved
const T& string,
const T& subString,
int64_t instance = 0,
std::optional<int64_t> startPosition = std::nullopt) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if it makes sense to replace (string, startPosition) arguments with a single std::string_view argument.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. Updated to use std::string_view argument type.

velox/functions/lib/string/StringImpl.h Show resolved Hide resolved
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo Thanks.

// Unicode string "😋😋😋", each character occupies 4 bytes. When 'start' is
// 2, the 'startByteIndex' is 4 which specifies the start of the second
// character.
const char* pos = string->data();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we have a helper function to find the location of the N-th char.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I found the helper function 'stringCore::cappedByteLengthUnicode' and changed to use it. Thanks for the remainder.

@mbasmanova mbasmanova added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Nov 14, 2024
@rui-mo rui-mo force-pushed the wip_locate branch 2 times, most recently from 83de5d7 to 2db439c Compare November 15, 2024 03:04
@rui-mo rui-mo changed the title Add Spark locate function feat(function): Add Spark locate function Nov 15, 2024
@rui-mo
Copy link
Collaborator Author

rui-mo commented Nov 15, 2024

Opened #11549 to track the failure found in Presto fuzzer test.

@facebook-github-bot
Copy link
Contributor

@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@kagamiori merged this pull request in c286451.

Copy link

Conbench analyzed the 1 benchmark run on commit c286451a.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants