Switch the library used in regex function by calling #768

Open
Lewuathe opened this issue May 14, 2019 · 10 comments

Comments

@Lewuathe
Member

Lewuathe commented May 14, 2019

Currently, we support two regex libraries, JONI and RE2J, for the regex functions. The choice is made statically at launch time. But we sometimes want to switch libraries dynamically because their performance characteristics differ. It is desirable to select the library that best fits each use case.

My suggestion is to extend the functions with an additional argument that specifies the library, as follows, so that we can switch libraries per call.

regexp_like(col, '...', 'JONI')
regexp_like(col, '...', 'RE2J')

Or specifying the library as a session parameter might be another option.
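
For illustration only, here is a minimal Python sketch of what per-call selection could look like. It is not Presto code: the `ENGINES` registry and the two matcher functions are hypothetical, and both use Python's `re` module as a stand-in for the real libraries.

```python
import re

# Hypothetical matchers. Both delegate to Python's `re` as a stand-in;
# a real implementation would call into JONI or RE2J here.
def _joni_like(pattern, text):
    return re.search(pattern, text) is not None

def _re2j_like(pattern, text):
    return re.search(pattern, text) is not None

# Hypothetical registry: library name -> matcher function.
ENGINES = {"JONI": _joni_like, "RE2J": _re2j_like}

def regexp_like(text, pattern, engine="JONI"):
    """Return True if `pattern` matches anywhere in `text`, using `engine`."""
    try:
        matcher = ENGINES[engine.upper()]
    except KeyError:
        # Reject unknown names instead of silently falling back to a default.
        raise ValueError(f"unknown regex engine: {engine!r}") from None
    return matcher(pattern, text)
```

With this shape, `regexp_like(title, 'Presto|Hive', 'RE2J')` selects the library for that call only, and a misspelled library name fails loudly rather than picking the default.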

@nurse

nurse commented May 14, 2019

Just FYI, the characteristics derive from their algorithms, as illustrated, for example, at http://lh3lh3.users.sourceforge.net/reb.shtml

@kokosing
Member

Both libraries define the same set of functions, so they cannot be loaded together. It could be solved with #8.

However, maybe we could have something like joni_regexp_like and re2j_regexp_like functions defined. This would pollute the function namespace, and its usage would not be dynamic.

@Lewuathe
Member Author

Lewuathe commented May 14, 2019

However, maybe we could have something like joni_regexp_like and re2j_regexp_like functions defined. This would pollute the function namespace, and its usage would not be dynamic.

For the use cases we are considering, defining separate functions is enough. But as you said, it's not good in terms of the namespace convention.

@martint Do you think it's acceptable to create the aliases joni_regexp_like and re2j_regexp_like? Or could #8 be the solution for this kind of alias problem?

@kokosing
Member

To be exact, joni defines something like regex_like(varchar, regex_type_for_joni) and re2j defines something like regex_like(varchar, regex_type_for_re2j). The user is passing two varchars. We cannot load both because the implicit cast would be ambiguous.
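
A toy resolver in Python can make the ambiguity concrete. This is only a sketch under assumptions: the type names mirror the placeholders above, and the `IMPLICIT_CASTS` table and `resolve` function are hypothetical, not Presto's actual function resolution.

```python
# Implicit casts the toy resolver may apply: source type -> target types.
# The type names are the placeholders from the comment, not real Presto types.
IMPLICIT_CASTS = {
    "varchar": {"regex_type_for_joni", "regex_type_for_re2j"},
}

# Two overloads with the same name, one contributed by each library.
SIGNATURES = [
    ("regex_like", ("varchar", "regex_type_for_joni")),
    ("regex_like", ("varchar", "regex_type_for_re2j")),
]

def coercible(actual, expected):
    """True if `actual` is `expected` or can be implicitly cast to it."""
    return actual == expected or expected in IMPLICIT_CASTS.get(actual, set())

def resolve(name, arg_types):
    """Return the single matching parameter list, or raise LookupError."""
    candidates = [
        params
        for fn, params in SIGNATURES
        if fn == name
        and len(params) == len(arg_types)
        and all(coercible(a, p) for a, p in zip(arg_types, params))
    ]
    if not candidates:
        raise LookupError(f"no function matches {name}{arg_types}")
    if len(candidates) > 1:
        raise LookupError(f"ambiguous call {name}{arg_types}: {candidates}")
    return candidates[0]
```

Here `resolve("regex_like", ("varchar", "varchar"))` raises because both overloads accept the call after one implicit cast, while an argument that already has one of the regex types matches exactly one overload.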

@sopel39
Member

sopel39 commented May 14, 2019

Or specifying the library as a session parameter might be another option.

This is problematic as Presto functions do not have access to system session properties.

However, maybe we could have something like joni_regexp_like and re2j_regexp_like functions defined. This would pollute the function namespace, and its usage would not be dynamic.

That would work.

@Praveen2112
Member

What if we cast the regex to JoniRegexpType or Re2JRegexpType to resolve it? Then we could use the same function name, right?

@martint
Member

martint commented May 14, 2019

In the short term, yes, #8 would be a way around it, either by defining the functions in different namespaces and adding the desired one to the SQL PATH, or by allowing functions to take session properties.
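
As a rough sketch of the namespace idea (in Python, not Presto code), path-based lookup could behave like this. The namespace names `joni` and `re2j`, the `FUNCTIONS` table, and `resolve_on_path` are all hypothetical; how #8 would actually expose this is not specified here.

```python
# Hypothetical catalog: (namespace, function name) -> implementation label.
FUNCTIONS = {
    ("joni", "regexp_like"): "JONI-backed regexp_like",
    ("re2j", "regexp_like"): "RE2J-backed regexp_like",
}

def resolve_on_path(name, path):
    """Return (namespace, implementation) for the first namespace on `path`
    that defines `name`, walking the path in order."""
    for namespace in path:
        impl = FUNCTIONS.get((namespace, name))
        if impl is not None:
            return namespace, impl
    raise LookupError(f"{name} not found on path {path}")
```

With the path set to `["re2j", "joni"]`, an unqualified `regexp_like` would resolve to the RE2J-backed function, and reversing the path would select JONI without changing the query text.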

In the long term, we actually want to get rid of the JONI vs RE2J distinction if we can. We just haven't spent the time to make RE2J faster in every scenario. Adding the two implementation-specific aliases is problematic because it's harder to remove them in the future without breaking compatibility. If we were to expose different regex functions, it'd be based on the type of regex (POSIX, PCRE, etc), not based on their implementation library.

@nurse

nurse commented May 15, 2019

I agree that in the long term it should be specified by type.

Just FYI, the reason we want to use RE2J is to speed up patterns such as "Presto|Hive|Spark|Big ?Query", matched against web page titles to find visitors who are interested in query engines. (The real example has a hundred |s.) JONI is not good at such patterns. (See also the "Backtracking vs. state machines" section of Benchmark of Regex Libraries.)
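
To show the shape of such a pattern, here is a small Python sketch. The keyword list and the `interested_in_query_engines` helper are made up for illustration. Python's `re` is itself a backtracking engine, so this only demonstrates what the pattern matches, not the speed-up expected from a state-machine engine such as RE2J.

```python
import re

# Regex fragments joined into one alternation; the real list is said to
# have about a hundred entries. "Big ?Query" allows an optional space.
KEYWORDS = ["Presto", "Hive", "Spark", "Big ?Query"]
PATTERN = re.compile("|".join(KEYWORDS))

def interested_in_query_engines(title):
    """Return True if the page title mentions any of the keywords."""
    return PATTERN.search(title) is not None

titles = [
    "Getting started with BigQuery",
    "Big Query pricing explained",
    "Apache Hive tutorial",
    "Gardening tips for spring",
]
# Keep only the titles that match the alternation.
matching = [t for t in titles if interested_in_query_engines(t)]
```

A backtracking engine may try the alternatives one by one at each position in the title, which is why a long alternation over long strings can become expensive.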

@ebyhr
Member

ebyhr commented Feb 9, 2024

@Lewuathe @nurse Please note that #20619 is going to remove support for RE2J.

cc: @wendigo

@jinyangli34
Contributor

We recently saw a query pattern that runs extremely slowly with JONI. A query with multiple | alternations on long strings (up to 65536) keeps the entire cluster busy until it reaches the query timeout limit (1h). So this is not only a perf issue but also a stability issue.
After changing to RE2J, it finishes within 20s.
I want to call out that people hitting perf issues with JONI may still need a workaround.


8 participants