[FEA] Add support for Unicode classes in regular expressions #4415

Open
andygrove opened this issue Dec 21, 2021 · 1 comment
Labels
feature request New feature or request

Comments

@andygrove
Contributor

Is your feature request related to a problem? Please describe.

Spark regexp supports a number of Unicode classes that we do not support on GPU (we currently fall back to CPU for any of these because we do not support \p).
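
To illustrate the kind of check that could trigger such a fallback, here is a minimal sketch of detecting a `\p` or `\P` escape in a pattern string. The class and method names are hypothetical and this is not the plugin's actual code; it also deliberately ignores `\Q...\E` quoted sections.

```java
class UnicodePropertyDetector {
    /**
     * Returns true if the regex contains a \p or \P property escape.
     * Hypothetical helper for illustration only; \Q...\E quoting is not handled.
     */
    static boolean usesUnicodeProperty(String regex) {
        for (int i = 0; i < regex.length() - 1; i++) {
            if (regex.charAt(i) == '\\') {
                char next = regex.charAt(i + 1);
                if (next == 'p' || next == 'P') {
                    return true;
                }
                // Skip the escaped character so that an escaped backslash
                // followed by a literal 'p' is not misread as \p.
                i++;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(usesUnicodeProperty("\\p{IsLatin}")); // true
        System.out.println(usesUnicodeProperty("[a-z]+"));       // false
        System.out.println(usesUnicodeProperty("\\\\p"));        // false: literal backslash, then 'p'
    }
}
```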

Classes for Unicode scripts, blocks, categories and binary properties

| Construct | Matches |
| -- | -- |
| \p{IsLatin} | A Latin script character (script) |
| \p{InGreek} | A character in the Greek block (block) |
| \p{Lu} | An uppercase letter (category) |
| \p{IsAlphabetic} | An alphabetic character (binary property) |
| \p{Sc} | A currency symbol |
| \P{InGreek} | Any character except one in the Greek block (negation) |
| [\p{L}&&[^\p{Lu}]] | Any letter except an uppercase letter (subtraction) |
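
For reference, here is a small sketch of how these classes behave on the CPU side with plain `java.util.regex`. The class name and the `found` helper are made up for this example; `find()` is used as an approximation of RLIKE-style "matches anywhere" semantics.

```java
import java.util.regex.Pattern;

class UnicodeClassExamples {
    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("\\p{IsLatin}", "a"));             // script: true
        System.out.println(found("\\p{InGreek}", "\u03B1"));        // block, Greek small alpha: true
        System.out.println(found("\\p{Lu}", "A"));                  // category: true
        System.out.println(found("\\p{Lu}", "a"));                  // false
        System.out.println(found("\\p{IsAlphabetic}", "1"));        // binary property: false
        System.out.println(found("\\p{Sc}", "$"));                  // currency symbol: true
        System.out.println(found("\\P{InGreek}", "\u03B1"));        // negation: false
        System.out.println(found("[\\p{L}&&[^\\p{Lu}]]", "a"));     // subtraction: true
        System.out.println(found("[\\p{L}&&[^\\p{Lu}]]", "A"));     // false
    }
}
```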

Spark example:

scala> spark.sql("SELECT s, s RLIKE '\\\\p{IsLatin}' FROM ss").show
+---+-------------------+
|  s|s RLIKE \p{IsLatin}|
+---+-------------------+
|  a|               true|
|  A|               true|
+---+-------------------+

scala> spark.sql("SELECT s, s RLIKE '\\\\p{IsGreek}' FROM ss").show
+---+-------------------+
|  s|s RLIKE \p{IsGreek}|
+---+-------------------+
|  a|              false|
|  A|              false|
+---+-------------------+

Describe the solution you'd like
Add support for these classes, or document that we do not support them.

Describe alternatives you've considered
None

Additional context
None

@andygrove added the "feature request" and "? - Needs Triage" labels Dec 21, 2021
@sameerz removed the "? - Needs Triage" label Jan 4, 2022
@anthony-chang self-assigned this Jun 13, 2022
@anthony-chang
Contributor

I have been experimenting with transpiling these down to equivalent character classes, and I don't think we can support them with our current approach. The equivalent character class for each of these is extremely long. I would be concerned that complex patterns using these transpiled classes would run out of memory.
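
To give a feel for the size problem, here is a rough sketch that expands a single-character class into explicit code point ranges by brute force over all code points. This is an assumption-laden illustration, not the transpiler's actual approach: `UnicodeClassExpander`, `ranges` and `toCharClass` are hypothetical names, and the `\x{...}` output syntax is Java's, not necessarily what the GPU regex engine accepts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeClassExpander {
    /** Returns the inclusive [start, end] code point ranges matched by a single-character regex. */
    static List<int[]> ranges(String regex) {
        Matcher m = Pattern.compile(regex).matcher("");
        List<int[]> out = new ArrayList<>();
        int start = -1; // start of the range currently being built, or -1 if none
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            boolean hit = m.reset(new String(Character.toChars(cp))).matches();
            if (hit && start < 0) {
                start = cp;
            } else if (!hit && start >= 0) {
                out.add(new int[] {start, cp - 1});
                start = -1;
            }
        }
        if (start >= 0) {
            out.add(new int[] {start, Character.MAX_CODE_POINT});
        }
        return out;
    }

    /** Renders the ranges as one bracket expression using \x{...} escapes. */
    static String toCharClass(List<int[]> ranges) {
        StringBuilder sb = new StringBuilder("[");
        for (int[] r : ranges) {
            sb.append(String.format("\\x{%X}", r[0]));
            if (r[1] > r[0]) {
                sb.append('-').append(String.format("\\x{%X}", r[1]));
            }
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        List<int[]> lu = ranges("\\p{Lu}");
        System.out.println("ranges: " + lu.size());
        System.out.println("expanded length: " + toCharClass(lu).length());
    }
}
```

The exact counts depend on the Unicode version bundled with the JDK, but `\p{Lu}` alone expands to hundreds of ranges, because upper- and lowercase letters alternate in many blocks and each uppercase letter there becomes its own range.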

Also, I discovered cases such as \p{IsGreek} vs \p{InGreek}, which are both valid but match different sets of characters. Java doesn't have any documentation about this...
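
The difference can be reproduced with a short sketch. In `java.util.regex`, the `Is` prefix here resolves to the Greek script and the `In` prefix to the Greek block, so the two sets overlap without being equal. The class name and the `matches` helper are hypothetical; the three code points were picked by hand for illustration.

```java
import java.util.regex.Pattern;

class GreekScriptVsBlock {
    // True if the single code point, taken as a whole string, matches the regex.
    static boolean matches(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        // U+03B1 GREEK SMALL LETTER ALPHA: Greek script, Greek block.
        // U+1F00 GREEK SMALL LETTER ALPHA WITH PSILI: Greek script, Greek Extended block.
        // U+03E2 COPTIC CAPITAL LETTER SHEI: Coptic script, Greek block.
        int[] samples = {0x03B1, 0x1F00, 0x03E2};
        for (int cp : samples) {
            System.out.printf("U+%04X script=%s block=%s IsGreek=%b InGreek=%b%n",
                cp,
                Character.UnicodeScript.of(cp),
                Character.UnicodeBlock.of(cp),
                matches("\\p{IsGreek}", cp),
                matches("\\p{InGreek}", cp));
        }
    }
}
```

A transpiler that expands these classes would therefore need separate tables for scripts and blocks rather than treating the two prefixes as aliases.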

@anthony-chang removed their assignment Jun 14, 2022