[FEA] Add support for Unicode classes in regular expressions #4415

andygrove · 2021-12-21T23:42:19Z

Is your feature request related to a problem? Please describe.

Spark regexp supports a number of Unicode classes that we do not support on GPU (we currently fall back to CPU for any of these because we do not support \p).

Classes for Unicode scripts, blocks, categories and binary properties

| Construct | Matches |
| -- | -- |
| \p{IsLatin} | A Latin script character (script) |
| \p{InGreek} | A character in the Greek block (block) |
| \p{Lu} | An uppercase letter (category) |
| \p{IsAlphabetic} | An alphabetic character (binary property) |
| \p{Sc} | A currency symbol |
| \P{InGreek} | Any character except one in the Greek block (negation) |
| [\p{L}&&[^\p{Lu}]] | Any letter except an uppercase letter (subtraction) |
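
For reference, the CPU behaviour of these constructs can be checked directly against `java.util.regex`, which is what Spark uses on the CPU. This is only a minimal sketch: the object and method names (`UnicodeClassDemo`, `found`) are made up for illustration, and it assumes RLIKE-style "match anywhere in the string" semantics.

```scala
import java.util.regex.Pattern

// Hypothetical helper for poking at the constructs listed above.
object UnicodeClassDemo {
  // True if `pattern` matches anywhere in `input` (find semantics,
  // assumed here to mirror what RLIKE does on the CPU).
  def found(pattern: String, input: String): Boolean =
    Pattern.compile(pattern).matcher(input).find()
}
```

For example, `UnicodeClassDemo.found("\\p{Lu}", "a")` is false while `UnicodeClassDemo.found("\\p{Lu}", "A")` is true, and `UnicodeClassDemo.found("[\\p{L}&&[^\\p{Lu}]]", "A")` is false because the subtraction removes uppercase letters.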

Spark example:

scala> spark.sql("SELECT s, s RLIKE '\\\\p{IsLatin}' FROM ss").show
+---+-------------------+
|  s|s RLIKE \p{IsLatin}|
+---+-------------------+
|  a|               true|
|  A|               true|
+---+-------------------+

scala> spark.sql("SELECT s, s RLIKE '\\\\p{IsGreek}' FROM ss").show
+---+-------------------+
|  s|s RLIKE \p{IsGreek}|
+---+-------------------+
|  a|              false|
|  A|              false|
+---+-------------------+

Describe the solution you'd like
Add support for these classes, or document that we do not support them.

Describe alternatives you've considered
None

Additional context
None

anthony-chang · 2022-06-14T18:18:41Z

I have been experimenting with transpiling these down to equivalent character classes, and I don't think we can support them with our current approach. The equivalent character class for each of these is extremely long. I would be concerned that complex patterns using these transpiled classes would run out of memory.
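
To give a feel for how long such an expansion gets, here is a rough sketch that asks `java.util.regex` which code points a class matches and collapses them into ranges. It is not the plugin's transpiler: `ClassExpansion`, `ranges` and `toCharClass` are hypothetical names, and rendering with Java's `\x{h...h}` escapes is an assumption, since the syntax the GPU regex engine would need may differ.

```scala
import java.util.regex.Pattern

// Hypothetical sketch: expand a single-character class into explicit ranges.
object ClassExpansion {
  // Inclusive code point ranges matched by `pattern`, found by brute force
  // over every code point from 0 to Character.MAX_CODE_POINT.
  def ranges(pattern: String): Vector[(Int, Int)] = {
    val p = Pattern.compile(pattern)
    val out = Vector.newBuilder[(Int, Int)]
    var start = -1 // start of the range currently being built, or -1
    var cp = 0
    while (cp <= Character.MAX_CODE_POINT) {
      val hit = p.matcher(new String(Character.toChars(cp))).matches()
      if (hit && start < 0) start = cp
      if (!hit && start >= 0) { out += ((start, cp - 1)); start = -1 }
      cp += 1
    }
    if (start >= 0) out += ((start, Character.MAX_CODE_POINT))
    out.result()
  }

  // Java-style \x{...} escape for one code point.
  private def esc(cp: Int): String =
    "\\x{" + Integer.toHexString(cp).toUpperCase + "}"

  // Render the ranges as one bracketed character class.
  def toCharClass(rs: Seq[(Int, Int)]): String =
    rs.map { case (lo, hi) =>
      if (lo == hi) esc(lo) else esc(lo) + "-" + esc(hi)
    }.mkString("[", "", "]")
}
```

Something like `ClassExpansion.toCharClass(ClassExpansion.ranges("\\p{IsLatin}")).length` gives the size of the class a pattern would carry for each use of `\p{IsLatin}`. The exact number depends on the Unicode version of the JDK in use, so none is quoted here.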

Also, I discovered cases such as \p{IsGreek} vs \p{InGreek}, which are both valid but match different sets of characters. Java doesn't have any documentation about this...
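
The difference appears to line up with the (script) and (block) annotations in the table in the issue description: `Is` selects the Greek script and `In` selects the Greek block. A small sketch to illustrate, where the object and method names are hypothetical and the three sample code points are my own picks rather than anything from the experiments above:

```scala
import java.lang.Character.{UnicodeBlock, UnicodeScript}
import java.util.regex.Pattern

// Hypothetical sketch comparing the script and block forms.
object GreekScriptVsBlock {
  private val isGreek = Pattern.compile("\\p{IsGreek}") // script
  private val inGreek = Pattern.compile("\\p{InGreek}") // block

  // (matches \p{IsGreek}, matches \p{InGreek}) for a single code point.
  def classify(cp: Int): (Boolean, Boolean) = {
    val s = new String(Character.toChars(cp))
    (isGreek.matcher(s).matches(), inGreek.matcher(s).matches())
  }

  // Script and block as reported by java.lang.Character, for comparison.
  def describe(cp: Int): String =
    s"script=${UnicodeScript.of(cp)}, block=${UnicodeBlock.of(cp)}"
}
```

With this, `classify(0x03B1)` (α) is `(true, true)`, `classify(0x1F00)` (ἀ, in the Greek Extended block) is `(true, false)`, and `classify(0x03E2)` (Ϣ, a Coptic letter that sits in the Greek and Coptic block) is `(false, true)`. Any transpiled class would need to keep the two forms separate.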

andygrove added the feature request and ? - Needs Triage labels Dec 21, 2021

sameerz removed the ? - Needs Triage label Jan 4, 2022

andygrove mentioned this issue Jan 12, 2022

[FEA] Enable regular expressions by default #4509

Open

61 tasks

anthony-chang self-assigned this Jun 13, 2022

anthony-chang removed their assignment Jun 14, 2022
