Is your feature request related to a problem? Please describe.
Spark's regular expressions support a number of Unicode character classes that we do not support on the GPU. We currently fall back to the CPU for all of them because we do not support `\p`.
| Classes for Unicode scripts, blocks, categories and binary properties | |
| --- | --- |
| `\p{IsLatin}` | A Latin script character (script) |
| `\p{InGreek}` | A character in the Greek block (block) |
| `\p{Lu}` | An uppercase letter (category) |
| `\p{IsAlphabetic}` | An alphabetic character (binary property) |
| `\p{Sc}` | A currency symbol |
| `\P{InGreek}` | Any character except one in the Greek block (negation) |
| `[\p{L}&&[^\p{Lu}]]` | Any letter except an uppercase letter (subtraction) |
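For reference, here is a minimal sketch of how these constructs behave in `java.util.regex`, which is the engine Spark uses on the CPU. The class name `UnicodeClassDemo` and the helper `matches` are hypothetical and exist only for this illustration; the sample inputs are my own picks, not taken from a real workload.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: exercises each construct from the table on the CPU.
class UnicodeClassDemo {
    // Returns true when the entire input matches the given regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\p{IsLatin}", "a"));           // script
        System.out.println(matches("\\p{InGreek}", "\u03B1"));      // block (Greek small alpha)
        System.out.println(matches("\\p{Lu}", "A"));                // category
        System.out.println(matches("\\p{IsAlphabetic}", "\u00E9")); // binary property (e with acute)
        System.out.println(matches("\\p{Sc}", "$"));                // currency symbol
        System.out.println(matches("\\P{InGreek}", "a"));           // negation
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "a"));   // subtraction
    }
}
```

Each call above should print `true`; swapping `"a"` for `"A"` in the last line should print `false`, since the subtraction removes uppercase letters.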
I have been experimenting with transpiling these down to equivalent character classes, and I don't think we can support them with our current approach. The equivalent character class for each of these is extremely long, and I am concerned that complex patterns using the transpiled classes would run out of memory.
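To give a rough feel for the size problem, the sketch below expands `\p{Lu}` into an explicit character class by scanning every code point and collapsing matches into ranges. This is not the plugin's transpiler; `TranspiledClassSize`, `rangesForCategory` and `toCharacterClass` are hypothetical names, and the exact counts depend on the Unicode version shipped with the JDK.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: measures how large an expanded \p{Lu} class would be.
class TranspiledClassSize {
    // Collapse all code points of one general category into inclusive [start, end] ranges.
    static List<int[]> rangesForCategory(int category) {
        List<int[]> ranges = new ArrayList<>();
        int start = -1;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            boolean member = Character.getType(cp) == category;
            if (member && start < 0) {
                start = cp;                               // a new range opens here
            } else if (!member && start >= 0) {
                ranges.add(new int[] {start, cp - 1});    // the open range closes
                start = -1;
            }
        }
        if (start >= 0) {
            ranges.add(new int[] {start, Character.MAX_CODE_POINT});
        }
        return ranges;
    }

    // Render the ranges as a Java regex character class using \x{...} escapes.
    static String toCharacterClass(List<int[]> ranges) {
        StringBuilder sb = new StringBuilder("[");
        for (int[] r : ranges) {
            sb.append(String.format("\\x{%X}", r[0]));
            if (r[1] > r[0]) {
                sb.append('-').append(String.format("\\x{%X}", r[1]));
            }
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        List<int[]> ranges = rangesForCategory(Character.UPPERCASE_LETTER);
        String expanded = toCharacterClass(ranges);
        System.out.println("ranges: " + ranges.size());
        System.out.println("pattern length: " + expanded.length());
    }
}
```

Because uppercase and lowercase letters alternate in many blocks, most ranges cover a single code point, so the expanded class grows with the number of letters rather than the number of blocks. A pattern that uses several such classes would repeat that cost each time.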
Also, I discovered cases such as `\p{IsGreek}` vs `\p{InGreek}`, which are both valid but match different sets of characters. Java doesn't have any documentation about this...
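A small sketch of the difference, assuming `Is` resolves to the Greek script and `In` to the Greek block as the table above suggests. The class name `GreekScriptVsBlock` and the three sample code points are my own choices for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: compares the Greek script class with the Greek block class.
class GreekScriptVsBlock {
    static final Pattern SCRIPT = Pattern.compile("\\p{IsGreek}");
    static final Pattern BLOCK = Pattern.compile("\\p{InGreek}");

    // True when the single code point matches the script-based class.
    static boolean inScript(int codePoint) {
        return SCRIPT.matcher(new String(Character.toChars(codePoint))).matches();
    }

    // True when the single code point matches the block-based class.
    static boolean inBlock(int codePoint) {
        return BLOCK.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // U+03B1 GREEK SMALL LETTER ALPHA: script=true, block=true
        System.out.println(inScript(0x03B1) + " " + inBlock(0x03B1));
        // U+1F00 lives in the Greek Extended block: script=true, block=false
        System.out.println(inScript(0x1F00) + " " + inBlock(0x1F00));
        // U+03E2 is a Coptic letter inside the Greek block: script=false, block=true
        System.out.println(inScript(0x03E2) + " " + inBlock(0x03E2));
    }
}
```

So a GPU implementation would need separate tables for scripts and blocks; one cannot be derived from the other.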
Describe the solution you'd like
Add support, or document that we do not support.
Describe alternatives you've considered
None
Additional context
None