matchRegexAll stops parsing when encountering a UTF character that is 0 modulo 256. #181

Open
hugodro opened this issue Jul 10, 2021 · 1 comment

hugodro commented Jul 10, 2021

When parsing a Unicode string that contains '一' or '开', matchRegexAll stops looking for the pattern and terminates.
I did minimal testing and found that, in my samples, every character with a code point of the form 0xNN00 caused the problem, hence the conclusion that any character whose code point is 0 modulo 256 triggers the issue.
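
The 0xNN00 observation can be checked directly on the code points. The following is only a minimal sketch for inspection: `lowByteIsZero` and `codePoint` are hypothetical helper names introduced here, not part of any regex package.

```haskell
import Data.Char (ord, toUpper)
import Numeric (showHex)

-- Hypothetical helper: True when the code point is a multiple of 256,
-- i.e. has the form 0xNN00 (the low byte is zero).
lowByteIsZero :: Char -> Bool
lowByteIsZero c = ord c `mod` 256 == 0

-- Hypothetical helper: render a code point in U+XXXX style for inspection.
codePoint :: Char -> String
codePoint c = "U+" ++ map toUpper (showHex (ord c) "")
```

For the two characters reported above, `codePoint '一'` gives "U+4E00" and `codePoint '开'` gives "U+5F00", and `lowByteIsZero` is True for both.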

Example:

let
    targetPattern = "\\[needle:([^]]*)\\]"
    aString = initString
  in
    case Rgx.matchRegexAll (Rgx.mkRegex targetPattern) aString of
      Nothing -> []
      Just (before, needle, after, values) -> etc...

This fails if initString = "一杯奢华威士忌。[needle:some text]" or "开拓的精神进。[needle:more text]", but it works when the aString binding in the let block is replaced with the following (DC is Data.Char):

    aString = map (\c -> if mod (DC.ord c) 256 == 0 then ' ' else c) initString
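
The workaround can also be written as a small standalone function applied to the input before matching. This is a sketch under the same assumption as the one-liner, namely that replacing the offending characters with a space is acceptable; `maskLowByteZero` is a hypothetical name, not a library function.

```haskell
import Data.Char (ord)

-- Hypothetical helper: replace every character whose code point is a
-- multiple of 256 with a space, leaving all other characters untouched.
-- The string length is preserved, so positions in the text do not shift.
maskLowByteZero :: String -> String
maskLowByteZero = map mask
  where
    mask c
      | ord c `mod` 256 == 0 = ' '
      | otherwise            = c
```

The result would then be passed as aString to Rgx.matchRegexAll. Note that the before and after parts returned by the match will contain the substituted spaces rather than the original characters.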

Encountered with regex-base 0.94.0.1, regex-compat 0.95.2.1, regex-posix 0.96.0.0 installed, on ghc version 8.6.5.

cdornan (Contributor) commented Jul 12, 2021

Thanks very much for this @hugodro — planning a release next week with a fix for this.
