incorrect matching of strings that contain unicode #3

mcmtroffaes · 2014-04-21T11:11:19Z

import Text.Regex.PCRE

main :: IO ()
main = do
  putStrLn ("R X" =~ "[^ ]+")
  putStrLn ("ℝ X" =~ "[^ ]+")

returns

R
ℝ X

The first is correct, but the second is wrong: it should return only ℝ.

I tried fixing this, and I think (but cannot confirm) that SUPPORT_UCP, SUPPORT_UTF, and perhaps also SUPPORT_PCRE8 need to be defined in pcre/config.h to ensure that PCRE is compiled with Unicode support. (Unfortunately, Haskell complained about an unknown symbol when I tried compiling with these options, and I failed to track down the exact cause of this problem.)

mcmtroffaes · 2014-04-22T10:33:37Z

I have a patch that enables UTF support in the build and fixes the unknown symbol error; I will submit a pull request in a moment. Unfortunately, that patch is not enough to fix the above bug...

mcmtroffaes · 2014-04-22T13:34:06Z

... another follow-up: adding compUTF8 to the default flags does not fix the problem either (although that should probably be done anyway).

I've now also checked that this is not a bug in PCRE (at least not in my system PCRE, which is version 8.32). For the sake of completeness, below is the test code. I got the byte sequence for ℝ by running python3 -c 'print("ℝ".encode())'.

#include <pcre.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[]) {
  pcre *reCompiled;
  int pcreExecRet;
  int subStrVec[30];
  const char *pcreErrorStr;
  int pcreErrorOffset;
  char *aStrRegex;
  char **aLineToMatch;
  const char *psubStrMatchStr;
  int j;
  char *testStrings[] = { "R X",
                          "\xe2\x84\x9d X",
                          NULL};


  aStrRegex = "[^ ]+";
  reCompiled = pcre_compile(aStrRegex, PCRE_UTF8, &pcreErrorStr, &pcreErrorOffset, NULL);
  if(reCompiled == NULL) {
    printf("ERROR: Could not compile '%s': %s\n", aStrRegex, pcreErrorStr);
    exit(1);
  }

  for(aLineToMatch=testStrings; *aLineToMatch != NULL; aLineToMatch++) {
    printf("String: %s\n", *aLineToMatch);
    pcreExecRet = pcre_exec(reCompiled,
                            NULL,
                            *aLineToMatch,
                            strlen(*aLineToMatch),  // length of string
                            0,                      // Start index
                            0,                      // options
                            subStrVec,
                            30);                    // length of subStrVec

    if(pcreExecRet < 0) { // Something bad happened..
      switch(pcreExecRet) {
      case PCRE_ERROR_NOMATCH      : printf("String did not match the pattern\n");        break;
      default                      : printf("Unknown error\n");
          break;
      }
    } else {
      printf("Result: We have a match!\n");
      for(j=0; j<pcreExecRet; j++) {
        pcre_get_substring(*aLineToMatch, subStrVec, pcreExecRet, j, &(psubStrMatchStr));
        printf("Match(%2d/%2d): (%2d,%2d): '%s'\n", j, pcreExecRet-1, subStrVec[j*2], subStrVec[j*2+1], psubStrMatchStr);
        pcre_free_substring(psubStrMatchStr);  // free each substring once it has been printed
      }
    }
    printf("\n");
  }

  pcre_free(reCompiled);
  return 0;
}

audreyt · 2014-04-22T20:30:12Z

Thanks for your continued investigation, and sorry for the late responses — I was busy with https://0sdc.tw/en the past few weeks. Will be happy to release once you feel it's at a good point, and perhaps we can also feed the changes upstream to regex-pcre.

mcmtroffaes · 2014-04-22T21:40:37Z

You're welcome - I did not think your response was in any way late. :-) Also thanks for the quick merge of my pull request.

I've tracked this down a little bit further still.

Another possible cause for the bug is that the Foreign.C.String methods used in String.hs (peekCString and withCString) are locale dependent. To fix this, we could marshal between ByteString and CString instead, via packCString and useAsCString (from Data.ByteString), and do the UTF-8 encoding/decoding explicitly, via fromString and toString (from Data.ByteString.UTF8). Do you agree that this is a good approach?
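A minimal sketch of the marshalling idea, assuming the bytestring and utf8-string packages are available. The helper name roundTripUtf8 is hypothetical and only illustrates the round trip through a CString; it is not code from the wrapper.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.UTF8 as U  -- from the utf8-string package

-- Hypothetical helper: hand a String to C as UTF-8 bytes and read it back,
-- without going through the locale-dependent Foreign.C.String functions.
roundTripUtf8 :: String -> IO String
roundTripUtf8 s =
  B.useAsCString (U.fromString s) $ \cstr -> do
    -- In the real wrapper, cstr is what would be passed to PCRE.
    bytes <- B.packCString cstr
    return (U.toString bytes)

main :: IO ()
main = do
  s <- roundTripUtf8 "ℝ X"
  -- The string survives the round trip regardless of the current locale.
  print (s == "ℝ X")
  -- ℝ occupies three bytes in UTF-8, so the encoded subject is five bytes long.
  print (B.length (U.fromString "ℝ X"))
```

Note that useAsCString stops at the first NUL byte on the C side, so subjects containing NUL would need the length-aware variants instead.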

Would it be better if I tried to work with upstream directly? (I couldn't find their repository...)

mcmtroffaes · 2014-05-03T09:22:03Z

I think I nailed the cause of the bug. PCRE returns offsets of matched substrings into the UTF-8-encoded byte string. The wrapper implicitly assumes that string offsets (counting characters) are the same as byte offsets (counting bytes); this only works if there are no multibyte UTF-8 characters.

The solution is to translate byte offsets back to string offsets. The decode function documented here may provide a way to achieve this.
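The translation from byte offsets to character offsets can be sketched as follows, using only base. The names utf8Width and byteToCharOffset are hypothetical and are not part of the wrapper or of any library; the sketch also assumes the byte offset falls on a character boundary, which is the case for offsets PCRE reports in UTF-8 mode.

```haskell
import Data.Char (ord)

-- Number of bytes a character occupies when encoded as UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Hypothetical helper: translate an offset into the UTF-8 encoding of a
-- String into the corresponding offset in characters.
byteToCharOffset :: String -> Int -> Int
byteToCharOffset str byteOff = go 0 0 str
  where
    go chars bytes (c:cs)
      | bytes < byteOff = go (chars + 1) (bytes + utf8Width c) cs
    go chars _ _ = chars

main :: IO ()
main = do
  let subject = "ℝ X"
      -- Byte offsets of the match of [^ ]+ in the UTF-8 encoded subject.
      (startB, endB) = (0, 3)
      startC = byteToCharOffset subject startB
      endC   = byteToCharOffset subject endB
  -- Treating the byte offsets as character offsets reproduces the bug.
  print (take (endB - startB) (drop startB subject) == "ℝ X")
  -- After translation, only the first character is selected.
  print (take (endC - startC) (drop startC subject) == "ℝ")
```

Walking the string once per offset is linear in the subject length; a wrapper that returns many matches would probably want to translate all offsets in a single pass.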

cdornan · 2017-06-08T14:28:24Z

@mcmtroffaes @audreyt I had separately come to the same conclusion — see iconnect/regex#141 and the correction code.

cdornan · 2017-06-10T12:25:28Z

@audreyt to get this all working properly I am rebuilding regex-pcre/regex-pcre-builtin within regex. Please don't hesitate to drop me a line if you are interested in any aspect of this.

chekoopa · 2019-07-17T06:58:57Z

A little bump about the capturedText function.
It doesn't take the offsets and lengths into account (which are indeed correct thanks to the fix, double-checked that), so you still have to extract the captured text manually (T.take len $ T.drop offset $ fullText). Shot myself in the foot a few times because of that.
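A minimal sketch of the manual extraction, assuming the offset and length are already counted in characters. The helper name sliceText and the values in main are hypothetical and chosen only for illustration.

```haskell
import qualified Data.Text as T

-- Hypothetical helper: the substring of the given length that starts at
-- the given character offset.
sliceText :: Int -> Int -> T.Text -> T.Text
sliceText offset len = T.take len . T.drop offset

main :: IO ()
main = do
  let fullText = T.pack "@býci#"
  -- Skip one character, then keep the next four.
  print (sliceText 1 4 fullText == T.pack "býci")
```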

*doing a dice roll for trying to fix it on one's own*

UPD: a still-relevant issue thread, but the wrong repo
UPD2: yes, a new issue

mnn · 2019-12-05T13:12:43Z

I am guessing I hit this issue:

> "@býci#" & (?=~ [reBI|býci|plemenice|]) & matchedText <&> T.unpack & fromMaybe "???" & putStrLn
býci#

It incorrectly captured the # symbol. Thinking about it, I am not sure ý is actually UTF-8 😶. Probably?

Edit: Tested on Linux, library version is 1.0.2.0 (from Stack).

mcmtroffaes mentioned this issue Apr 22, 2014

Enable utf build. #4

Merged

mcmtroffaes mentioned this issue May 10, 2014

Add unicode support. jgm/highlighting-kate#42

Closed

cdornan mentioned this issue Jun 8, 2017

Fix PCRE with UTF-8 data on Windows iconnect/regex#145

Open

cdornan mentioned this issue Jun 13, 2017

Seq Char not working with Unicode #6

Open

ubavic mentioned this issue Dec 27, 2020

Incorrect matching of Unicode strings haskell-hvr/regex-pcre#3

Open
