incorrect matching of strings that contain unicode #3

Open
mcmtroffaes opened this issue Apr 21, 2014 · 9 comments

Comments

@mcmtroffaes
Contributor

import Text.Regex.PCRE

main :: IO ()
main = do
  putStrLn ("R X" =~ "[^ ]+")
  putStrLn ("ℝ X" =~ "[^ ]+")

returns

R
ℝ X

The first is correct, but the second is wrong: it should return only ℝ.

I tried fixing this, and I think (but cannot confirm) that SUPPORT_UCP, SUPPORT_UTF, and perhaps also SUPPORT_PCRE8 need to be defined in pcre/config.h to ensure that PCRE is compiled with Unicode support. (Unfortunately, Haskell complained about an unknown symbol when I tried compiling with these options, and I failed to track down the exact cause of this problem.)

@mcmtroffaes
Contributor Author

I have a patch that enables UTF support in the build and fixes the unknown symbol error; I will submit a pull request in a moment. Unfortunately, that patch is not enough to fix the above bug...

@mcmtroffaes
Contributor Author

... another follow-up: adding compUTF8 to the default flags does not fix the problem either (although it should probably be added anyway).

I've now also checked that this is not a bug in PCRE (at least not my system PCRE, which has version 8.32). For the sake of completeness, below is the test code. I got the byte sequence for ℝ by running python3 -c 'print("ℝ".encode())'.

#include <pcre.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[]) {
  pcre *reCompiled;
  int pcreExecRet;
  int subStrVec[30];
  const char *pcreErrorStr;
  int pcreErrorOffset;
  char *aStrRegex;
  char **aLineToMatch;
  const char *psubStrMatchStr;
  int j;
  char *testStrings[] = { "R X",
                          "\xe2\x84\x9d X",
                          NULL};


  aStrRegex = "[^ ]+";
  reCompiled = pcre_compile(aStrRegex, PCRE_UTF8, &pcreErrorStr, &pcreErrorOffset, NULL);
  if(reCompiled == NULL) {
    printf("ERROR: Could not compile '%s': %s\n", aStrRegex, pcreErrorStr);
    exit(1);
  }

  for(aLineToMatch=testStrings; *aLineToMatch != NULL; aLineToMatch++) {
    printf("String: %s\n", *aLineToMatch);
    pcreExecRet = pcre_exec(reCompiled,
                            NULL,
                            *aLineToMatch,
                            strlen(*aLineToMatch),  // length of string
                            0,                      // Start index
                            0,                      // options
                            subStrVec,
                            30);                    // length of subStrVec

    if(pcreExecRet < 0) { // Something bad happened..
      switch(pcreExecRet) {
      case PCRE_ERROR_NOMATCH      : printf("String did not match the pattern\n");        break;
      default                      : printf("Unknown error\n");
          break;
      }
    } else {
      printf("Result: We have a match!\n");
      for(j=0; j<pcreExecRet; j++) {
        pcre_get_substring(*aLineToMatch, subStrVec, pcreExecRet, j, &(psubStrMatchStr));
        printf("Match(%2d/%2d): (%2d,%2d): '%s'\n", j, pcreExecRet-1, subStrVec[j*2], subStrVec[j*2+1], psubStrMatchStr);
        pcre_free_substring(psubStrMatchStr);  // free each substring once it has been printed
      }
    }
    printf("\n");
  }

  pcre_free(reCompiled);
  return 0;
}

@audreyt
Owner

audreyt commented Apr 22, 2014

Thanks for your continued investigation, and sorry for the late responses — I was busy with https://0sdc.tw/en the past few weeks. Will be happy to release once you feel it's at a good point, and perhaps we can also feed the changes upstream to regex-pcre.

@mcmtroffaes
Contributor Author

You're welcome - I did not think your response was in any way late. :-) Also thanks for the quick merge of my pull request.

I've tracked this down a little bit further still.

A next possible cause for the bug is that the Foreign.C.String methods used in String.hs (peekCString and withCString) are locale dependent. To fix this, we could marshal instead between ByteString and CString, via packCString and useAsCString (from Data.ByteString), and do the utf8 encoding/decoding explicitly, via fromString and toString (from Data.ByteString.UTF8). Do you agree that this is a good approach?
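A minimal sketch of the marshalling proposed above, assuming the utf8-string package is available to provide Data.ByteString.UTF8. The helper name roundTrip is hypothetical and not part of String.hs; it only illustrates that the bytes handed to C are UTF-8 regardless of the current locale.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.UTF8 as U  -- assumption: utf8-string package

-- Hypothetical helper: encode a String as UTF-8, pass it through a CString,
-- and decode the bytes read back from C. No locale is consulted.
roundTrip :: String -> IO String
roundTrip s =
  B.useAsCString (U.fromString s) $ \cstr -> do
    bytes <- B.packCString cstr      -- copies up to the terminating NUL
    return (U.toString bytes)

main :: IO ()
main = do
  let s = "ℝ X"
  print (length s)                   -- 3 characters
  print (B.length (U.fromString s))  -- 5 bytes: ℝ occupies three
  back <- roundTrip s
  print (back == s)                  -- True
```

Because useAsCString and packCString work with NUL-terminated strings, a subject containing an embedded NUL would be truncated in this sketch.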

Would it be better if I tried to work with upstream directly? (I couldn't find their repository...)

@mcmtroffaes
Contributor Author

I think I nailed the cause of the bug. PCRE returns offsets of matched substrings into the UTF-8 encoded byte string. The wrapper implicitly assumes that string offsets (counting characters) are the same as byte offsets (counting bytes); this only works if there are no multibyte UTF-8 characters.

The solution is to translate byte offsets back to string offsets. The decode function documented here may provide a way to achieve this.
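One way to sketch that translation, using only base rather than the decode function mentioned above. The names utf8Width, byteToCharOffset and sliceByBytes are hypothetical, and the sketch assumes the subject was passed to PCRE as UTF-8 and that the reported offsets fall on character boundaries.

```haskell
import Data.Char (ord)

-- Number of bytes a character occupies in UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Translate an offset into the UTF-8 bytes of a String into an offset
-- counted in characters of that String.
byteToCharOffset :: String -> Int -> Int
byteToCharOffset str byteOff = go 0 0 str
  where
    go chars bytes rest
      | bytes >= byteOff = chars
      | otherwise =
          case rest of
            []       -> chars
            (c : cs) -> go (chars + 1) (bytes + utf8Width c) cs

-- Extract a match from PCRE-style (start, end) byte offsets.
sliceByBytes :: String -> (Int, Int) -> String
sliceByBytes str (startB, endB) = take (endC - startC) (drop startC str)
  where
    startC = byteToCharOffset str startB
    endC   = byteToCharOffset str endB

main :: IO ()
main = do
  let subject = "ℝ X"
  -- In UTF-8 mode, [^ ]+ matches bytes 0..3 of this subject.
  print (sliceByBytes subject (0, 3) == "ℝ")  -- True
  -- Treating the byte length 3 as a character count reproduces the bug.
  print (take 3 subject == "ℝ X")             -- True
```

Walking the string once per offset is linear in the subject length; a real fix would probably convert all offsets of a match in a single pass.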

@cdornan

cdornan commented Jun 8, 2017

@mcmtroffaes @audreyt I had separately come to the same conclusion — see iconnect/regex#141 and the correction code.

@cdornan

cdornan commented Jun 10, 2017

@audreyt to get this all working properly I am rebuilding regex-pcre/regex-pcre-builtin within regex. Please don't hesitate to drop me a line if you are interested in any aspect of this.

@chekoopa

chekoopa commented Jul 17, 2019

A little bump about the capturedText function.
It doesn't take offsets and lengths into account (those are indeed correct thanks to the fix; I double-checked that), so you still have to extract the captured text manually (T.take len $ T.drop offset $ fullText). Shot myself in the foot a few times because of that.
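A small sketch of that manual extraction with Data.Text, assuming the offset and length are counted in characters, as reported above. The name sliceCapture and the sample offset and length are hypothetical and not part of the library.

```haskell
import qualified Data.Text as T

-- Hypothetical helper: cut a capture out of the full text, given its
-- character offset and character length.
sliceCapture :: Int -> Int -> T.Text -> T.Text
sliceCapture offset len = T.take len . T.drop offset

main :: IO ()
main = do
  let fullText = T.pack "@býci#"
  -- Skip one character, then keep the next four.
  print (sliceCapture 1 4 fullText == T.pack "býci")  -- True
```

The order matters: dropping the offset first and then taking the length selects the capture, whereas taking twice only ever returns a prefix of the text.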

*doing a dice roll for trying to fix it on one's own*

UPD: a yet relevant issue thread, but wrong repo
UPD2: yes, a new issue

@mnn

mnn commented Dec 5, 2019

I am guessing I hit this issue:

> "@býci#" & (?=~ [reBI|býci|plemenice|]) & matchedText <&> T.unpack & fromMaybe "???" & putStrLn
býci#

It incorrectly captured the # symbol. Thinking about it, I am not sure ý is actually UTF8 😶. Probably?

Edit: Tested on Linux, library version is 1.0.2.0 (from Stack).
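A quick check, using only base, that this output is consistent with the byte-offset explanation earlier in the thread. ý is U+00FD, which UTF-8 encodes in two bytes, so the match býci is five bytes long. The helper utf8Width is hypothetical, and the offsets below are worked out by hand rather than taken from the library.

```haskell
import Data.Char (ord)

-- Number of bytes a character occupies in UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

main :: IO ()
main = do
  let subject = "@býci#"
      match   = "býci"
      byteLen = sum (map utf8Width match)
  print (utf8Width 'ý')  -- 2
  print byteLen          -- 5
  -- Byte offset 1 and byte length 5, misread as character counts,
  -- give exactly the output reported above.
  print (take byteLen (drop 1 subject) == "býci#")  -- True
```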
