incorrect matching of strings that contain unicode #3
Comments
I have a patch enabling UTF support in the build that fixes the unknown symbol error; I will submit a pull request in a moment. Unfortunately, that patch is not enough to fix the above bug...
... another follow-up: adding compUTF8 to the default flags does not fix the problem either (though this should probably happen anyway). I have now also checked that this is not a bug in PCRE itself (at least not in my system PCRE, which has version 8.32). For the sake of completeness, below is the test code. I got the byte sequence for ℝ by running

#include <pcre.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    pcre *reCompiled;
    int pcreExecRet;
    int subStrVec[30];
    const char *pcreErrorStr;
    int pcreErrorOffset;
    const char *aStrRegex;
    const char **aLineToMatch;
    const char *psubStrMatchStr;
    int j;
    const char *testStrings[] = { "R X",
                                  "\xe2\x84\x9d X",  // UTF-8 bytes of ℝ, then " X"
                                  NULL };

    aStrRegex = "[^ ]+";
    reCompiled = pcre_compile(aStrRegex, PCRE_UTF8, &pcreErrorStr, &pcreErrorOffset, NULL);
    if (reCompiled == NULL) {
        printf("ERROR: Could not compile '%s': %s\n", aStrRegex, pcreErrorStr);
        exit(1);
    }

    for (aLineToMatch = testStrings; *aLineToMatch != NULL; aLineToMatch++) {
        printf("String: %s\n", *aLineToMatch);
        pcreExecRet = pcre_exec(reCompiled,
                                NULL,
                                *aLineToMatch,
                                (int)strlen(*aLineToMatch),  // length of string in bytes
                                0,                           // start index
                                0,                           // options
                                subStrVec,
                                30);                         // length of subStrVec
        if (pcreExecRet < 0) {  // something bad happened
            switch (pcreExecRet) {
            case PCRE_ERROR_NOMATCH:
                printf("String did not match the pattern\n");
                break;
            default:
                printf("Unknown error\n");
                break;
            }
        } else {
            printf("Result: We have a match!\n");
            for (j = 0; j < pcreExecRet; j++) {
                pcre_get_substring(*aLineToMatch, subStrVec, pcreExecRet, j, &psubStrMatchStr);
                // subStrVec holds byte offsets into the subject string
                printf("Match(%2d/%2d): (%2d,%2d): '%s'\n", j, pcreExecRet - 1, subStrVec[j*2], subStrVec[j*2+1], psubStrMatchStr);
                // free each substring inside the loop so none is leaked
                pcre_free_substring(psubStrMatchStr);
            }
        }
        printf("\n");
    }

    pcre_free(reCompiled);
    return 0;
}
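For readers wondering where the bytes `\xe2\x84\x9d` in the test string come from: they are the UTF-8 encoding of ℝ (U+211D). The following is only a minimal sketch of the standard UTF-8 encoding rule, not the command used in the comment above; `utf8_encode` is a hypothetical helper name and is not part of PCRE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of PCRE): encode one Unicode code point
 * as UTF-8. Writes up to 4 bytes into out and returns the number of bytes
 * written, or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
    assert(out != NULL);
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;                     /* surrogates are not encodable */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling `utf8_encode(0x211D, buf)` fills `buf` with the three bytes E2 84 9D, which is why ℝ occupies three bytes in the subject string while counting as a single character.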
Thanks for your continued investigation, and sorry for the late responses; I was busy with https://0sdc.tw/en the past few weeks. Will be happy to release once you feel it's at a good point, and perhaps we can also feed the changes upstream to regex-pcre.
You're welcome - I did not think your response was in any way late. :-) Also thanks for the quick merge of my pull request.

I've tracked this down a little further still. Another possible cause of the bug is that the Foreign.C.String functions used in String.hs (peekCString and withCString) are locale dependent. To fix this, we could instead marshal between ByteString and CString, via packCString and useAsCString (from Data.ByteString), and do the UTF-8 encoding/decoding explicitly, via fromString and toString (from Data.ByteString.UTF8).

Do you agree that this is a good approach? Would it be better if I tried to work with upstream directly? (I couldn't find their repository...)
I think I nailed the cause of the bug. PCRE returns the offsets of matched substrings as byte offsets into the UTF-8 encoded byte string. The wrapper implicitly assumes that string offsets (counting characters) are the same as byte offsets (counting bytes); this only holds if the string contains no multibyte UTF-8 characters. The solution is to translate byte offsets back to string offsets. The decode function documented here may provide a way to achieve this.
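To illustrate the translation described above, here is a minimal sketch in C, not the fix that went into the Haskell wrapper. It assumes the subject is valid UTF-8 and that the byte offset falls on a character boundary, as PCRE's offsets do in UTF-8 mode; `utf8_char_offset` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert a byte offset into valid UTF-8 text to a
 * character (code point) offset. Every code point starts with exactly one
 * byte that is NOT of the form 10xxxxxx, so counting those bytes in the
 * prefix s[0..byte_off) gives the number of characters before the offset.
 * Assumes byte_off does not exceed the length of s in bytes. */
size_t utf8_char_offset(const char *s, size_t byte_off) {
    size_t chars = 0;
    assert(s != NULL);
    for (size_t i = 0; i < byte_off; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            chars++;  /* lead byte or ASCII byte: a new character starts here */
        }
    }
    return chars;
}
```

For the test string `"\xe2\x84\x9d X"`, a match covering the byte range (0,3) maps to the character range (0,1), i.e. just ℝ; for `"R X"` the two kinds of offset coincide.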
@mcmtroffaes @audreyt I had separately come to the same conclusion; see iconnect/regex#141 and the correction code.
A little bump, since trying to fix this on one's own feels like *rolling dice*. UPD: there is a related issue thread that is still relevant, but it is in the wrong repo.
I am guessing I hit this issue:

> "@býci#" & (?=~ [reBI|býci|plemenice|]) & matchedText <&> T.unpack & fromMaybe "???" & putStrLn
býci#

The match also captured the trailing # symbol. Edit: Tested on Linux, library version is 1.0.2.0 (from Stack).
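The extra # is consistent with a byte range being reused as a character range. Assuming PCRE reported the match "býci" as bytes (1,6) of the subject (ý takes two bytes in UTF-8), slicing characters 1 to 6 instead of bytes 1 to 6 overshoots by one character. The sketch below reproduces that arithmetic in C; `utf8_slice_chars` is a hypothetical helper that mimics character-indexed slicing, not code from the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy the characters with indices [start, end) of the
 * valid UTF-8 string s into out and NUL-terminate it. Indices count code
 * points, the way a character-based string type would. Returns the number
 * of bytes copied; output is truncated if out_size is too small. */
size_t utf8_slice_chars(const char *s, size_t start, size_t end,
                        char *out, size_t out_size) {
    size_t seen = 0;  /* characters started so far, including the current one */
    size_t n = 0;     /* bytes written to out */
    assert(s != NULL && out != NULL && out_size > 0);
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            seen++;   /* not a continuation byte: a new character begins */
        }
        size_t ch = seen - 1;  /* index of the character that byte i belongs to */
        if (ch >= start && ch < end && n + 1 < out_size) {
            out[n++] = s[i];
        }
    }
    out[n] = '\0';
    return n;
}
```

With the subject `"@b\xc3\xbd" "ci#"` (the literal is split so that `\xbd` is not merged with the following `c`), slicing characters (1,6) yields the 6 bytes of "býci#", matching the reported output, while the corrected character range (1,5) yields the 5 bytes of "býci".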
returns
The first is correct, but the second is wrong; it should only return ℝ.

I tried fixing this, and I think (but cannot confirm) that SUPPORT_UCP, SUPPORT_UTF, and perhaps also SUPPORT_PCRE8 need to be defined in pcre/config.h to ensure that PCRE is compiled with Unicode support. (Unfortunately, Haskell complained about an unknown symbol when I tried compiling with these options, and I failed to track down the exact cause of this problem.)