Have you measured the performance impact?
master branch:
feature/unicode-support branch:
This is not a full statistical benchmark analysis, but it does show that for this particular file the impact is hardly noticeable. Actually, I had expected it to be larger. As an aside, not relevant for this pull request: version 0.5.6 runs much faster than the current master:
So it looks like there has been a recent performance regression. (I stumbled on this by chance when running the test with my distribution's Highlight, which was at 0.5.6.)
It's reassuring that the Unicode support doesn't slow things down that much. As for the other performance regression you uncovered, that's odd.
Indeed the culprit appears to be commit And looking into this further made me realize that something which my
Messing around with this some more, I find that
to compileRegex slows things down by about a factor of 2! (I added some code to cache the compiled regexes, but because Maybe this problem would go away if we compiled with optimizations.
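For readers unfamiliar with the idea, caching compiled regexes can be sketched roughly as follows. This is only an illustration, not the code that was actually added: `mkCachedCompiler` is a hypothetical helper, and the `compile` argument stands in for whatever function builds a regex from its source pattern.

```haskell
import Data.IORef (newIORef, readIORef, modifyIORef')
import qualified Data.Map.Strict as M

-- | Wrap an expensive compile action with a cache keyed on the pattern.
-- 'compile' is a placeholder (an assumption for this sketch); any
-- function from a pattern to a compiled value will do.
mkCachedCompiler :: Ord k => (k -> IO v) -> IO (k -> IO v)
mkCachedCompiler compile = do
  ref <- newIORef M.empty
  return $ \pat -> do
    cache <- readIORef ref
    case M.lookup pat cache of
      -- Already compiled: reuse the stored value.
      Just rx -> return rx
      -- First use: compile once and remember the result.
      Nothing -> do
        rx <- compile pat
        modifyIORef' ref (M.insert pat rx)
        return rx
```

A cache of this kind only pays off when the same pattern, with the same compile options, is requested repeatedly.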
This patch includes everything from the patch by mcmtroffaes in #42 except the compUTF8 option itself, which is commented out pending release of a version of pcre-regex-builtin that supports it. When a supporting version is released, we can remove the comment here, conditionally on the version of pcre-regex-builtin. See #42.
The performance issue is handled by 2240f6f.
Awesome, thanks a lot for your efforts!
The changes on the regex-pcre-builtin side have been released to Hackage as of 0.94.4.8.8.35.
The PCRE String backend does not properly support Unicode. The ByteString backend does, provided that PCRE is built with UTF-8 support. UTF-8 support is not yet released in regex-pcre-builtin, but it has been merged: see audreyt/regex-pcre-builtin#3 and audreyt/regex-pcre-builtin#4 for further background.
This patch enables the use of the ByteString backend and converts properly between UTF-8 encoded ByteStrings and Strings.
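To illustrate the kind of conversion involved, here is a minimal sketch using the `text` and `bytestring` packages. It is not the code from this patch, which may use a different library; `toUtf8` and `fromUtf8` are hypothetical names chosen for the example.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (lenientDecode)

-- | Encode a String as UTF-8 bytes, the form a byte-oriented
-- regex backend operates on.
toUtf8 :: String -> B.ByteString
toUtf8 = TE.encodeUtf8 . T.pack

-- | Decode UTF-8 bytes back into a String. Invalid byte sequences
-- are replaced with U+FFFD instead of raising an exception.
fromUtf8 :: B.ByteString -> String
fromUtf8 = T.unpack . TE.decodeUtf8With lenientDecode
```

Note that byte lengths and character counts differ for non-ASCII input: a String of four characters such as `"\955\8594 x"` (a lambda, an arrow, a space and an `x`) encodes to seven bytes, so any match offsets reported in bytes must be mapped back through the decoding step rather than used directly as character indices.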
Proper Unicode support is needed at least for Agda syntax highlighting (see https://git.reviewboard.kde.org/r/117167/), and other syntax files (e.g. Haskell) may rely on it as well.