Skip to content
This repository has been archived by the owner on Jul 31, 2020. It is now read-only.

Add unicode support. #42

Closed
wants to merge 2 commits into from
Closed

Add unicode support. #42

wants to merge 2 commits into from

Conversation

mcmtroffaes
Copy link
Contributor

The PCRE String backend does not properly support unicode, however the ByteString backend does, provided that PCRE is built with UTF8 support. UTF8 support is not yet released in regex-pcre-builtin but the support has been merged: see audreyt/regex-pcre-builtin#3 and audreyt/regex-pcre-builtin#4 for further background.

This patch enables the use of the ByteString backend and properly converts between UTF8 ByteStrings and Strings.

Proper unicode support is needed at least for Agda syntax highlighting (see https://git.reviewboard.kde.org/r/117167/), but there may be more syntax files that might rely on this (e.g. Haskell).

@jgm
Copy link
Owner

jgm commented May 10, 2014

Have you measured performance impact?

@mcmtroffaes
Copy link
Contributor Author

$ time ./dist/build/Highlight/Highlight -s xml xml/scala.xml > /dev/null

master branch:

real    0m1.142s
user    0m1.130s
sys     0m0.007s

feature/unicode-support branch:

real    0m1.155s
user    0m1.132s
sys     0m0.016s

This is no full statistical benchmark analysis but it does show that for this particular file, the impact is hardly noticable. Actually I had expected it to be larger.

Aside, not relevant for this pull request, but version 0.5.6 works much faster than the current master:

real    0m0.312s
user    0m0.286s
sys     0m0.024s

So it looks like there has been a recent performance regression. (I stumbled on this by coincidence by running the test on my distribution's Highlight, which was at 0.5.6.)

@jgm
Copy link
Owner

jgm commented May 12, 2014

It's reassuring that the unicode support doesn't slow things down that much.

As for the other performance regression you uncovered, that's odd.
I've determined that it comes in between 0.5.6.1 (which is fast)
and 0.5.7 (which is slow).

@jgm
Copy link
Owner

jgm commented May 12, 2014

Indeed the culprit appears to be commit
f239731

And looking into this further made me realize that something which my
earlier, beginner Haskell self thought would avoid repeated regex
compilations actually does not. So I see potential here for
dramatically improving h-k's performance, with a bit of rewriting!

@jgm
Copy link
Owner

jgm commented May 12, 2014

Messing around with this some more, I find that
just adding

st <- getState

to compileRegex slows things down by about a factor of 2!
Even if you don't do anything with the state. So that explains
the slowdown.

(I added some code to cache the compiled regexes, but because
it requires the cache to be read from state, it is still slow.)

Maybe this problem would go away if we compiled with optimizations.
I have disabled them in the cabal file because with optimizations
compiling highlighting-kate fails on a lot of machines due to resource
exhaustion.

jgm added a commit that referenced this pull request May 12, 2014
This patch includes everything from the patch by mcmtroffaes
in #42 except the compUTF8 option itself, which is commented
out pending release of a version of pcre-regex-builtin that supports
it.

When a supporting version is released, we can remove the comment
here, conditionally on the version of pcre-regex-builtin.

See #42.
@jgm
Copy link
Owner

jgm commented May 12, 2014

The performance issue is handled by 2240f6f.

@mcmtroffaes
Copy link
Contributor Author

Awesome, thanks a lot for your efforts!

@audreyt
Copy link

audreyt commented May 12, 2014

The changes in the regex-pcre-builtin side is released to hackage as of 0.94.4.8.8.35.
Thanks @mcmtroffaes and @jgm !

@jgm jgm closed this in 46bfd1c May 12, 2014
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants