Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Regex improvements #4002

Merged
merged 3 commits into from
Aug 9, 2013
Merged

RFC: Regex improvements #4002

merged 3 commits into from
Aug 9, 2013

Conversation

dcjones
Copy link
Contributor

@dcjones dcjones commented Aug 9, 2013

This is a few changes I'd like to make to regex, which I can pick apart if there's not a consensus. This includes:

  1. A dramatically faster matchall function that returns a SubString array. On John's benchmark in matchall is very slow #3719 elapsed time goes from 5.63 seconds to 0.16 seconds. This is also about twice as fast as python. The version here fixes the code I originally posted, handling the weird empty match cases that eachmatch handles.
  2. match returns and captures SubString objects.
  3. PCRE's JIT is used in eachmatch and matchall (but not in match). This was a pretty significant improvement in the matchall is very slow #3719 benchmark. I know in Use jit in pcre #324 it was decided not to use the jit, but using it just in eachmatch and matchall avoids the serialization issue and is more likely to have a pay off since the pattern is typically going to be applied many times.
  4. Fix some weird special cases in eachmatch E.g. this previously incorrectly threw an exception: collect(eachmatch(r"a?b?", "asbd", true)).

match::ByteString
captures::Vector{Union(Nothing,ByteString)}
match::SubString
captures::Vector{Union(Nothing,SubString)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be more specific since these will always be SubStrings of ByteStrings – maybe we should convert all strings to UTF8String first and then this can be SubString{UTF8String}, which is completely concrete. Of course, there's still the Nothing issue, but at least in the cases where the match isn't Nothing we'll know exactly what it is.

@StefanKarpinski
Copy link
Member

This is great stuff. Thanks so much for doing it.

@JeffBezanson
Copy link
Member

very cool!

@StefanKarpinski
Copy link
Member

I get this when I run make testall:

    JULIA test/all
ERROR: no method connect(SubString{ASCIIString},Uint16)
 in create_worker at multi.jl:1014
 in start_cluster_workers at multi.jl:978
 in addprocs_internal at multi.jl:1161
 in addprocs at multi.jl:1164
 in include_from_node1 at loading.jl:92
 in process_options at client.jl:274
 in _start at client.jl:349
at /Users/stefan/projects/julia/test/runtests.jl:14

Might not show up on systems with only one core.

@StefanKarpinski
Copy link
Member

Travis is getting the same error: https://travis-ci.org/JuliaLang/julia/jobs/10039876#L697.

@dcjones
Copy link
Contributor Author

dcjones commented Aug 9, 2013

Ugh, I forgot to run the non-regex tests. Fixed now.

@StefanKarpinski
Copy link
Member

I'm actually fixing this by generalizing connect's interface, which is unduly type-restrictive. This ended up cascading down into other functions like getaddrinfo, which ought to accept any kind of string, transcoding to ASCII as needed.

@dcjones
Copy link
Contributor Author

dcjones commented Aug 9, 2013

Ok, I'll rebase this after you've rearranged the types in connect.

@StefanKarpinski
Copy link
Member

We've both independently fixed this problem now.

@StefanKarpinski
Copy link
Member

You probably don't need to rebase, but I guess there's no harm if you prefer.

@StefanKarpinski
Copy link
Member

One of the builds passed and the other seems to have an unrelated failure. Merging.

StefanKarpinski added a commit that referenced this pull request Aug 9, 2013
@StefanKarpinski StefanKarpinski merged commit f1326b5 into master Aug 9, 2013
@StefanKarpinski StefanKarpinski deleted the dcj/matchall branch November 16, 2013 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants