RFC: Regex improvements #4002
-    match::ByteString
-    captures::Vector{Union(Nothing,ByteString)}
+    match::SubString
+    captures::Vector{Union(Nothing,SubString)}
This could be more specific since these will always be SubStrings of ByteStrings. Maybe we should convert all strings to UTF8String first, and then this can be SubString{UTF8String}, which is completely concrete. Of course, there's still the Nothing issue, but at least in the cases where the match isn't Nothing we'll know exactly what it is.
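To illustrate the concreteness point, here is a minimal sketch. It assumes a current Julia 1.x release, where String stands in for the UTF8String of this thread and unions are written Union{...} rather than Union(...); it is not the code from this PR.

```julia
# Assumption: Julia 1.x. String plays the role UTF8String played in this thread.

# SubString without a parameter leaves the parent string type open,
# so a field declared this way is not concretely typed.
loose_field = SubString

# Fixing the parent type makes the field type fully concrete.
tight_field = SubString{String}

# A capture may still be absent, so the element type stays a small union.
capture_type = Union{Nothing, SubString{String}}

println(isconcretetype(loose_field))
println(isconcretetype(tight_field))
println(isconcretetype(capture_type))
```

The union for captures is still not concrete, which is the remaining "Nothing issue" mentioned above, but every non-Nothing element has exactly one known type.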
This is great stuff. Thanks so much for doing it.
very cool!
I get this when I run JULIA test/all:
ERROR: no method connect(SubString{ASCIIString},Uint16)
in create_worker at multi.jl:1014
in start_cluster_workers at multi.jl:978
in addprocs_internal at multi.jl:1161
in addprocs at multi.jl:1164
in include_from_node1 at loading.jl:92
in process_options at client.jl:274
in _start at client.jl:349
at /Users/stefan/projects/julia/test/runtests.jl:14
Might not show up on systems with only one core.
Travis is getting the same error: https://travis-ci.org/JuliaLang/julia/jobs/10039876#L697.
Ugh, I forgot to run the non-regex tests. Fixed now.
I'm actually fixing this by generalizing connect's interface, which is unduly type-restrictive. This ended up cascading down into other functions like getaddrinfo, which ought to accept any kind of string, transcoding to ASCII as needed.
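A rough sketch of why a narrowly typed signature rejects a SubString host, and how widening it helps. The two functions below are hypothetical stand-ins written for current Julia 1.x; they are not the real connect or getaddrinfo, and they only format a string instead of opening a socket.

```julia
# Hypothetical stand-ins, NOT the real Base/Sockets functions.

# Narrow signature: only a plain String host and a UInt16 port are accepted.
restrictive_endpoint(host::String, port::UInt16) = string(host, ":", Int(port))

# Wider signature: any string type and any integer port are accepted;
# the host is converted to a plain String internally.
general_endpoint(host::AbstractString, port::Integer) =
    string(String(host), ":", Int(port))

line = "127.0.0.1:9009"
host = SubString(line, 1, 9)   # a view of "127.0.0.1" inside `line`
port = UInt16(9009)

# The narrow method does not apply to a SubString host.
println(applicable(restrictive_endpoint, host, port))

# The wide method accepts it.
println(general_endpoint(host, port))
```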
Ok, I'll rebase this after you've rearranged the types in connect.
We've both independently fixed this problem now.
You probably don't need to rebase, but I guess there's no harm if you prefer.
One of the builds passed and the other seems to have an unrelated failure. Merging.
RFC: Regex improvements
These are a few changes I'd like to make to regex, which I can pick apart if there's not a consensus. This includes:
- A matchall function that returns a SubString array. On John's benchmark in "matchall is very slow" (#3719), elapsed time goes from 5.63 seconds to 0.16 seconds. This is also about twice as fast as Python. The version here fixes the code I originally posted, handling the weird empty match cases that eachmatch handles.
- match returns and captures SubString objects.
- Use of the PCRE JIT in eachmatch and matchall (but not in match). This was a pretty significant improvement in the "matchall is very slow" (#3719) benchmark. I know in "Use jit in pcre" (#324) it was decided not to use the JIT, but using it just in eachmatch and matchall avoids the serialization issue and is more likely to pay off, since the pattern is typically going to be applied many times.
- A fix to eachmatch. E.g. this previously incorrectly threw an exception: collect(eachmatch(r"a?b?", "asbd", true)).
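For readers on a current Julia release, the behaviour described above can be sketched as follows. This assumes Julia 1.x rather than the 2013-era API in this PR: there the overlap flag is the keyword argument overlap instead of a third positional argument, and matchall is no longer in Base, so matchall_sketch is a hypothetical helper built on eachmatch.

```julia
# Assumption: Julia 1.x syntax, not the API as it stood in this PR.

# match returns SubString views into the original string; an optional
# group that did not participate in the match is captured as `nothing`.
m = match(r"(\d+)-(x)?", "id 42-")
println(typeof(m.match))
println(m.captures)

# Hypothetical matchall-style helper: collect the matched substrings.
matchall_sketch(re::Regex, s::AbstractString; overlap::Bool=false) =
    [mm.match for mm in eachmatch(re, s; overlap=overlap)]

words = matchall_sketch(r"\w+", "fast regex matching")
println(words)

# This pattern can match the empty string. The PR notes that the
# equivalent overlapping call used to throw; here it should simply
# return a list of matches, some of them empty.
ms = matchall_sketch(r"a?b?", "asbd"; overlap=true)
println(length(ms))
```

Returning SubString views avoids copying each match out of the subject string, which is consistent with the speedup reported for the #3719 benchmark, though the sketch does not measure it.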