define getindex on regex matches to return captures. #11566

Merged: 4 commits, Jul 2, 2015
Changes from 2 commits
19 changes: 19 additions & 0 deletions base/pcre.jl
@@ -140,4 +140,23 @@ function substring_number_from_name(re, name)
(Ptr{Void}, Cstring), re, name)
end

function capture_names(re)
    name_count = info(re, INFO_NAMECOUNT, UInt32)
    name_entry_size = info(re, INFO_NAMEENTRYSIZE, UInt32)
    nametable_ptr = info(re, INFO_NAMETABLE, Ptr{UInt8})
    names = Dict{Int, ASCIIString}()
    for i=1:name_count
        offset = (i-1)*name_entry_size + 1
        # The capture group index corresponding to name 'i' is stored as a
        # big-endian 16-bit value.
        high_byte = UInt16(unsafe_load(nametable_ptr, offset))
        low_byte = UInt16(unsafe_load(nametable_ptr, offset+1))
        idx = (high_byte << 8) | low_byte
        # The capture group name is a null-terminated string located directly
        # after the index.
        names[idx] = bytestring(nametable_ptr+offset+1)
    end
    names
end

end # module
24 changes: 21 additions & 3 deletions base/regex.jl
@@ -15,6 +15,8 @@ type Regex
extra::Ptr{Void}
ovec::Vector{Csize_t}
match_data::Ptr{Void}
capture_name_to_idx::Dict{Symbol, Int}
idx_to_capture_name::Dict{Int, Symbol}
Member

I think it would be better to do the computation on demand rather than caching it. A linear scan over the nametable will probably take about the same time as this dict lookup, but use considerably less memory.

Member

+1. This is not going to be something you want to use for truly high-performance code anyway – for that you'll want to do the indexed lookup.

Contributor Author

Just to clarify, are you proposing eliminating both the index->name and the name->index dict from the Regex type, or just one of them?

I guess I didn't think memory was a significant concern here, since a user is probably not creating a lot of regex objects, and even if they did, the nametable dictionary would be a small increase in the total memory taken up by the regex object (which includes the original regex and the JITed regex program). For the few dozen bytes it takes to store the nametable in the regex object, you get much faster capture group extraction from match objects versus re-extracting the nametable from PCRE's internal representation.

I don't feel strongly though so I'll implement whatever you guys think is best.

Member

Both, and mostly because doing the linear scan against the internal representation should be at least as fast as this Dict lookup for all reasonable regexes. It's fine to ccall strncmp for this, so you don't need to constantly re-extract the table into a Julia string.
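
For illustration, the linear scan suggested here could look roughly like the sketch below. It is not the code from this PR. It assumes the name table layout described in the `capture_names` comments (fixed-size entries, a big-endian 16-bit group index followed by a null-terminated name), works on an in-memory byte vector instead of PCRE's pointer, and is written in current Julia syntax. The names `build_name_table` and `group_index`, and the convention of returning 0 for an unknown name, are hypothetical.

```julia
# Hypothetical helper: lay out (index, name) pairs the way the PCRE name
# table is described above, so the scan can be tried without calling PCRE.
function build_name_table(pairs::Vector{Pair{Int,String}}, entry_size::Int)
    table = zeros(UInt8, entry_size * length(pairs))
    for (k, (idx, name)) in enumerate(pairs)
        base = (k - 1) * entry_size
        table[base + 1] = UInt8(idx >> 8)      # high byte of the group index
        table[base + 2] = UInt8(idx & 0xff)    # low byte of the group index
        bytes = codeunits(name)
        # The name follows the index; remaining bytes stay zero (NUL padding).
        table[base + 3 : base + 2 + length(bytes)] .= bytes
    end
    return table
end

# Hypothetical lookup: scan the entries one by one and compare names
# byte-wise, without building a Dict or allocating a string per entry.
# Returns 0 when no entry matches (an assumption made for this sketch).
function group_index(table::Vector{UInt8}, entry_size::Int, name::AbstractString)
    target = codeunits(String(name))
    for base in 0:entry_size:(length(table) - entry_size)
        idx = (Int(table[base + 1]) << 8) | Int(table[base + 2])
        # Find the end of the null-terminated name inside this entry.
        stop = base + 3
        while stop <= base + entry_size && table[stop] != 0x00
            stop += 1
        end
        if view(table, base + 3 : stop - 1) == target
            return idx
        end
    end
    return 0
end

# Table for a pattern like (?<a>.)(.)(?<b>.): "a" is group 1, "b" is group 3.
table = build_name_table([1 => "a", 3 => "b"], 4)
idx_a = group_index(table, 4, "a")
idx_b = group_index(table, 4, "b")
idx_missing = group_index(table, 4, "zz")
```

For the handful of named groups a typical regex has, a scan like this touches only a few dozen bytes, which is the basis for the claim that it should be competitive with a Dict lookup.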

Contributor Author

OK, makes sense.

Member

looking at the pcre2 api today, it looks like there is a function for doing exactly this: pcre2_substring_number_from_name_8

Contributor Author

Yes, if we're not going the route of parsing the name table at regex compile time into a native Julia representation, then we might as well use the PCRE convenience methods.

Contributor Author

Actually, the performance penalty is probably negligible/non-existent for using the convenience methods even if we did extract the whole name table at compile time.

Contributor Author

Alright, I'm no longer caching the capture names in the regex table. The only time memory is allocated now is when show is called on a Match object.



function Regex(pattern::AbstractString, compile_options::Integer,
@@ -29,7 +31,8 @@ type Regex
throw(ArgumentError("invalid regex match options: $match_options"))
end
re = compile(new(pattern, compile_options, match_options, C_NULL,
-C_NULL, Csize_t[], C_NULL))
+C_NULL, Csize_t[], C_NULL,
+Dict{Symbol, Int}(), Dict{Int, Symbol}()))
finalizer(re, re->begin
re.regex == C_NULL || PCRE.free_re(re.regex)
re.match_data == C_NULL || PCRE.free_match_data(re.match_data)
@@ -57,6 +60,10 @@ function compile(regex::Regex)
PCRE.jit_compile(regex.regex)
regex.match_data = PCRE.create_match_data(regex.regex)
regex.ovec = PCRE.get_ovec(regex.match_data)
for (idx, name) in PCRE.capture_names(regex.regex)
regex.capture_name_to_idx[Symbol(name)] = idx
regex.idx_to_capture_name[idx] = Symbol(name)
end
end
regex
end
@@ -92,6 +99,7 @@ immutable RegexMatch
captures::Vector{Union(Void,SubString{UTF8String})}
offset::Int
offsets::Vector{Int}
regex::Regex
end

function show(io::IO, m::RegexMatch)
@@ -100,7 +108,10 @@ function show(io::IO, m::RegexMatch)
if !isempty(m.captures)
print(io, ", ")
for i = 1:length(m.captures)
-print(io, i, "=")
+# If the capture group is named, show the name.
+# Otherwise show its index.
+capture_name = get(m.regex.idx_to_capture_name, i, i)
+print(io, capture_name, "=")
show(io, m.captures[i])
if i < length(m.captures)
print(io, ", ")
@@ -110,6 +121,13 @@ function show(io::IO, m::RegexMatch)
print(io, ")")
end

# Capture group extraction
getindex(m::RegexMatch, idx::Int) = m.captures[idx]
function getindex(m::RegexMatch, name::Symbol)
    m[m.regex.capture_name_to_idx[name]]
end
getindex(m::RegexMatch, name::AbstractString) = m[Symbol(name)]

function ismatch(r::Regex, s::AbstractString, offset::Integer=0)
compile(r)
return PCRE.exec(r.regex, bytestring(s), offset, r.match_options,
@@ -136,7 +154,7 @@ function match(re::Regex, str::UTF8String, idx::Integer, add_opts::UInt32=UInt32
cap = Union(Void,SubString{UTF8String})[
ovec[2i+1] == PCRE.UNSET ? nothing : SubString(str, ovec[2i+1]+1, ovec[2i+2]) for i=1:n ]
off = Int[ ovec[2i+1]+1 for i=1:n ]
-RegexMatch(mat, cap, ovec[1]+1, off)
+RegexMatch(mat, cap, ovec[1]+1, off, re)
end

match(re::Regex, str::Union(ByteString,SubString), idx::Integer, add_opts::UInt32=UInt32(0)) =
4 changes: 4 additions & 0 deletions test/regex.jl
@@ -37,3 +37,7 @@ show(buf, r"")
# regex match / search string must be a ByteString
@test_throws ArgumentError match(r"test", utf32("this is a test"))
@test_throws ArgumentError search(utf32("this is a test"), r"test")

# Named subpatterns
m = match(r"(?<a>.)(.)(?<b>.)", "xyz")
@test (m[:a], m[2], m["b"]) == ("x", "y", "z")
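
As a quick usage sketch of what the test above exercises, the snippet below indexes a match by symbol, by position, and by string, and then prints the match. It assumes a recent Julia release in which this indexing behavior is available. The variable names are illustrative only.

```julia
# Same pattern as the test: two named groups around one unnamed group.
m = match(r"(?<a>.)(.)(?<b>.)", "xyz")

by_symbol = m[:a]     # named group looked up with a Symbol
by_index  = m[2]      # unnamed group looked up by position
by_string = m["b"]    # named group looked up with a string

# show labels named groups by name and unnamed groups by index.
shown = sprint(show, m)
println(shown)
```

Indexing by position avoids the name lookup entirely, which is the route the reviewers recommend for performance-sensitive code.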