Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

match is type unstable #10550

Closed
sjkelly opened this issue Mar 18, 2015 · 21 comments
Closed

match is type unstable #10550

sjkelly opened this issue Mar 18, 2015 · 21 comments
Labels
missing data Base.missing and related functionality

Comments

@sjkelly
Copy link
Contributor

sjkelly commented Mar 18, 2015

I feel like this a duplicate issue, but I could not find it. A little ironic I guess.

julia> @code_warntype match(r"([M,G])([0-9.])+", "foo")
Variables:
  r::Regex
  s::ASCIIString

Body:
  begin  # regex.jl, line 134:
      GenSym(2) = box(Int32,checked_trunc_sint(Int32,0))
      GenSym(8) = (top(getfield))(s::ASCIIString,:data)::Array{UInt8,1}
      return match(r::Regex,$(Expr(:new, :((top(getfield))(Core,:UTF8String)::Type{UTF8String}), :((top(convert))(Array{UInt8,1},GenSym(8))))),1,GenSym(2))::Union(RegexMatch,Void)
  end::Union(RegexMatch,Void)

This behavior is documented, but I feel like it should not be the default for performance reasons. This seems like a use case for Nullables.

@sjkelly
Copy link
Contributor Author

sjkelly commented Mar 18, 2015

I should add that I am willing to work on this is there is some general consesus to change it.

@hayd
Copy link
Member

hayd commented Mar 18, 2015

+1... but I'm not sure how you can do this without breaking lots of existing code.

Edit: thinking about it pandas made this change to their str.match method, for the same reason.

@nalimilan
Copy link
Member

+1

@nalimilan
Copy link
Member

BTW, I've just bumped on the same pattern with Pkg.installed(). Performance is probably not critical for this kind of function, but if you start working on it you may want to look for other functions like this.

@StefanKarpinski
Copy link
Member

The big issue here is design. How do you design the equivalent functionality in a type-stable way? One possibility is using the relatively new Nullable type and always returning a Nullable. Still kind of awkward though.

I had at one point proposed an else clause for for loops that execute zero times and then writing this like so:

for m = match(r"([M,G])([0-9.])+", "foo")
    # do stuff with m
else
    # handle not matching
end

Then match would have to return an iterable object that matches zero or one times. We could also just do matches or something like that and if you only want one match, you break after the first match.

@quinnj
Copy link
Member

quinnj commented Mar 18, 2015

I think this is a great case for Nullable, it's actually been on my todo
list for a while to rework this, but if someone else wants to take a crack
at it, feel free!

On Wed, Mar 18, 2015 at 5:19 AM, Stefan Karpinski [email protected]
wrote:

The big issue here is design. How do you design the equivalent
functionality in a type-stable way? One possibility is using the relatively
new Nullable type and always returning a Nullable. Still kind of awkward
though.

I had at one point proposed an else clause for for loops that execute
zero times and then writing this like so:

for m = match(r"([M,G])([0-9.])+", "foo")
# do stuff with melse
# handle not matchingend

Then match would have to return an iterable object that matches zero or
one times. We could also just do matches or something like that and if
you only want one match, you break after the first match.


Reply to this email directly or view it on GitHub
#10550 (comment).

@tkelman
Copy link
Contributor

tkelman commented Mar 18, 2015

as mentioned here #1820 (comment) there's also the new pcre2 api and features to think about, if you're digging into regex functionality

@MikeInnes
Copy link
Member

I like the idea of for/else. It's a nice construct that would remove some ugly flags in various bits of code. I don't see any real downsides to having it, since it's not like it uses an extra keyword.

@nalimilan
Copy link
Member

Returning an iterable is also an appealing alternative, if it can be made as efficient as match when you only want the first match. That would allow merging match and matchall, and unify their return types.

EDIT: There's also search, which currently returns an empty range when no match is found, and thus would work with the for/else pattern.

@kmsquire
Copy link
Member

A while ago, I had implemented this as a do function (which was equivalent to @StefanKarpinski's suggestion, but without handling else). I liked it, but it wasn't accepted.

This behavior was actually one of my main motivations for creating Match.jl. If I had the time to learn lisp better, I would love to try to get some of that matching into the parser directly. (Any takers?)

@simonster
Copy link
Member

From a performance perspective, an iterable may be better than a Nullable. Until we have escape analysis, constructing a Nullable{RegexMatch} requires an additional allocation on the heap. (OTOH there are already so many heap allocated things inside a RegexMatch that maybe this doesn't matter.)

@simonster
Copy link
Member

There may not be a big performance advantage to making match type-stable anyway. There is no type ambiguity when accessing the fields of the return value since those fields don't exist on nothing.

@johnmyleswhite
Copy link
Member

My personal sense: the most important thing is that we standardize on a idiom for all functions that return results that may not exist. Nullable seems like the simplest approach, but I don't really care what we do as long as we always do the same thing. We can easily make Nullable iterable: this has been discussed before.

The heap allocation issue raised by @simonster is more serious, but I think we need to solve it since the use cases for something equivalent to Nullable are going to keep piling up.

@MikeInnes
Copy link
Member

In theory you could certainly work out that (m::Union(RegexMatch, Nothing)).match is a string, [and in practice, too!]

but I don't think the compiler is smart enough to do that yet. My understanding is that unions are equivalent to Any as far as optimisations go.

@johnmyleswhite
Copy link
Member

I think it's more complicated than that: Union(T, Union()) is the same as Union(T) as far as I understand.

@simonster
Copy link
Member

@one-more-minute Type inference is smart enough to figure that out:

julia> function f(str)
       x = match(r"\w", str)
       x.match
       end;

julia> code_typed(f, (UTF8String,))
1-element Array{Any,1}:
 :($(Expr(:lambda, Any[:str], Any[Any[:x],Any[Any[:str,UTF8String,0],Any[:x,Union(RegexMatch,Void),18]],Any[],Any[UTF8String,Regex,UTF8String,Regex]], :(begin  # none, line 1:
        GenSym(1) = r"\w"
        x = match(GenSym(1),str::UTF8String,1,(top(box))(Int32,(top(checked_trunc_sint))(Int32,0)))::Union(RegexMatch,Void) # line 2:
        return (top(getfield))(x::Union(RegexMatch,Void),:match)::SubString{UTF8String}
    end::SubString{UTF8String}))))

@johnmyleswhite typeof(nothing) is Void not Union(), but type inference knows that getfield(::Union(RegexMatch,Void),:match) can only return SubString{UTF8String} because there's no match field on nothing. So there's no dynamic dispatch in the IR and no additional allocation. The only overhead is that the field access occurs via jl_f_get_field instead of via a direct memory access, and this does not seem to be very expensive. Adding a type annotation to the result of match gives roughly a 3% improvement in performance for me.

@MikeInnes
Copy link
Member

julia> type Foo x::Int end

julia> type Bar y::Float64 end

julia> foo() = rand()<0.5? Foo(1) : Bar(1)
foo (generic function with 1 method)

julia> bar() = foo().x
bar (generic function with 1 method)

julia> @code_typed bar()
1-element Array{Any,1}:
 :($(Expr(:lambda, Any[], Any[Any[],Any[],Any[],Any[]], :(begin  # none, line 1:
        return (top(getfield))(foo()::Union(Foo,Bar),:x)::Int64
    end::Int64))))

That's pretty sweet, I had no idea we could do that kind of inference on field names. It even types it as Union() if the field doesn't exist, and correctly types it as Union(Float64, Int64) if both fields are named x.

@JeffBezanson
Copy link
Member

Cool isn't it :D

@oxinabox
Copy link
Contributor

Keen for this, it just catches me out everytime, because its not error I expect.

I wonder though if match should just throw an exception if it does not match.
and if try_match should return Nullable{RegexMatch} (as in #12055).

I think a Regex not matching is an exceptionable situation.
If you are only interested in if it matches or not, then you would have used ismatch.
So in general when the programmer calls match they believe it does match,
and they want more details about how it matches -- what are the capture groups, etc.

@yuyichao
Copy link
Contributor

So in general when the programmer calls match they believe it does match,
and they want more details about how it matches -- what are the capture groups, etc.

I don't think this is true. In my experience it is very common to first check if the string matches and then do further processing using the information of the matching.

@nalimilan nalimilan added the missing data Base.missing and related functionality label Sep 6, 2016
@KristofferC
Copy link
Member

KristofferC commented Apr 4, 2018

I think this can be closed now because in a turn of events this now matches what other functions do (e.g. findfirst).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
missing data Base.missing and related functionality
Projects
None yet
Development

No branches or pull requests