Add conditional and passmissing #89

Merged
merged 7 commits into from
Jan 17, 2019

Member

bkamins commented Aug 30, 2018

This is a proposed implementation of JuliaLang/julia#26661. The best name for conditional would actually be ifelse, but unfortunately it cannot be extended.

Member

nalimilan left a comment

Thanks. Is performance always good or are there allocations in some cases?

src/Missings.jl Outdated
P(xs...; kw...) ? X(xs...; kw...) : Y(xs...; kw...)

_passmissing_predicate(xs...;kw...) =
any(ismissing.(xs)) || any(ismissing.(values(values(kw))))

Member

It's more efficient to pass ismissing as a function argument to any rather than broadcasting, i.e.

    any(ismissing, xs) || any(ismissing, values(values(kw)))

Member

nalimilan Aug 31, 2018

Right. In any case, this really needs to be unrolled, specializing on the number of arguments, or it will be too slow. EDIT: I mean that the compiler should hopefully be able to do that automatically.
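
The difference between the two forms discussed above can be sketched as follows. This is only an illustration with made-up inputs (`v` and `xs` are hypothetical names, not taken from the PR): the broadcast form evaluates `ismissing` on every element before `any` runs, and for an array it also builds an intermediate Boolean array, while the predicate form tests elements one at a time and can stop at the first match.

```julia
# Hypothetical inputs, for illustration only.
v = Union{Float64, Missing}[1.0, missing, 3.0]
xs = (1.0, missing, "a")

# Broadcast form: `ismissing.(v)` is computed in full first
# (an intermediate Boolean array for array input), then reduced by `any`.
broadcasted = any(ismissing.(v))

# Predicate form: no intermediate collection; `any` can return
# as soon as the first missing element is found.
direct = any(ismissing, v)

# Both forms also accept a tuple of positional arguments.
tuple_result = any(ismissing, xs)
```

Both forms return the same answer; the predicate form simply avoids the extra work.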

src/Missings.jl Outdated
missing
2.0
"""
passmissing(f::Base.Callable) = PassMissing{f}()

Member

Shouldn't this be typeof(f)?

BTW, have you done some benchmarking to check there are no allocations?

Member Author

AFAIK if I wanted to use typeof(f), I would have to store f in a struct, which would be slower and would allocate more.

I will post benchmarks for the current implementation in the main thread so that they do not disappear.

Member

I think in most practical uses, the struct would never allocate. It's the same idea w/ the new iteration protocol: every call to iterate potentially returns a Tuple{T...}, but most of the time those tuples and even intermediate allocated objects don't even get allocated.


# Examples
```jldoctest
julia> passmissing(sqrt).([missing, 4])

Member

It would be useful to start with a simpler example with a single scalar (first non-missing, and then missing).

Member Author

added

Member Author

bkamins commented Sep 8, 2018

Here are the benchmarks. In general, if I understand correctly, splatting slows things down in more complex scenarios. Fortunately, the single-positional-argument case is fast (and I would say it is the most common case). We could define functions like:

(::PassMissing{F})(x1, x2) where {F} =
    ismissing(x1) || ismissing(x2) ? missing : F(x1, x2)

to handle more complex cases faster. I do not know if we want to, since Julia should hopefully be able to handle splatting as used in this PR efficiently (I am not sure, though).

Benchmark 1: simple scenario

julia> checkedsqrt(x) = x isa Missing ? missing : sqrt(x)
checkedsqrt (generic function with 1 method)

julia> @btime checkedsqrt(4.0)
  1.866 ns (0 allocations: 0 bytes)
2.0

julia> @btime checkedsqrt(missing)
  0.001 ns (0 allocations: 0 bytes)
missing

julia> checkedsqrt2(x) = passmissing(sqrt)(x)
checkedsqrt2 (generic function with 1 method)

julia> @btime checkedsqrt2(4.0)
  1.866 ns (0 allocations: 0 bytes)
2.0

julia> @btime checkedsqrt2(missing)
  0.001 ns (0 allocations: 0 bytes)
missing

Benchmark 2: complex scenario

julia> parsemissing(t, s, b) = t isa Missing || s isa Missing || b isa Missing ? missing : parse(t, s, base=b)
parsemissing (generic function with 1 method)

julia> @btime parsemissing(Int, "1234", 5)
  89.091 ns (0 allocations: 0 bytes)
194

julia> @btime parsemissing(Int, "1234", missing)
  0.001 ns (0 allocations: 0 bytes)
missing

julia> @btime parsemissing(missing, "1234", 5)
  0.001 ns (0 allocations: 0 bytes)
missing

julia> parsemissing2(t, s, b) = passmissing(parse)(t, s, base=b)
parsemissing2 (generic function with 1 method)

julia> @btime parsemissing2(Int, "1234", 5)
  752.883 ns (4 allocations: 96 bytes)
194

EDIT: removed wrong copy-paste

Member Author

bkamins commented Sep 8, 2018

Using @generated improves performance a bit.

Member Author

bkamins commented Sep 8, 2018

Of course some independent benchmarks are welcome (but at least looking at @code_warntype the generated version is type-inferred correctly as should be expected).

CC @pszufe

Member

quinnj left a comment

I think the implementation could be a bit more idiomatic (I left some comments w/ suggestions).

@@ -1,7 +1,7 @@
 module Missings

 export allowmissing, disallowmissing, ismissing, missing, missings,
-       Missing, MissingException, levels, coalesce
+       Missing, MissingException, levels, coalesce, passmissing

Member

I personally prefer Missings.propagate if we're still bikeshedding the name here.

Member

I like propagate too, the only gripe is that it's quite long. But well...

Member Author

But do we want to export it or not? If yes then propagate is ok, but Missings.propagate is a bit long.

Member

Unfortunately, propagate is too general a name to claim for this feature...

Member Author

I know, but that is why I thought the idea was propagatemissing which tab-completes.

src/Missings.jl Outdated
@@ -165,4 +165,30 @@ function levels(x)
levs
end

struct PassMissing{F} <: Function end

@generated (::PassMissing{F})(xs...;kw...) where {F} =

Member

I don't think the @generated is doing anything here except ensuring a new function gets compiled w/ every call.

src/Missings.jl Outdated
@@ -165,4 +165,30 @@ function levels(x)
levs
end

struct PassMissing{F} <: Function end

Member

I think we probably want:

struct PassMissing{F} <: Function
    f::F
end

src/Missings.jl Outdated
struct PassMissing{F} <: Function end

@generated (::PassMissing{F})(xs...;kw...) where {F} =
:(any(ismissing, xs) || any(ismissing, values(values(kw))) ? missing : F(xs...; kw...))

Member

Hmmm, the inclusion of keyword arguments here seems off to me; what if missing is a totally valid thing to pass as a keyword argument, but now it's interacting weird w/ PassMissing because it always makes my result missing?

Member

I think my stab at this definition would be more along the lines of:

struct PassMissing{F} <: Function
    f::F
end

function (f::PassMissing{F})(x) where {F}
    if @generated
        return x === Missing ? missing : :(f.f(x))
    else
        return x === missing ? missing : f.f(x)
    end
end

function (f::PassMissing{F})(xs...; kw...) where {F}
    if @generated
        for T in xs
            T === Missing && return missing
        end
        return :(f.f(xs...; kw...))
    else
        return any(ismissing, xs) ? missing : f.f(xs...; kw...)
    end
end

with specialized methods for 1 argument, 2 arguments, maybe 3.

Member

Interesting. Though with generated functions you don't need to specialize on the number of arguments.

Also why do you think keyword arguments should be excluded from the missingness check?

Member Author

Thank you for the comments. I have done additional benchmarking and it looks like you are right about the optimizations 👍 (some tests I ran earlier did allocate).

I do not understand how and why this if @generated part works.
In particular, why is it different from just writing:

@generated function (f::PassMissing{F})(x) where {F}
    return x === Missing ? missing : :(f.f(x))
end

Could you please explain?

Regarding other issues:

  1. function name: I am open to anything (I am not a native speaker)
  2. I was also unsure whether we want to include keyword arguments, since it is somewhat arbitrary what the user defines as a positional argument and what as a keyword argument (especially as kwargs are not required to have default values).

Member

The if @generated part is just future-proofing against a time when you're able to statically compile a Julia program into an executable that can run without LLVM (i.e. w/o runtime compilation). The else block just provides a path that would be executed if the compiler encountered an already non-generated/compiled method for the given arguments at runtime.
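
To make the two branches concrete, here is a minimal sketch of an optionally generated method. The name `Lifted` is hypothetical and the code is not the PR's implementation. In the generated branch the argument name `x` is bound to the argument's type and the branch returns an expression. The ordinary branch works on the value and must compute the same result, because the compiler is free to use either one.

```julia
# Hypothetical wrapper type, for illustration only.
struct Lifted{F} <: Function
    f::F
end

function (l::Lifted)(x)
    if @generated
        # Here `x` is the *type* of the argument, so the missingness check
        # is resolved once per argument type and only the chosen expression
        # is compiled.
        return x === Missing ? :(missing) : :(l.f(x))
    else
        # Ordinary fallback: `x` is the value; same result as above.
        return x === missing ? missing : l.f(x)
    end
end

lifted_sqrt = Lifted(sqrt)
value_result = lifted_sqrt(9.0)
missing_result = lifted_sqrt(missing)
```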

For keyword arguments, it just feels more arbitrary. Like, it's hard to imagine a case where I'd be mapping over some Vector{Union{Float64, Missing}} and Vector{Union{Int, Missing}} and be doing something like map((x, y)->round(x; digits=y), A, B). i.e. when would I maybe pass a value vs. maybe pass a missing as a keyword argument? Keyword arguments also tend to use various "sentinel" values as signals or special values for the function to use, including missing; the danger there being that someone writes a function foo(x...; sentinel=nothing), but when the user tries to call Missings.propagate(foo)(x; sentinel=missing), the entire result comes back missing instead of passing missing on to foo as a valid sentinel value.

If anything, I'd say we leave keyword arguments out of the mix for now since they can always be added later.

My main issue with passmissing is that pass usually has a connotation of "ignore" or "do nothing" (see Python's pass), whereas we're not really ignoring missing values, we're propagating them.
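
The sentinel concern raised above can be sketched as follows. Everything here is hypothetical: `describe_value` stands in for a function that accepts `missing` as a legitimate keyword value, and `KwCheckingWrapper` stands in for a wrapper that also inspects keyword arguments, as the earlier revision of the PR did.

```julia
# Hypothetical function where `missing` is a valid sentinel value.
describe_value(x; sentinel=nothing) = x === sentinel ? "sentinel" : "value"

# Hypothetical wrapper that checks keyword arguments as well as
# positional ones.
struct KwCheckingWrapper{F} <: Function
    f::F
end

(w::KwCheckingWrapper)(xs...; kw...) =
    (any(ismissing, xs) || any(ismissing, values(values(kw)))) ?
        missing : w.f(xs...; kw...)

wrapped = KwCheckingWrapper(describe_value)

# Called directly, the sentinel reaches the function as intended.
direct_call = describe_value(1; sentinel=missing)

# Through the wrapper, the keyword value short-circuits the call and the
# function never sees the sentinel.
wrapped_call = wrapped(1; sentinel=missing)
```

Under these assumptions `direct_call` is an ordinary string while `wrapped_call` is `missing`, which is the surprising interaction described in the comment.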

Member

> For keyword arguments, it just feels more arbitrary. Like, it's hard to imagine a case where I'd be mapping over some Vector{Union{Float64, Missing}} and Vector{Union{Int, Missing}} and be doing something like map((x, y)->round(x; digits=y), A, B). i.e. when would I maybe pass a value vs. maybe pass a missing as a keyword argument? Keyword arguments also tend to use various "sentinel" values as signals or special values for the function to use, including missing; the danger there being that someone writes a function foo(x...; sentinel=nothing), but when the user tries to call Missings.propagate(foo)(x; sentinel=missing), the entire result comes back missing instead of passing missing on to foo as a valid sentinel value.

I don't think there are any examples of functions using missing as a sentinel currently, right? We rather want people to use nothing for that.

I see your point about keyword arguments generally being about options, while the data is passed as positional arguments, but I'd find it problematic to make this a rule. Overall it doesn't seem we have strong use cases for either behavior, and in such situations I tend to favor the simplest rule (i.e. "return missing if one of the arguments is missing").

> If anything, I'd say we leave keyword arguments out of the mix for now since they can always be added later.

The problem is, changing that would be breaking. The only way to be able to choose any behavior later is to throw an error if a keyword argument is missing.

Member Author

> The only way to be able to choose any behavior later is to throw an error if a keyword argument is missing.

We could define lift for functions that do not take keyword arguments for the time being. It is easy enough to wrap a function requiring kwargs in an anonymous function.


tcovert commented Jan 6, 2019

Did this ever make its way into master? Having a consistent way to "wrap" routines that might have to deal with missing values in DataFrames data cleaning would be pretty nice. Right now, the absence of missing-aware Date methods is kind of a pain...

Member Author

bkamins commented Jan 6, 2019

I was not clear what the best API would be so I left it hanging.

@tcovert do you have any opinion what would be best given the discussion above?

Member Author

bkamins commented Jan 8, 2019

Having looked at the options I recommend:

  1. to use the design proposed by @quinnj
  2. not to accept keyword arguments at all; we can add this support later (that change will be non-breaking) and there is no strong consensus what to do with them for now; I would say that most use cases will not involve keyword arguments anyway
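
A minimal sketch of the recommended design, assuming a wrapper that stores `f` in a field and defines only a positional-argument method. The names `PassMissingSketch` and `passmissing_sketch` are hypothetical, and the merged code may differ (for example, it may use `if @generated` and report keyword arguments differently). In this sketch, keyword arguments are rejected simply because no keyword-accepting method exists.

```julia
# Hypothetical names; a sketch of the field-storing design only.
struct PassMissingSketch{F} <: Function
    f::F
end

# Positional arguments only: return `missing` if any argument is missing,
# otherwise forward the call to the wrapped function.
(p::PassMissingSketch)(xs...) = any(ismissing, xs) ? missing : p.f(xs...)

passmissing_sketch(f) = PassMissingSketch(f)

scalar_value = passmissing_sketch(sqrt)(4.0)
scalar_missing = passmissing_sketch(sqrt)(missing)
broadcasted = passmissing_sketch(sqrt).([missing, 4])
multi_arg = passmissing_sketch(parse)(Int, "1234")

# No method accepts keyword arguments, so this call raises a MethodError.
kwargs_rejected = try
    passmissing_sketch(round)(1.234; digits=2)
    false
catch err
    err isa MethodError
end
```

A function that needs keyword arguments can still be wrapped through an anonymous function, for example `passmissing_sketch(x -> round(x; digits=2))`.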

@nalimilan If we accept this PR, the side effect will be that DataFrames.jl will re-export this (which is actually nice, I think). I will add it to the documentation and tutorial.


codecov-io commented Jan 8, 2019

Codecov Report

Merging #89 into master will decrease coverage by 11.53%.
The diff coverage is 80%.

@@             Coverage Diff             @@
##           master      #89       +/-   ##
===========================================
- Coverage     100%   88.46%   -11.54%     
===========================================
  Files           1        1               
  Lines          38       52       +14     
===========================================
+ Hits           38       46        +8     
- Misses          0        6        +6

| Impacted Files | Coverage Δ |
|---|---|
| src/Missings.jl | 88.46% <80%> (-11.54%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1f37ce9...8c0bd5d.

Member Author

bkamins commented Jan 8, 2019

codecov gave a strange result (and the else part of if @generated is not covered by tests)

@nalimilan
Member

Sounds good for now. Can you document (and possibly test) that keyword arguments are not supported?

Member Author

bkamins commented Jan 9, 2019

Added. Thanks for looking at it.

Member

nalimilan left a comment

Thanks!

Member Author

bkamins commented Jan 17, 2019

Do we want to merge it or wait for more feedback?

@nalimilan nalimilan merged commit 9cca49f into JuliaData:master Jan 17, 2019
@nalimilan
Member

There's been plenty of time for feedback, and the PR throws an error for keyword arguments, so we can discuss that later.

@bkamins bkamins deleted the passmissing branch January 17, 2019 15:22
6 participants