-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add conditional and passmissing #89
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Is performance always good or are there allocations in some cases?
src/Missings.jl
Outdated
P(xs...; kw...) ? X(xs...; kw...) : Y(xs...; kw...) | ||
|
||
_passmissing_predicate(xs...;kw...) = | ||
any(ismissing.(xs)) || any(ismissing.(values(values(kw)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's more efficient to pass ismissing
as a function argument to any
rather than broadcasting, i.e.
any(ismissing, xs) || any(ismissing, values(values(kw)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. In any case, this really needs to be unrolled, specializing on the number of arguments, or it will be too slow. EDIT: I mean that the compiler should hopefully be able to do that automatically.
src/Missings.jl
Outdated
missing | ||
2.0 | ||
""" | ||
passmissing(f::Base.Callable) = PassMissing{f}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be typeof(f)
?
BTW, have you done some benchmarking to check there are no allocations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK if I wanted to use typeof{f}
then I would have to store f
in a struct which would be slower and would allocate more.
I will post benchmarks for the current implementation in the main thread so that they do not disappear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in most practical uses, the struct would never allocate. It's the same idea w/ the new iteration protocol: every call to iterate
potentially returns a Tuple{T...}
, but most of the time those tuples and even intermediate allocated objects don't even get allocated.
|
||
# Examples | ||
```jldoctest | ||
julia> passmissing(sqrt).([missing, 4]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be useful to start with a simpler example with a single scalar (first non-missing, and then missing).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
Here are the benchmarks. In general, if I understand it correctly, splatting slows it down in more complex scenarios. Fortunately for single positional argument this is fast (and I would say this is a most common case). We could define functions like:
to handle more complex cases faster. I do not know if we want to as, hopefully, Julia should be able to handle such splatting as used in this PR fast (not sure though). Benchmark 1: simple scenario
Benchmark 2: complex scenario
EDIT: removed wrong copy-paste |
Using |
Of course some independent benchmarks are welcome (but at least looking at CC @pszufe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the implementation could be a bit more idiomatic (I left some comments w/ suggestions).
@@ -1,7 +1,7 @@ | |||
module Missings | |||
|
|||
export allowmissing, disallowmissing, ismissing, missing, missings, | |||
Missing, MissingException, levels, coalesce | |||
Missing, MissingException, levels, coalesce, passmissing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally prefer Missings.propagate
if we're still bikeshedding the name here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like propagate
too, the only gripe is that it's quite long. But well...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But do we want to export it or not? If yes then propagate
is ok, but Missings.propagate
is a bit long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately propagate
is quite general to claim it for this feature...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know, but that is why I thought the idea was propagatemissing
which tab-completes.
src/Missings.jl
Outdated
@@ -165,4 +165,30 @@ function levels(x) | |||
levs | |||
end | |||
|
|||
struct PassMissing{F} <: Function end | |||
|
|||
@generated (::PassMissing{F})(xs...;kw...) where {F} = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the @generated
is doing anything here except ensuring a new function gets compiled w/ every call.
src/Missings.jl
Outdated
@@ -165,4 +165,30 @@ function levels(x) | |||
levs | |||
end | |||
|
|||
struct PassMissing{F} <: Function end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we probably want:
struct PassMissing{F} <: Function
f::F
end
src/Missings.jl
Outdated
missing | ||
2.0 | ||
""" | ||
passmissing(f::Base.Callable) = PassMissing{f}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in most practical uses, the struct would never allocate. It's the same idea w/ the new iteration protocol: every call to iterate
potentially returns a Tuple{T...}
, but most of the time those tuples and even intermediate allocated objects don't even get allocated.
src/Missings.jl
Outdated
struct PassMissing{F} <: Function end | ||
|
||
@generated (::PassMissing{F})(xs...;kw...) where {F} = | ||
:(any(ismissing, xs) || any(ismissing, values(values(kw))) ? missing : F(xs...; kw...)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, the inclusion of keyword arguments here seems off to me; what if missing
is a totally valid thing to pass as a keyword argument, but now it's interacting weird w/ PassMissing
because it always makes my result missing
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my stab at this definition would be more along the lines of:
struct PassMissing{F} <: Function
f::F
end
function (f::PassMissing{F})(x) where {F}
if @generated
return x === Missing ? missing : :(f.f(x))
else
return x === missing ? missing : f.f(x)
end
end
function (f::PassMissing{F})(xs...; kw...) where {F}
if @generated
for T in xs
T === Missing && return missing
end
return :(f.f(xs...; kw...))
else
return any(ismissing, xs) ? missing : f.f(xs...; kw...)
end
end
with specialize methods for 1 argument, 2 argument, maybe 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Though with generated functions you don't need to specialize on the number of arguments.
Also why do you think keyword arguments should be excluded from the missingness check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the comments. I have done additional benchmarking and it looks you are right with the optimizations 👍 (I did some tests earlier that did allocate).
I do not understand how and why this if @generated
part works.
In particular e.g. why is it different from just writing:
@generated function (f::PassMissing{F})(x) where {F}
return x === Missing ? missing : :(f.f(x))
end
Could you please explain?
Regarding other issues:
- function name: I am open to anything (not a native)
- I was also unclear if we want to include keyword arguments as in general it is somewhat arbitrary what user sets as positional argument and what as keyword argument (especially as kwargs do not require to have default values).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The if @generated
part is just a future-proof against a time when you're able to statically compile a julia program into an executable that can run without LLVM (i.e. w/o runtime compilation). The else
block just provides a path that would be executed if the compiler encountered an already non-generated/compiled method for the given arguments at runtime.
For keyword arguments, it just feels more arbitrary. Like, it's hard to imagine a case where I'd be map
ing over some Vector{Union{Float64, Missing}}
and Vector{Union{Int, Missing}}
and be doing something like map((x, y)->round(x; digits=y), zip(A, B))
. i.e. when would I maybe pass a value vs. maybe pass a missing
as a keyword argument? Keyword arguments also tend to use various "sentinel" values as signals or special values for the function to use, including missing
; the danger there being that someone writes a function foo(x...; sentinel=nothing)
, but when the user tries to call Missings.propagate(foo)(x; sentinel=missing)
, the entire result comes back missing
instead of passing missing
on to foo
as a valid sentinel value.
If anything, I'd say we leave keyword arguments out of the mix for now since they can always be added later.
My main issue with passmissing
is that pass
usually has a connotation of "ignore" or "do nothing" (see python's pass), whereas we're not really ignoring missing
values, we're propagating them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For keyword arguments, it just feels more arbitrary. Like, it's hard to imagine a case where I'd be
map
ing over someVector{Union{Float64, Missing}}
andVector{Union{Int, Missing}}
and be doing something likemap((x, y)->round(x; digits=y), zip(A, B))
. i.e. when would I maybe pass a value vs. maybe pass amissing
as a keyword argument? Keyword arguments also tend to use various "sentinel" values as signals or special values for the function to use, includingmissing
; the danger there being that someone writes a functionfoo(x...; sentinel=nothing)
, but when the user tries to callMissings.propagate(foo)(x; sentinel=missing)
, the entire result comes backmissing
instead of passingmissing
on tofoo
as a valid sentinel value.
I don't think there are any examples of functions using missing
as a sentinel currently, right? We rather want people to use nothing
for that.
I see your point about keyword arguments generally being about options, while the data is passed as positional arguments, but I'd find it problematic to make this a rule. Overall it doesn't seem we have strong use cases for either behavior, and in such situations I tend to favor the simplest rule (i.e. "return missing
if one of the arguments is missing
").
If anything, I'd say we leave keyword arguments out of the mix for now since they can always be added later.
The problem is, changing that would be breaking. The only way to be able to choose any behavior later is to throw an error if a keyword argument is missing
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only way to be able to choose any behavior later is to throw an error if a keyword argument is missing.
We could define lift for functions that do not take keyword arguments for the time being. It is easy enough to wrap a function requiring kwargs in an anonymous function.
Did this ever make its way into master? Having a consistent way to "wrap" routines that might have to deal with missing values in DataFrames data cleaning would be pretty nice. Right now, the absence of missing-aware Date methods is kind of a pain... |
I was not clear what the best API would be so I left it hanging. @tcovert do you have any opinion what would be best given the discussion above? |
Having looked at the options I recommend:
@nalimilan I we accept this PR the side effect will be that DataFrames.jl will re-export this (which is actually nice I think) - I will add it to the documentation and tutorial. |
Codecov Report
@@ Coverage Diff @@
## master #89 +/- ##
===========================================
- Coverage 100% 88.46% -11.54%
===========================================
Files 1 1
Lines 38 52 +14
===========================================
+ Hits 38 46 +8
- Misses 0 6 +6
Continue to review full report at Codecov.
|
codecov gave a strange result (and the |
Sounds good for now. Can you document (and possibly test) that keyword arguments are not supported? |
Added. Thanks for looking at it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Do we want to merge it or wait for more feedback? |
There's been plenty of time for feedback, and the PR throws an error for keyword arguments, so we can discuss that later. |
This is a proposal of implementation of JuliaLang/julia#26661. Actually the best name for
conditional
would beifelse
but unfortunately it cannot be extended.