add lift function for working with missing values #26661

bkamins · 2018-03-30T10:59:56Z

Folows discussion in #26631 to add generic lifting method for any function.

The implementation lifts only over positional arguments. The three implemented methods follow results of benchmarking. In particular for a single positional argument it is faster not to call any function (at least currently). Also in lift(f, x; kw...) it is faster not to call lift(f) but directly perform the test. Some limitation is that in lift(f) approach we are not able to differentiate between single or multiple positional argument cases. Therefore lift(f, x; kw...) style will be faster for single positional argument case.

CC @nalimilan

martinholters · 2018-03-30T16:40:02Z

I find the term lift a bit too generic to be limited to missing. Not that I have good ideas for alternatives, though...

StefanKarpinski · 2018-03-30T18:05:30Z

How about calling it mift, short for "missing lift"?

Sacha0 · 2018-03-30T18:08:31Z

mift, the little-known alternative spelling of miffed, commonly used when irked by values missing from your dataset.

mbauman · 2018-03-30T18:26:01Z

I'm a little surprised that there's a perf difference there — typically those tuple functions are highly optimized and inline like crazy. You could work around the perf difference between lift(f, x) and lift(f)(x) with a function object:

struct Lifted{F} <: Function end
lift(f) = Lifted{f}()
(::Lifted{f})(x; kws...) where {f} = ismissing(x) ? missing : f(x; kws...)
(::Lifted{f})(xs...; kws...) where {f} = any(ismissing, xs) ? missing : f(xs...; kws...)

We could then always use the lift(f)(xs...) call syntax, and define lift(f, sentinel=missing) to store a sentinel value in the struct, making this extensible.

Just a spitball.

bkamins · 2018-03-30T19:13:28Z

@mbauman Nice, This is almost as fast as lift(f, x...; kw...) (around 20ns overhead) and other options are significantly slower. Also given your approach I agree that lift(f, x...; kw...) can be removed (although I liked its convenience).

Additionally after some analysis I have second thoughts about kw keyword arguments, as we cannot control if they are valid if x is missing, eg. lift(uppercase)(missing, foo="bar"). So maybe it is safer to remove this support (user can always use anonymous function to fix keyword argument).

So the approach could be something like (if I understand your idea correctly - I am extending the sentinel concept to also specify return value separately):

struct Lifted{F,S,V} <: Function end
lift(f, sentinel=missing, value=missing) = Lifted{f, sentinel, value}()
(::Lifted{F,S,V})(x) where {F,S,V} = isequal(x, S) ? V : F(x)
(::Lifted{F,S,V})(xs...) where {F,S,V} = any(x->isequal(x, S), xs) ? V : F(xs...)

which would give maximum flexibility.

nalimilan · 2018-03-31T13:13:14Z

Additionally after some analysis I have second thoughts about kw keyword arguments, as we cannot control if they are valid if x is missing, eg. lift(uppercase)(missing, foo="bar"). So maybe it is safer to remove this support (user can always use anonymous function to fix keyword argument).

In the same vein, it would be weird that lift(f)(missing, missing) would work when there is no two-argument method for f. It's probably not a serious problem in practice, since people will get an error when non-missing arguments are passed, but it's not totally satisfying either. Maybe all that is needed is to call Core.Compiler.return_type with Tuple{Any,Any} (for two arguments) and throw an error if it returns Union{}?

Regarding the extension to other sentinels than missing, it would make sense to consider coalesce at the same time. coalesce gets rid of both missing and nothing, so maybe we should do the same here by default, i.e. return either missing or nothing if one of the inputs happens to be equal to one of these?

bkamins · 2018-03-31T21:23:52Z

Using Core.Compiler.return_type should be OK (it has low overhead).
But keyword arguments will not be handled this way - right?

As for coalesce - then if we wanted to be consistent with it we should unwrap Some.

In general in the design of lift (or some other better name) I would prefer speed and simplicity over flexibility. The reason is that we need this function to provide Missing a proper lifting as this is an expected functionality and those will be 99% of use-cases. Not sure (but the difference is that coalesce can be efficient thanks to dispatch and here we will have to do testing inside a function).

nalimilan · 2018-03-31T22:07:05Z

Using Core.Compiler.return_type should be OK (it has low overhead).
But keyword arguments will not be handled this way - right?
Right. I was just thinking about positional arguments. Support for keyword arguments can always be added later (or not) without breaking anything anyway.

As for coalesce - then if we wanted to be consistent with it we should unwrap Some.
Hmm, that's indeed something which would be interesting to support to make it easier to work with Union{Some{T},Nothing}. I guess a lifted function would unwrap Some arguments before passing them?

In general in the design of lift (or some other better name) I would prefer speed and simplicity over flexibility. The reason is that we need this function to provide Missing a proper lifting as this is an expected functionality and those will be 99% of use-cases. Not sure (but the difference is that coalesce can be efficient thanks to dispatch and here we will have to do testing inside a function).

I don't think there's a tradeoff between flexibility and speed, as it would be handled at compile time anyway (that's the case for return_type when the types are known). There is definitely a tradeoff between simplicity and flexibility, though, which is why we provide skipmissing even if that operation can also be achieved with Iterators.filter. Here the issue is that if we claim the term lift, it would be problematic to reserve it for missing when it would be equally useful for nothing. coalesce handles both and it doesn't seem to be too complex, so maybe we can also find a good solution here.

bkamins · 2018-03-31T22:42:03Z

OK - let us try this path and start with the specification:

should lift handle Some like coalesce (actually this could be nice as then we can pass missing or nothing by wrapping it in Some if there is a need to actually pass this value, but actually handling Some was my biggest performance issue with making lift similar to coalesce, but maybe there is none)?
I am not clear what should happen if missing and nothing is mixed in lift - when should missing and when nothing be returned?

quinnj · 2018-04-01T02:12:07Z

Do we have examples of why we need lifting? Is it really that useful? Anybody have a real use-case where they needed lifting? (honestly curious here, not just trying to bat this PR down)

bkamins · 2018-04-01T07:57:15Z

This was the original reason:

using CSV
s = "A\nx\ny\n\nv"
df = CSV.read(IOBuffer(s))
uppercase.(df[:A])

and the last line fails, because column A of df contains missing. Of course we could make uppercase function handle missing but then there will be always functions that are not designed to handle missing.

nalimilan · 2018-04-01T09:18:38Z

Yeah, it's needed for all cases where we don't provide lifted methods (see also #26631). Actually lift could be even more useful for nothing than for missing, especially when using Some, since in that case no methods are defined at all.

bkamins · 2018-04-01T22:09:40Z

@nalimilan This is an implementation working like coalesce. Additionally it supports lift(f, x) form. If this is the direction we agree on I will update the PR:

struct Lifted{F} <: Function end
lift(f) = Lifted{f}()
lift(f, x...) = lift(f)(x...)

function (::Lifted{F})(x::Missing) where {F}
    if Core.Compiler.return_type(F, Tuple{Any}) == Union{}
        throw(ArgumentError("wrong number of function arguments"))
    end
    missing
end

function (::Lifted{F})(x::Nothing) where {F}
    if Core.Compiler.return_type(F, Tuple{Any}) == Union{}
        throw(ArgumentError("wrong number of function arguments"))
    end
    nothing
end

(::Lifted{F})(x::Some) where {F} = F(coalesce(x))
(::Lifted{F})(x) where {F} = F(x)

function (::Lifted{F})(xs...) where {F}
    for x in xs
        if ismissing(x)
            if Core.Compiler.return_type(F, NTuple{length(xs), Any}) == Union{}
                throw(ArgumentError("wrong number of function arguments"))
            end
            return missing
        end
        if isa(x, Nothing)
            if Core.Compiler.return_type(F, NTuple{length(xs), Any}) == Union{}
                throw(ArgumentError("wrong number of function arguments"))
            end
            return nothing
        end
    end
    F(map(coalesce, xs)...)
end

quinnj · 2018-04-01T23:11:27Z

I kind of lean towards thinking we don't really need this. I think it's an easy thing to think we need, but if you really break down use-cases, I really wonder how much convenience this would actually provide. I mean, in @bkamins's example, you could do

uppercase.(collect(skipmissing(df[:A])))

or the map equivalent

map(x->ismissing(x) ? missing : uppercase(x), df[:A])

if you ultimately can just drop the missing entries anyway. Compared the suggested lifting functionality here of

lift(uppercase).(df[:A]))

(assuming that would work correctly w/ broadcasting and everything.

It just seems like the tone of some of the comments around this are like: "oh no, missing is un-useable because every function isn't defined to handle it!!" whereas that's really not the case. Personally, I've been using missing (or null) for close to a year now and haven't ever thought I needed something like this. (and it's not like I never work w/ string data or anything).

Not saying we shouldn't keep proving this out and brainstorming, but just wanted to throw out some of my thoughts.

nalimilan · 2018-04-02T09:54:12Z

@bkamins In principle that looks good, though I'm not sure about performance. I've just realized that hasmethod could be used instead of return_type, but I'm not sure whether it makes a difference. Unfortunately I think it currently involves a check on every call (#16422).

@quinnj I see your point. I guess the advantage would be more compelling if we supported the f?(x) syntax instead of lift(f)(x). In the end that's really just a matter of conciseness. We could add this to Missings.jl for a start and see how useful it turns out to be.

JeffBezanson · 2018-04-02T16:00:36Z

Using return_type for this would not be correct.

pdeffebach · 2018-08-16T12:38:04Z

uppercase.(collect(skipmissing(df[:A])))
map(x->ismissing(x) ? missing : uppercase(x), df[:A])

These two are not equivalent, since the first one gets rid of missing values.

One thing that would be nice in Julia is a view for missing values. Obviously now you can do

x = [1, 2, 3, missing]
t = view(x, .!ismissing.(x))

Now you can work with `t' as though it is an array while

Modifying x directly
Preserving the place of the missing values of x

The problem is that this view is expensive and the syntax is a bit clumsy for what I see as a pretty common operation. If I understand correctly, lift seems like a lazier way of doing this, right? Like the map solution above. That's fine but given R's easy propogation of missings more concise helper functions are appreciated

bkamins · 2018-08-28T09:44:22Z

I have been thinking about going back to defining lift again. My current idea is the following implementation:

struct Lifted{P,V1,V2} <: Function end
lift(predicate, v1, v2) = Lifted{predicate, v1, v2}()

(::Lifted{F,P,V})(xs...;kw...) where {F,P,V} = P(xs...; kw...) ? V(xs...; kw...) : F(xs...; kw...)

_liftmissing_predicate(xs...;kw...) = any(ismissing.(xs)) || any(ismissing.(values(values(kw))))
_liftmissing_value(xs...;kw...) = missing
liftmissing(f) = lift(_liftmissing_pred, _liftmissing_value, f)

The idea is that it is fully flexible lift with liftmissing defined for the most common case.
All three parameters of lift are functions for generality. Also in this case I think we could skip checking number of parameters as I would document lift as something like:

returns a function that applies v1 to its arguments if pred is true on them and otherwise it applies v2.

Maybe function names lift and liftmissing could be changed to something else I am reusing the names to keep consistency with this thread.

nalimilan · 2018-08-29T13:28:56Z

We now have both coalesce and something to replace missing and nothing (respectively) with another value. So it could make sense to also have two different functions for lifting (rather than a single general function).

I'm not sure whether "lift" is the best term, even if it's used in C#, Scala and Haskell. "Propagate" would be more evocative, but it's a bit long. "Carry" could be nice (carrymissing). Or "pass" (passmissing). Of course there's also the eternal question of whether operators like f?(x) could be added.

bkamins · 2018-08-29T17:18:41Z

I am also not fully convinced with lift that is why I have said that I leave function names only for consistency with the earlier thread.

Regarding nothing and missing - this is true. What I wanted to say however, is that actually this part:

struct Lifted{P,V1,V2} <: Function end
lift(predicate, v1, v2) = Lifted{predicate, v1, v2}()

(::Lifted{F,P,V})(xs...;kw...) where {F,P,V} = P(xs...; kw...) ? V(xs...; kw...) : F(xs...; kw...)

is much more general and can have different applications - as it transforms ternary operator into a higher order function. Then handling missing or nothing is only one of possible usages.

Note e.g. that the default implementation for liftmissing (and actually I like passmissing name) is very conservative: if missing is present anywhere a missing is returned. However, there are cases when you want to be sensitive only to some specific argument or arguments and ignore other (e.g. in some function missing could have some meaning that is used in its logic), then you would need this "base" function.

And of course it is easy enough to add default liftnothing (or passnothing) implementation similar to the missing case.

So in summary there are the following questions:

do we want it at all (as some people voiced we do not need it) - I would say we do - when working with tabular data you often need it for column transformations;
do we want to expose this underlying infrastructure transforming ternary operator into a higher order function to users or only use it internally; I would say it would be nice to have it exported; then the question is what would be a proper name (probably native speakers could comment)
do we want to have liftmissing exported - I would say yes and passmissing is a nice name for it;
do we want to have passnothing exported - I am not sure, as one of the key conceptual differences of nothing and missing is that missing should propagate and nothing should not (that is why we disallow printing nothing I guess and also many functions currently use nothing as sentinel in their arguments to indicate absence of some parameter) - but I am also not against it if we feel it would be useful to have it defined by default

Finally I would leave out discussion of f? construct aside as this is a separate topic in my opinion - first we should decide on the functionality.

davidanthoff · 2018-08-30T04:07:29Z

Could this start out in Missings.jl for a while to see how useful it is and then be moved into base/stdlib if it is a success?

bkamins · 2018-08-30T08:17:16Z

OK. I have proposed it here JuliaData/Missings.jl#89. Let us see how it works out.

bkamins · 2020-10-24T10:11:55Z

We have now passmissing for some time in Missings.jl and it passed the test of time I believe. So the question is - do we move the functionality from Missings.jl to Base, or close this issue and just keep it in Missings.jl (I think it is OK to keep it in Missings.jl).

bkamins · 2021-11-13T16:09:07Z

OK - closing this as having passmissing in Missings.jl (what we have now) is enough I think.

add lift function for working with missing values

93f775f

bkamins force-pushed the add_lift branch from 2b329a2 to 93f775f Compare March 30, 2018 13:01

nalimilan added the missing data Base.missing and related functionality label Mar 31, 2018

bkamins mentioned this pull request Apr 28, 2018

Coalesce for mixed nothing/missing: bug and design issue #26927

Closed

bkamins mentioned this pull request Jul 22, 2018

tmapreduce refactor using callables and manual inference & tmap! cleanup mohamed82008/KissThreading.jl#9

Merged

nalimilan mentioned this pull request Aug 15, 2018

Date and DateTime support for Missing is missing #28570

Closed

bkamins mentioned this pull request Aug 30, 2018

Add conditional and passmissing JuliaData/Missings.jl#89

Merged

stevengj mentioned this pull request Jan 11, 2019

convert methods between String and Symbol #30680

Closed

bkamins closed this Nov 13, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add lift function for working with missing values #26661

add lift function for working with missing values #26661

bkamins commented Mar 30, 2018

martinholters commented Mar 30, 2018

StefanKarpinski commented Mar 30, 2018

Sacha0 commented Mar 30, 2018

mbauman commented Mar 30, 2018 •

edited

Loading

bkamins commented Mar 30, 2018

nalimilan commented Mar 31, 2018

bkamins commented Mar 31, 2018

nalimilan commented Mar 31, 2018

bkamins commented Mar 31, 2018

quinnj commented Apr 1, 2018

bkamins commented Apr 1, 2018

nalimilan commented Apr 1, 2018

bkamins commented Apr 1, 2018

quinnj commented Apr 1, 2018

nalimilan commented Apr 2, 2018

JeffBezanson commented Apr 2, 2018

pdeffebach commented Aug 16, 2018 •

edited

Loading

bkamins commented Aug 28, 2018 •

edited

Loading

nalimilan commented Aug 29, 2018

bkamins commented Aug 29, 2018

davidanthoff commented Aug 30, 2018

bkamins commented Aug 30, 2018

bkamins commented Oct 24, 2020

bkamins commented Nov 13, 2021

add lift function for working with missing values #26661

add lift function for working with missing values #26661

Conversation

bkamins commented Mar 30, 2018

martinholters commented Mar 30, 2018

StefanKarpinski commented Mar 30, 2018

Sacha0 commented Mar 30, 2018

mbauman commented Mar 30, 2018 • edited Loading

bkamins commented Mar 30, 2018

nalimilan commented Mar 31, 2018

bkamins commented Mar 31, 2018

nalimilan commented Mar 31, 2018

bkamins commented Mar 31, 2018

quinnj commented Apr 1, 2018

bkamins commented Apr 1, 2018

nalimilan commented Apr 1, 2018

bkamins commented Apr 1, 2018

quinnj commented Apr 1, 2018

nalimilan commented Apr 2, 2018

JeffBezanson commented Apr 2, 2018

pdeffebach commented Aug 16, 2018 • edited Loading

bkamins commented Aug 28, 2018 • edited Loading

nalimilan commented Aug 29, 2018

bkamins commented Aug 29, 2018

davidanthoff commented Aug 30, 2018

bkamins commented Aug 30, 2018

bkamins commented Oct 24, 2020

bkamins commented Nov 13, 2021

mbauman commented Mar 30, 2018 •

edited

Loading

pdeffebach commented Aug 16, 2018 •

edited

Loading

bkamins commented Aug 28, 2018 •

edited

Loading