sort missing data placement #2267

chris-b1 · 2020-05-22T11:57:29Z

Currently, missing is treated as the largest value in a sort. pandas has a na_position kwarg that lets you specify how missing data should be ordered, by default placing it last, regardless of ascending or descending sort.

Whether that default is correct (I like it, but could be familiarity) I do think having an option on sort/order to control missing placement would be nice.

DataFrames 0.21

julia> df = DataFrame(:a => [1.0, 9.0, 2.0, missing])
4×1 DataFrame
│ Row │ a        │
│     │ Float64? │
├─────┼──────────┤
│ 1   │ 1.0      │
│ 2   │ 9.0      │
│ 3   │ 2.0      │
│ 4   │ missing  │

julia> sort(df, :a)
4×1 DataFrame
│ Row │ a        │
│     │ Float64? │
├─────┼──────────┤
│ 1   │ 1.0      │
│ 2   │ 2.0      │
│ 3   │ 9.0      │
│ 4   │ missing  │

julia> sort(df, :a, rev=true)
4×1 DataFrame
│ Row │ a        │
│     │ Float64? │
├─────┼──────────┤
│ 1   │ missing  │
│ 2   │ 9.0      │
│ 3   │ 2.0      │
│ 4   │ 1.0      │

pandas

df = pd.DataFrame({'a': [1.0, 9.0, 2.0, np.NaN]})

df
Out[42]: 
     a
0  1.0
1  9.0
2  2.0
3  NaN

df.sort_values('a')
Out[43]: 
     a
0  1.0
2  2.0
1  9.0
3  NaN

df.sort_values('a', ascending=False)
Out[44]: 
     a
1  9.0
2  2.0
0  1.0
3  NaN

df.sort_values('a', ascending=False, na_position='first')
Out[45]: 
     a
3  NaN
1  9.0
2  2.0
0  1.0

The text was updated successfully, but these errors were encountered:

bkamins · 2020-05-22T11:59:59Z

@nalimilan - what do you think? We could add this, but it is against the design in Base.

chris-b1 · 2020-05-22T13:01:50Z

Got it, yeah not sure how hard this should be special cased. For what it's worth below is an isless comparer that got me to what I needed (sort with rev=true and missing at the end)

isless_missingsmallest(::Missing, ::Any) = true
isless_missingsmallest(::Any, ::Missing) = false
isless_missingsmallest(::Missing, ::Missing) = false
isless_missingsmallest(x, y) = isless(x, y)

julia> sort(df, :a; lt=isless_missingsmallest)
4×1 DataFrame
│ Row │ a        │
│     │ Float64? │
├─────┼──────────┤
│ 1   │ missing  │
│ 2   │ 1.0      │
│ 3   │ 2.0      │
│ 4   │ 9.0      │

julia> sort(df, :a; lt=isless_missingsmallest, rev=true)
4×1 DataFrame
│ Row │ a        │
│     │ Float64? │
├─────┼──────────┤
│ 1   │ 9.0      │
│ 2   │ 2.0      │
│ 3   │ 1.0      │
│ 4   │ missing  │

nalimilan · 2020-05-22T13:47:14Z

Do you have an actual use case for sorting missings first? I wonder whether it's worth adding more complexity to that function. Maybe Missings.jl could provide a custom isless function like the one you showed and groupby could allow passing it.

chris-b1 · 2020-05-22T14:03:44Z

The actual usecase is missings last, because I'm using a descending sort. This would be pretty typical for reporting-type tasks - e.g., "show sales by region, ranked from largest to smallest, with any region missing data at the bottom of the report"

Most of the time could probably work around by filling missings with 0s

nalimilan · 2020-05-23T14:32:19Z

Ah yes it's indeed unfortunate that missing values are sorted first when reversing order. I guess adding a keyword argument could be justified then. We could have missinglast::Bool=!rev. Though you wouldn't be able to set it when passing order as they would conflict.

nivupaiR · 2020-11-25T17:55:47Z

I think this feature is useful. How about an optional keyword in sort such as skipmissing=true?

bkamins · 2020-11-25T18:13:08Z

It sure can be added as an opt-in, with the current behaviour being the default. We just need to wrork out the details of the design.

russmack · 2023-05-15T02:24:14Z

How did this play out?

bkamins · 2023-05-15T06:41:36Z

As you can see it is marked for 1.x release. We did not get additional requests for this feature during the last three years, so it did not get a high priority.

The current solution for this requirement is:

julia> sort(df, rev=true, lt=(x, y) -> ismissing(x) ? true : ismissing(y) ? false : isless(x, y))
4×1 DataFrame
 Row │ a
     │ Float64?
─────┼───────────
   1 │       9.0
   2 │       2.0
   3 │       1.0
   4 │ missing

Essentially what we would need to add is a special variant of isless which considers missing as the smallest not as the largest.
The question is what should be the name of this function?

alonsoC1s · 2023-06-23T09:08:33Z

I think I might be able to implement this. Would we handle this via a keyword argument, or a new variant of isless?

bkamins · 2023-06-25T08:21:54Z

@alonsoC1s - I think adding a function would be better. Also, it probably makes sense to have it in Missings.jl.
The reason is that then it could be used in any sorting/permutation function in the whole ecosystem.

The question is what the name should be. The isless_missingsmallest seems too lengthy.

@nalimilan @LilithHafner - do you have opinions?

LilithHafner · 2023-06-25T17:03:12Z

It would be nice to have this compose nicely with other orderings (e.g. use lt=⊂ and missingsmallest=true). Perhaps missingsmallest could be a higher order function with

missingsmallest(f) = (x,y) -> ismissing(y) ? false : ismissing(x) ? true : f(x, y)

Then the name would be missingsmallest(isless).

If efficiency is important, we should make the missingsmallest option dispatchable, though. Using an opaque lt function rules out most optimizations. This could look like

struct MissingSmallest{O<:Ordering} <: Ordering
    o::O
end

lt(::MissingSmallest, ::Missing, ::Missing) = false
lt(::MissingSmallest, ::Any, ::Missing) = false
lt(::MissingSmallest, ::Missing, ::Any) = true
lt(ms::MissingSmallest, x::Any, y::Any) = lt(ms.o, x, y)

@static if isdefined(Base.Sort, :MissingOptimization) && isdefined(Base.Sort, :_sort!)
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::MissingSmallest, kw)
        # put missing at end
        Base.Sort._sort!(v, a.next, o.o, (kw...; hi=new_hi))
    end
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::ReverseOrdering{<:MissingSmallest}, kw)
        # put missing at beginning
        Base.Sort._sort!(v, a.next, ReverseOrdering(o.fwd.o), (kw...; lo=new_lo))
    end
end

If this is implemented as a new Ordering, that would require a keyword argument or users creating their own Orderings.

The users creating their own orderings approach can be implemented in Missings.jl

bkamins · 2023-06-25T17:13:22Z

Thank you for your nice ideas 😄.

If efficiency is important, we should make the missingsmallest option dispatchable, though.

I think efficiency is important.

Why do you think your approach with missingsmallest(f) would not be efficient? Because it introduces extra branching?

An approach with Ordering subtype indeed makes sense as an alternative. My questions would be:

why do you think it would be faster?
I would be a bit afraid that using Ordering would be hard for users to learn (but maybe I am wrong here; it just seems to me that using the missingsmallest(f) wrapper is simpler to grasp for a typical user)

LilithHafner · 2023-06-25T18:02:29Z

why do you think it would be faster?

With an opaque lt ordering (e.g. missingsmallest(f)), we can get a plenty fast comparison based sort, and I imagine a non-adaptive comparison based sorting algorithm would perform identically on lt=missingsmallest(isless) and lt=isless but the sneaky tricks to optimize all the special cases don't apply. EDIT: yes they do, see @bkamins's comment.

For example, sorting a Vector{Union{Missing, Int64}} in Forward order typically shifts the missing values to the end so that the remaining sorting is faster (either branch free or simply very well predicted branches, I haven't checked) and in this case we can also use countsort or radixsort to sort the remaining values which are all Int64. ~~With missingsmallest(f) we'd be stuck with vanilla ScratchQuickSort or InsertionSort.~~

I would be a bit afraid that using Ordering would be hard for users to learn (but maybe I am wrong here; it just seems to me that using the missingsmallest(f) wrapper is simpler to grasp for a typical user)

Yeah. Usability-wise, I think sort!(df, missingssmallest=true) >> sort!(df, lt=missingssmallest(isless) ≈> sort(df, ord=MissingSmallest(Forward)). However, if we don't add missingsmallest to base, then sort!(v::Vector, missingsmallest=true) will not be an option. I think it would be good to get an idea of weather or not folks want to support this option in Base before finalizing the design here.

bkamins · 2023-06-25T18:46:41Z

For example, sorting a Vector{Union{Missing, Int64}} in Forward order typically shifts the missing values to the end so that the remaining sorting is faster

We could add special methods that would detect that missingsmallest(f) is passed that would do the same tricks that we already do but just put missing in another place.

I think it would be good to get an idea of weather or not folks want to support this option in Base before finalizing the design here.

I have pinged #internals on Slack. Let us see if someone has an opinion.

I think that the preference will be not to add this extra kwarg. The issue is that, in general, it could conflict with lt or by passed by user (that could treat missing in some different non-standard way). I think passing lt=missingssmallest(isless) is more composable.

nalimilan · 2023-06-25T20:35:12Z

See also previous discussion at JuliaData/Missings.jl#142. There I had proposed lt=missingisless. lt=missingsmallest(isless) sounds a bit too complex for casual users, and "smallest" isn't really the vocabulary we use regarding sorting. Though maybe we could have both something like missingisless for convenience (possibly in Missings.jl) and something lower-level but more flexible and on which algorithms would dispatch.

LilithHafner · 2023-06-25T21:45:36Z

We could add special methods that would detect that missingsmallest(f) is passed that would do the same tricks that we already do but just put missing in another place.

Yep! Good idea, you're right. There should be no performance loss from using the lt approach.

lt=missingsmallest(isless) seems like the soundest approach internally (up to renaming), but I don't see a problem with an alias. I like that Reverse === ReverseOrdering(ForwardOrdering()).

Bikeshed: we could call the higher order function missingsfirst

jariji · 2023-06-25T22:45:06Z

Current feeling - missingsfirst seems more intuitive than missingsless under rev = true and equally intuitive under rev = false.

LilithHafner · 2023-06-25T22:57:13Z

welll... I was thinking that sort!(lt=missingsfirst(isless), rev=true) would put missing values at the end, which is the opposite of intuitive imo. We can't really have sort!(lt=X, rev=true) and sort!(lt=X, rev=false) put missing values in the same place because rev is applied by swapping the arguments to lt, so having the term first in X is probably a bad idea in retrospect. (unless someone can think of a way of implementing this that lets us have sort!(lt=missingsfirst(isless), rev=true) do the obvious thing of putting missing first and sorting everything else in reverse order)

bkamins · 2023-06-26T06:15:01Z

I was thinking that sort!(lt=missingsfirst(isless), rev=true) would put missing values at the end

I agree. I feel we need lt to define the standard order and rev=true to reverse it entirely.
As opposed to SQL I think in Julia we do not treat missing in a special way in general (except for selected packages, but this functionality should be general).

So the conclusion of what is to do would be to have in Missings.jl:

Define missingssmallest(fun)
Define const missingsless = missingssmallest(isless)
Add documentation and make a release
Add special implementations of sorting algorithms if they encounter lt=missingsless (this can be a separate PR from the PR implementing the functionality)

The only thing is what names should we use (and this is usually hard).

Are there better alternatives than missingssmallest and missingless (as commented missingsfirst is unfortunately not good as we are specifying order not location)?

jariji · 2023-06-26T06:23:00Z

missingsleast would be more suitable than missingssmallest. missingless is fine too, its only downside being that it could be interpreted as meaning "without missings".

nalimilan · 2023-06-26T08:04:48Z

Note that my suggestion was missingisless, not missingsless. :-)

Something that worries me if we support both lt=missingisless and lt=missingsmallest(isless) is that the two names are quite similar and people who don't get the subtlety of isless vs "smallest" could confuse them. Maybe we could merge them? With some tricks it should be possible to write lt=missingsmallest as a shorthand for lt=missingsmallest(isless).

Instead of "smallest", we could also say "lowest", as it's more consistent with "lower than"/"greater than".

jariji · 2023-06-26T17:57:43Z

Instead of "smallest", we could also say "lowest", as it's more consistent with "lower than"/"greater than".

"least" goes with "less".

alonsoC1s · 2023-06-28T13:43:30Z

I feel like this feature could be well recieved in the wider ecosystem, it might be worth it to propose it upstream (Missings.jl or perhaps even Base itself)

bkamins · 2023-06-28T15:07:44Z

@alonsoC1s - yes. This should be added in Missings.jl. I just keep the discussion here to have all things in one place. I understand you offered to implement it. If this is the case please feel free to open a PR in Missings.jl adding what is agreed. We still have to settle on the names, but these can be easily updated when we have the PR opened (and it would help to move the things forward). Thank you!

bkamins · 2023-07-15T07:14:45Z

@alonsoC1s - are you planning to make a PR fo Missings.jl with this? (if not I can make it to move things forward)

alonsoC1s · 2023-07-15T14:53:22Z

Sorry, it totally slipped my mind. I am planning on it, it will be done in a few days at most

alonsoC1s · 2023-07-19T21:02:41Z

I just opened PR JuliaData/Missings.jl#144 at Missings.jl as a draft PR to work out the naming. I followed the roadmap set by @bkamins (i.e implementing the partial order function solution).

My two cents on the naming: The missingsless function (isless but modified so missing is the smallest value) can be a bit hard to undestand and for some reason I found myself making typos even while writing simple tests and things like that. Perhaps something like missingslast would be intuitive?

bkamins · 2023-07-19T21:28:00Z

Thank you. So I close this issue, and we can move the discussion to Missings.jl

nalimilan · 2023-07-24T09:53:48Z

Let's continue the discussion on naming given that it has already been developed here. I don't really like missingsless as it could mean "missings less", i.e. without missing values, and it uses the plural while missingsmallest uses the singular. Why not use missingisless, which I find clearer?

My two cents on the naming: The missingsless function (isless but modified so missing is the smallest value) can be a bit hard to undestand and for some reason I found myself making typos even while writing simple tests and things like that. Perhaps something like missingslast would be intuitive?

I guess this criticism applies to missingisless as well? missingslast is problematic as with sort(x, lt=missingslast), they would end up first instead.

@jariji proposed missingsleast. Do you think that's better? I'm not convinced it would be easier to remember or type for users. Among the options there's also missinglowest.

bkamins · 2023-07-25T13:33:38Z

My ranking of preference missinglowest > missingleast > missingsisless. The reasoning is based on what I would find easiest to remember and type.

Now another idea would be to define:

missingsmallest(f=isless) = (x, y) -> ismissing(y) ? false : ismissing(x) ? true : f(x, y)

Then we would introduce just one name missingsmallest and user could call it by=missingsmallest() if by=missingsmallest(isless) is wanted.

nalimilan · 2023-07-28T20:29:04Z

Then we would introduce just one name missingsmallest and user could call it by=missingsmallest() if by=missingsmallest(isless) is wanted.

You mean lt=missingsmallest(), right?

Maybe we could even allow lt=missingsmallest (without parentheses), by having missingsmallest(x, y; lt=isless) = ismissing(y) ? false : ismissing(x) ? true : lt(x, y) and missingsmallest(; lt=isless) = (x, y) -> missingsmallest(x, y; lt=lt). Not sure whether it would be confusing.

bkamins · 2023-07-28T23:54:28Z

You mean lt=missingsmallest(), right?

Yes - this is what I meant. Sorry.

Maybe we could even allow lt=missingsmallest

We could but I think that then a simpler definition could be:

missingsmallest(x, y) = ismissing(y) ? false : ismissing(x) ? true : isless(x, y)
missingsmallest(lt) = (x, y) -> ismissing(y) ? false : ismissing(x) ? true : lt(x, y)

I am not sure which syntax would be better. lt=missinsmallest is appealing for consistency with lt=isless, but lt=missinsmallest() has a simpler design. What do other people involved think?

bkamins · 2023-09-03T11:04:38Z

Any more comments on these options? (so that we can have some decision and move forward with the implementation)

LilithHafner · 2023-09-03T14:30:55Z

I agree with your latest API proposal with missingsmallest as an "alias" for missingsmallest(isless). The API would be:

missingsmallest(lt)(x,y) === ismissing(y) ? false : ismissing(x) ? true : lt(x, y)
missingsmallest(x,y) === missingsmallest(isless)(x,y)

I support supporting bare lt=missingsmallest because I think using missingsmallest in conjunction with custom lt will be uncommon.

A simple implementation, if we don't want to dispatch to specialized algorithms is

missingsmallest(lt) = (x, y) -> ismissing(y) ? false : ismissing(x) ? true : lt(x, y)
missingsmallest(x, y) = missingsmallest(isless)(x, y)

A clever (too clever) for implementation that allows missingsmallest(isless) === missingsmallest and supports optimizations is

# this struct is internal and exists because we can't dispatch on anonymous functions.
struct MissingSmallest{T}
    lt::T
end
const missingsmallest = MissingSmallest(lt)
(ms::MissingSmallest)(x, y) = ismissing(y) ? false : ismissing(x) ? true : ms.lt(x, y)
(::MissingSmallest{::typeof(isless)})(lt) = MissingSmallest(lt)

# Interesting properties

@test missingsmallest === missingsmallest(isless)
@test missingsmallest === missingsmallest(isless)(isless)

# Optimizations

# TODO: upstream `_Lt` into Base.Order
_Lt(::typeof(isless)) = Forward
_Lt(lt) = Lt(lt)

@static if isdefined(Base.Sort, :MissingOptimization) && isdefined(Base.Sort, :_sort!)
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::Lt{<:MissingSmallest}, kw)
        # put missing at beginning
        Base.Sort._sort!(v, a.next, _Lt(o.lt.lt), (kw...; hi=new_hi))
    end
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::ReverseOrdering{<:Lt{<:MissingSmallest}}, kw)
        # put missing at end
        Base.Sort._sort!(v, a.next, ReverseOrdering(_Lt(o.fwd.lt.lt)), (kw...; lo=new_lo))
    end
end

Another, less clever but probably better, implementation is

# Implementation

# this struct is internal and exists because we can't dispatch on anonymous functions.
struct MissingSmallest{T}
    lt::T
end
missingsmallest(lt) = MissingSmallest(lt)
missingsmallest(x, y) = missingsmallest(lt)(x, y)
(ms::MissingSmallest)(x, y) = ismissing(y) ? false : ismissing(x) ? true : ms.lt(x, y)

# Interesting properties

@test_throws MethodError missingsmallest(isless)(isless)
@test missingsmallest !== missingsmallest(isless) # sad but necessary

# Optimizations

# TODO: upstream `_Lt` into Base.Order
_Lt(::typeof(isless)) = Forward
_Lt(lt) = Lt(lt)

withoutmissingordering(::typeof(missingsmallest)) = Forward
withoutmissingordering(ms::MissingSmallest) = _Lt(ms)
const _MissingSmallest = Union{MissingSmallest, typeof(missingsmallest}}

@static if isdefined(Base.Sort, :MissingOptimization) && isdefined(Base.Sort, :_sort!)
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::Lt{<:_MissingSmallest}, kw)
        # put missing at beginning
        Base.Sort._sort!(v, a.next, withoutmissingordering(o), (kw...; hi=new_hi))
    end
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::ReverseOrdering{Lt{<:_MissingSmallest}}, kw)
        # put missing at end
        Base.Sort._sort!(v, a.next, ReverseOrdering(withoutmissingordering(o.fwd)), (kw...; lo=new_lo))
    end
end

nalimilan · 2023-09-03T15:04:52Z

I'd be fine with either the simple implementation or the less clever one. Optimizations can be added later if we feel the need for them.

bkamins added decision feature labels May 22, 2020

bkamins added this to the 1.x milestone Nov 25, 2020

bkamins modified the milestones: 1.x, 1.6 May 15, 2023

LilithHafner mentioned this issue Jun 25, 2023

Should we add a keyword to sort missing values as small sort!(v, missingssmallest=true)? JuliaLang/julia#50288

Closed

bkamins modified the milestones: 1.6, 1.7 Jul 10, 2023

alonsoC1s mentioned this issue Jul 19, 2023

Implementing order functions where missings is the smallest value JuliaData/Missings.jl#144

Merged

bkamins closed this as completed Jul 19, 2023

sort missing data placement #2267

sort missing data placement #2267

Comments

chris-b1 commented May 22, 2020

bkamins commented May 22, 2020

chris-b1 commented May 22, 2020

nalimilan commented May 22, 2020

chris-b1 commented May 22, 2020

nalimilan commented May 23, 2020

nivupaiR commented Nov 25, 2020

bkamins commented Nov 25, 2020

russmack commented May 15, 2023

bkamins commented May 15, 2023

alonsoC1s commented Jun 23, 2023

bkamins commented Jun 25, 2023

LilithHafner commented Jun 25, 2023

bkamins commented Jun 25, 2023

LilithHafner commented Jun 25, 2023 • edited Loading

bkamins commented Jun 25, 2023 • edited Loading

nalimilan commented Jun 25, 2023

LilithHafner commented Jun 25, 2023

jariji commented Jun 25, 2023

LilithHafner commented Jun 25, 2023

bkamins commented Jun 26, 2023

jariji commented Jun 26, 2023

nalimilan commented Jun 26, 2023

jariji commented Jun 26, 2023

alonsoC1s commented Jun 28, 2023

bkamins commented Jun 28, 2023

bkamins commented Jul 15, 2023

alonsoC1s commented Jul 15, 2023

alonsoC1s commented Jul 19, 2023

bkamins commented Jul 19, 2023

nalimilan commented Jul 24, 2023

bkamins commented Jul 25, 2023

nalimilan commented Jul 28, 2023

bkamins commented Jul 28, 2023

bkamins commented Sep 3, 2023

LilithHafner commented Sep 3, 2023 • edited Loading

nalimilan commented Sep 3, 2023

LilithHafner commented Jun 25, 2023 •

edited

Loading

bkamins commented Jun 25, 2023 •

edited

Loading

LilithHafner commented Sep 3, 2023 •

edited

Loading