sort missing data placement #2267
@nalimilan - what do you think? We could add this, but it is against the design in Base. |
Got it, yeah not sure how hard this should be special cased. For what it's worth below is an isless comparer that got me to what I needed (sort with
isless_missingsmallest(::Missing, ::Any) = true
isless_missingsmallest(::Any, ::Missing) = false
isless_missingsmallest(::Missing, ::Missing) = false
isless_missingsmallest(x, y) = isless(x, y)
julia> sort(df, :a; lt=isless_missingsmallest)
4×1 DataFrame
│ Row │ a │
│ │ Float64? │
├─────┼──────────┤
│ 1 │ missing │
│ 2 │ 1.0 │
│ 3 │ 2.0 │
│ 4 │ 9.0 │
julia> sort(df, :a; lt=isless_missingsmallest, rev=true)
4×1 DataFrame
│ Row │ a │
│ │ Float64? │
├─────┼──────────┤
│ 1 │ 9.0 │
│ 2 │ 2.0 │
│ 3 │ 1.0 │
│ 4 │ missing │ |
Do you have an actual use case for sorting missings first? I wonder whether it's worth adding more complexity to that function. Maybe Missings.jl could provide a custom |
The actual use case is missings last, because I'm using a descending sort. This would be pretty typical for reporting-type tasks, e.g., "show sales by region, ranked from largest to smallest, with any region missing data at the bottom of the report". Most of the time I could probably work around it by filling missings with 0s. |
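To make the reporting use case concrete, here is a minimal sketch on a plain vector rather than a DataFrame. It reuses the `isless_missingsmallest` comparer posted earlier in the thread; the `sales` values are made up for illustration.

```julia
# Comparer from the earlier comment: treats `missing` as smaller than any other value.
isless_missingsmallest(::Missing, ::Any) = true
isless_missingsmallest(::Any, ::Missing) = false
isless_missingsmallest(::Missing, ::Missing) = false
isless_missingsmallest(x, y) = isless(x, y)

# Hypothetical sales figures; one region has no data.
sales = [2.0, missing, 9.0, 1.0]

# Default descending sort: `missing` is the largest value, so it comes first.
default_desc = sort(sales; rev=true)

# Descending sort with the custom comparer: `missing` ends up at the bottom.
report_desc = sort(sales; lt=isless_missingsmallest, rev=true)
```

With the default `isless`, reversing the order moves `missing` to the top of the report, which is the behaviour this comment is asking to avoid.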
Ah yes it's indeed unfortunate that missing values are sorted first when reversing order. I guess adding a keyword argument could be justified then. We could have |
I think this feature is useful. How about an optional keyword in sort such as skipmissing=true? |
It sure can be added as an opt-in, with the current behaviour being the default. We just need to work out the details of the design. |
How did this play out? |
As you can see it is marked for 1.x release. We did not get additional requests for this feature during the last three years, so it did not get a high priority. The current solution for this requirement is:
Essentially what we would need to add is a special variant of |
I think I might be able to implement this. Would we handle this via a keyword argument, or a new variant of |
@alonsoC1s - I think adding a function would be better. Also, it probably makes sense to have it in Missings.jl. The question is what the name should be. The @nalimilan @LilithHafner - do you have opinions? |
It would be nice to have this compose nicely with other orderings (e.g. use
missingsmallest(f) = (x,y) -> ismissing(y) ? false : ismissing(x) ? true : f(x, y)
Then the name would be If efficiency is important, we should make the missingsmallest option dispatchable, though. Using an opaque
import Base.Order: Ordering, ReverseOrdering, lt
struct MissingSmallest{O<:Ordering} <: Ordering
    o::O
end
lt(::MissingSmallest, ::Missing, ::Missing) = false
lt(::MissingSmallest, ::Any, ::Missing) = false
lt(::MissingSmallest, ::Missing, ::Any) = true
lt(ms::MissingSmallest, x::Any, y::Any) = lt(ms.o, x, y)
@static if isdefined(Base.Sort, :MissingOptimization) && isdefined(Base.Sort, :_sort!)
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::MissingSmallest, kw)
        # put missing at beginning; new_lo (first non-missing index) is a placeholder, its computation is elided
        Base.Sort._sort!(v, a.next, o.o, (; kw..., lo=new_lo))
    end
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::ReverseOrdering{<:MissingSmallest}, kw)
        # put missing at end; new_hi (last non-missing index) is likewise a placeholder
        Base.Sort._sort!(v, a.next, ReverseOrdering(o.fwd.o), (; kw..., hi=new_hi))
    end
end
If this is implemented as a new The users creating their own orderings approach can be implemented in Missings.jl |
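As a rough illustration of the dispatchable variant, the sketch below defines a wrapper ordering and passes it through the `order` keyword of `sort`. It leaves out the `Base.Sort._sort!` fast paths entirely, so it only shows the semantics, not the performance argument. The `MissingSmallest` type is the hypothetical struct from this comment, defined locally; it is not an existing API.

```julia
import Base.Order: Ordering, Forward, By, lt

# Hypothetical wrapper ordering, mirroring the struct sketched in the thread.
struct MissingSmallest{O<:Ordering} <: Ordering
    o::O
end

# `missing` compares as smaller than everything else; other pairs defer to the wrapped ordering.
lt(::MissingSmallest, ::Missing, ::Missing) = false
lt(::MissingSmallest, ::Any, ::Missing) = false
lt(::MissingSmallest, ::Missing, ::Any) = true
lt(ms::MissingSmallest, x, y) = lt(ms.o, x, y)

v = [-3.0, missing, 1.0, 2.0]

ascending  = sort(v; order=MissingSmallest(Forward))            # missing first
descending = sort(v; order=MissingSmallest(Forward), rev=true)  # missing last
by_abs     = sort(v; order=MissingSmallest(By(abs)))            # composes with By
```

Because the wrapper is a concrete `Ordering` subtype, specialised methods could later dispatch on it, which is the point made above about opaque anonymous functions.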
Thank you for your nice ideas 😄.
I think efficiency is important. Why do you think your approach with An approach with
|
With an opaque For example, sorting a
Yeah. Usability-wise, I think |
We could add special methods that would detect that
I have pinged #internals on Slack. Let us see if someone has an opinion. I think that the preference will be not to add this extra kwarg. The issue is that, in general, it could conflict with |
See also previous discussion at JuliaData/Missings.jl#142. There I had proposed |
Yep! Good idea, you're right. There should be no performance loss from using the
Bikeshed: we could call the higher order function |
Current feeling - |
Well... I was thinking that |
I agree. I feel we need So the conclusion on what to do would be to have in Missings.jl:
The only open question is which names we should use (and this is usually hard). Are there better alternatives than |
|
Note that my suggestion was Something that worries me if we support both Instead of "smallest", we could also say "lowest", as it's more consistent with "lower than"/"greater than". |
"least" goes with "less". |
I feel like this feature could be well received in the wider ecosystem; it might be worth proposing it upstream (Missings.jl or perhaps even Base itself). |
@alonsoC1s - yes. This should be added in Missings.jl. I just keep the discussion here to have all things in one place. I understand you offered to implement it. If this is the case, please feel free to open a PR in Missings.jl adding what is agreed. We still have to settle on the names, but these can be easily updated once the PR is open (and it would help to move things forward). Thank you! |
@alonsoC1s - are you planning to make a PR for Missings.jl with this? (If not, I can make it to move things forward.) |
Sorry, it totally slipped my mind. I am planning on it; it will be done in a few days at most. |
I just opened PR JuliaData/Missings.jl#144 at Missings.jl as a draft PR to work out the naming. I followed the roadmap set by @bkamins (i.e., implementing the partial order function solution). My two cents on the naming: The |
Thank you. So I am closing this issue, and we can move the discussion to Missings.jl. |
Let's continue the discussion on naming given that it has already been developed here. I don't really like
I guess this criticism applies to @jariji proposed |
My ranking of preference Now another idea would be to define:
Then we would introduce just one name |
You mean Maybe we could even allow |
Yes - this is what I meant. Sorry.
We could but I think that then a simpler definition could be:
I am not sure which syntax would be better. |
Any more comments on these options? (so that we can have some decision and move forward with the implementation) |
I agree with your latest API proposal with
missingsmallest(lt)(x,y) === ismissing(y) ? false : ismissing(x) ? true : lt(x, y)
missingsmallest(x,y) === missingsmallest(isless)(x,y)
I support supporting bare A simple implementation, if we don't want to dispatch to specialized algorithms, is
missingsmallest(lt) = (x, y) -> ismissing(y) ? false : ismissing(x) ? true : lt(x, y)
missingsmallest(x, y) = missingsmallest(isless)(x, y)
A clever (too clever) implementation that allows
# this struct is internal and exists because we can't dispatch on anonymous functions.
struct MissingSmallest{T}
    lt::T
end
const missingsmallest = MissingSmallest(isless)
(ms::MissingSmallest)(x, y) = ismissing(y) ? false : ismissing(x) ? true : ms.lt(x, y)
(::MissingSmallest{typeof(isless)})(lt) = MissingSmallest(lt)
# Interesting properties
@test missingsmallest === missingsmallest(isless)
@test missingsmallest === missingsmallest(isless)(isless)
# Optimizations
# TODO: upstream `_Lt` into Base.Order
_Lt(::typeof(isless)) = Forward
_Lt(lt) = Lt(lt)
@static if isdefined(Base.Sort, :MissingOptimization) && isdefined(Base.Sort, :_sort!)
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::Lt{<:MissingSmallest}, kw)
        # put missing at beginning (new_lo is a placeholder; its computation is elided)
        Base.Sort._sort!(v, a.next, _Lt(o.lt.lt), (; kw..., lo=new_lo))
    end
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::ReverseOrdering{<:Lt{<:MissingSmallest}}, kw)
        # put missing at end (new_hi is a placeholder; its computation is elided)
        Base.Sort._sort!(v, a.next, ReverseOrdering(_Lt(o.fwd.lt.lt)), (; kw..., hi=new_hi))
    end
end
Another, less clever but probably better, implementation is
# Implementation
# this struct is internal and exists because we can't dispatch on anonymous functions.
struct MissingSmallest{T}
    lt::T
end
missingsmallest(lt) = MissingSmallest(lt)
missingsmallest(x, y) = missingsmallest(isless)(x, y)
(ms::MissingSmallest)(x, y) = ismissing(y) ? false : ismissing(x) ? true : ms.lt(x, y)
# Interesting properties
@test_throws MethodError missingsmallest(isless)(isless)
@test missingsmallest !== missingsmallest(isless) # sad but necessary
# Optimizations
# TODO: upstream `_Lt` into Base.Order
_Lt(::typeof(isless)) = Forward
_Lt(lt) = Lt(lt)
withoutmissingordering(::typeof(missingsmallest)) = Forward
withoutmissingordering(ms::MissingSmallest) = _Lt(ms.lt)
const _MissingSmallest = Union{MissingSmallest, typeof(missingsmallest)}
@static if isdefined(Base.Sort, :MissingOptimization) && isdefined(Base.Sort, :_sort!)
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::Lt{<:_MissingSmallest}, kw)
        # put missing at beginning (new_lo is a placeholder; its computation is elided)
        Base.Sort._sort!(v, a.next, withoutmissingordering(o.lt), (; kw..., lo=new_lo))
    end
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::ReverseOrdering{<:Lt{<:_MissingSmallest}}, kw)
        # put missing at end (new_hi is a placeholder; its computation is elided)
        Base.Sort._sort!(v, a.next, ReverseOrdering(withoutmissingordering(o.fwd.lt)), (; kw..., hi=new_hi))
    end
end |
I'd be fine with either the simple implementation or the less clever one. Optimizations can be added later if we feel the need for them. |
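For reference, the "simple implementation" from the comment above can be tried as written on plain vectors. This is only a sketch: `missingsmallest` is defined locally here under the name proposed in the thread, and the sample data is invented.

```julia
# One-argument form: wrap a custom `lt` so that `missing` compares as smallest.
missingsmallest(lt) = (x, y) -> ismissing(y) ? false : ismissing(x) ? true : lt(x, y)
# Two-argument form: the bare function behaves like `missingsmallest(isless)`.
missingsmallest(x, y) = missingsmallest(isless)(x, y)

v = [2, missing, 9, 1]

asc  = sort(v; lt=missingsmallest)            # missing first, then ascending
desc = sort(v; lt=missingsmallest, rev=true)  # descending, missing last

# Composing with a custom comparison: order strings by length, missing first.
words = ["ccc", missing, "a"]
by_len = sort(words; lt=missingsmallest((a, b) -> length(a) < length(b)))
```

Both call forms go through the same closure, so the bare `lt=missingsmallest` and the explicit `lt=missingsmallest(isless)` give identical results; the struct-based variants only matter if specialised sorting methods are added later.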
Currently, missing is treated as the largest value in a sort. pandas has a na_position kwarg that lets you specify how missing data should be ordered, by default placing it last, regardless of ascending or descending sort.
Whether that default is correct (I like it, but could be familiarity), I do think having an option on sort/order to control missing placement would be nice.
DataFrames
0.21
pandas