Slow sorts in columns with Union{<:Any, missing} even if no missing values in the column #2745

clintonTE · 2021-05-02T19:08:02Z

Love the work done on this package. I had to do some data work in Pandas the other day, and ... nothing against Pandas, it has its place, but it really reminded me how great the DataFrames package has become.

Anyway, I was wondering if this is a bug, or whether there is an easy way to improve the performance of this (common, at least for me) use case. I know it's not really a fair comparison, since I'm exploiting prior knowledge about the column contents, but a 10x performance penalty seems steep.

```julia
using DataFrames, BenchmarkTools, Dates, Random, StatsBase

function mwedates()
  # build the sample: 100 distinct dates x 10^4 ids = 10^6 rows
  dts = reduce(vcat, [[Date(2011,11,11) + Day(i) for j in 1:10^4] for i in 1:100])
  # same values, but the element type allows missing
  mdts = dts |> Vector{Union{Date, Missing}}
  id = reduce(vcat, [[j for j in 1:10^4] for i in 1:100])
  df = DataFrame(date=dts, mdate = mdts, id=id)

  # shuffle the rows (randperm is defined in Random)
  df = df[randperm(10^6), :]

  print("sort date and id: ")
  @btime sort($df, [:date, :id])
  print("sort date(with missings) and id: ")
  @btime sort($df, [:mdate, :id])
  print("work around performance: ")
  @btime begin
    # convert back to a plain Vector{Date} before sorting
    $df.mdateconverted = $df.mdate |> Vector{Date}
    sort($df, [:mdateconverted, :id])
  end
end

mwedates()
```

Output:

sort date and id:   291.772 ms (65029 allocations: 92.79 MiB)
sort date(with missings) and id:   2.883 s (105417916 allocations: 1.66 GiB)
work around performance:   298.849 ms (65042 allocations: 108.05 MiB)
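
For reference, a minimal sketch of a guarded version of the workaround above, assuming the column really holds no missing values. `disallowmissing!` is an existing DataFrames function; the three-row table is made up purely for illustration and is not the benchmark data.

```julia
using DataFrames, Dates

# Illustrative stand-in: the element type allows missing,
# but no value is actually missing.
df = DataFrame(
    mdate = Union{Date, Missing}[Date(2011, 11, 13), Date(2011, 11, 11), Date(2011, 11, 12)],
    id = [3, 1, 2],
)

# Narrow the element type only when it is safe to do so;
# disallowmissing! would throw if a missing value were present.
if !any(ismissing, df.mdate)
    disallowmissing!(df, :mdate)
end

# The sort now runs on a plain Vector{Date} column.
sorted = sort(df, [:mdate, :id])
```

Unlike the `|> Vector{Date}` conversion, this narrows the column in place instead of adding a second column, and the `ismissing` check keeps it from erroring on data that does contain missing values.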

bkamins · 2021-05-02T19:44:40Z

Fixed in #2746

clintonTE · 2021-05-31T20:53:52Z

Thank you!

bkamins · 2021-05-31T20:55:08Z

Some tests of main branch would be appreciated :).

clintonTE · 2021-06-18T03:59:50Z

Ah, I forgot to check back. Main is working great. My new timings:

sort date and id:   213.129 ms (65037 allocations: 92.79 MiB)
sort date(with missings) and id:   255.404 ms (65036 allocations: 92.79 MiB)
work around performance:   209.793 ms (65050 allocations: 108.05 MiB)

Love the 30% improvement in the baseline on top of the 10x improvement in the test case.

bkamins · 2021-06-18T11:47:52Z

Thank you!

bkamins mentioned this issue May 2, 2021

Fix type instability in sort for few columns case and fix issorted bug #2746

Merged

bkamins added the performance label May 2, 2021

bkamins added this to the patch milestone May 2, 2021

bkamins closed this as completed in #2746 May 30, 2021
