Slow sorts in columns with Union{<:Any, missing} even if no missing values in the column #2745

Closed
clintonTE opened this issue May 2, 2021 · 5 comments · Fixed by #2746
@clintonTE

Love the work done on this package. I had to do some data work in Pandas the other day, and ... nothing against Pandas, it has its place, but it really reminded me how great the DataFrames package has become.

Anyway, I was wondering whether this is a bug, or whether there is an easy way to improve performance for this use case (a common one, at least for me). I know the comparison is not really fair, since the workaround exploits prior knowledge that the column contains no missing values, but a 10x performance penalty seems steep.

using DataFrames, BenchmarkTools, Dates, StatsBase
using Random  # provides randperm, used for the shuffle below
function mwedates()
  #build the sample
  dts = reduce(vcat, [[Date(2011,11,11) + Day(i) for j in 1:10^4] for i in 1:100])
  mdts = dts |> Vector{Union{Date, Missing}}
  id = reduce(vcat, [[j for j in 1:10^4] for i in 1:100])
  df = DataFrame(date=dts, mdate = mdts, id=id)

  #shuffle
  df = df[randperm(10^6), :]


  print("sort date and id: ")
  @btime sort($df, [:date, :id])
  print("sort date(with missings) and id: ")
  @btime sort($df, [:mdate, :id])
  print("work around performance: ")
  @btime begin
    $df.mdateconverted = $df.mdate |> Vector{Date}
    sort($df, [:mdateconverted, :id])
  end
end

mwedates()

Output:

sort date and id:   291.772 ms (65029 allocations: 92.79 MiB)
sort date(with missings) and id:   2.883 s (105417916 allocations: 1.66 GiB)
work around performance:   298.849 ms (65042 allocations: 108.05 MiB)
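
For anyone hitting the same slowdown on an older release, the workaround above can be made a bit safer by narrowing the element type only after checking that the column really has no missing values. The following is a minimal sketch using only Base and Dates; `narrow_if_no_missing` is a hypothetical helper name for illustration, not something DataFrames provides. (DataFrames does export `disallowmissing!`, which performs a similar conversion on data frame columns.)

```julia
using Dates

# Hypothetical helper (not part of DataFrames): narrow the element type
# only when the vector actually contains no missing values.
function narrow_if_no_missing(v::AbstractVector)
    any(ismissing, v) && return v      # a missing is present: return v unchanged
    T = nonmissingtype(eltype(v))      # e.g. Union{Date, Missing} -> Date
    return convert(Vector{T}, v)       # copy into a vector with the narrower eltype
end

# A small column typed to allow missings, but containing none.
mdts = Union{Date, Missing}[Date(2011, 11, 11) + Day(i) for i in 1:5]
dts = narrow_if_no_missing(mdts)
```

Here `dts` should have element type `Date`, so sorting it avoids the `Union{Date, Missing}` code path. The check costs one pass over the column, which is small next to the cost of the sort itself.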
@bkamins
Member

bkamins commented May 2, 2021

Fixed in #2746

@bkamins bkamins added this to the patch milestone May 2, 2021
@clintonTE
Author

Thank you!

@bkamins
Member

bkamins commented May 31, 2021

Some tests of main branch would be appreciated :).

@clintonTE
Author

Ah, I forgot to check back. Main is working great. My new timings:

sort date and id:   213.129 ms (65037 allocations: 92.79 MiB)
sort date(with missings) and id:   255.404 ms (65036 allocations: 92.79 MiB)
work around performance:   209.793 ms (65050 allocations: 108.05 MiB)

Love the 30% improvement in the baseline on top of the 10x improvement in the test case.

@bkamins
Member

bkamins commented Jun 18, 2021

Thank you!


2 participants