Love the work done on this package. I had to do some data work in Pandas the other day, and ... nothing against Pandas, it has its place, but it really reminded me how great the DataFrames package has become.
Anyway, I was wondering if this is a bug, or whether there is an easy way to improve the performance of this (common, at least for me) use case: sorting on a column whose element type allows missing values but which contains none. I know it's not really fair since I'm exploiting prior knowledge about the column contents, but a 10x performance penalty seems steep.
using DataFrames, BenchmarkTools, Dates, Random, StatsBase

function mwedates()
    # build the sample: 100 distinct dates, each repeated for 10^4 ids
    dts = reduce(vcat, [[Date(2011, 11, 11) + Day(i) for j in 1:10^4] for i in 1:100])
    # same dates, but the element type allows missing
    mdts = dts |> Vector{Union{Date, Missing}}
    id = reduce(vcat, [[j for j in 1:10^4] for i in 1:100])
    df = DataFrame(date=dts, mdate=mdts, id=id)

    # shuffle (randperm comes from Random)
    df = df[randperm(10^6), :]

    print("sort date and id: ")
    @btime sort($df, [:date, :id])

    print("sort date(with missings) and id: ")
    @btime sort($df, [:mdate, :id])

    print("work around performance: ")
    @btime begin
        # convert to a plain Date column first, then sort on it
        $df.mdateconverted = $df.mdate |> Vector{Date}
        sort($df, [:mdateconverted, :id])
    end
end

mwedates()
Output:
sort date and id: 291.772 ms (65029 allocations: 92.79 MiB)
sort date(with missings) and id: 2.883 s (105417916 allocations: 1.66 GiB)
work around performance: 298.849 ms (65042 allocations: 108.05 MiB)
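For what it's worth, the same workaround can also be written with `disallowmissing`, which DataFrames provides for narrowing a column's element type. This is only a sketch on a tiny made-up frame (the three rows are illustrative, not the benchmark data), and I have not timed it against the `Vector{Date}` conversion above:

```julia
using DataFrames, Dates

# Tiny stand-in for the benchmark frame: `mdate` allows missing but holds none.
df = DataFrame(
    mdate = Union{Date, Missing}[Date(2011, 11, 13), Date(2011, 11, 12), Date(2011, 11, 12)],
    id = [1, 2, 1],
)

# disallowmissing returns a new frame whose :mdate column has eltype Date;
# it throws an error if the column actually contains a missing value.
strict = disallowmissing(df, :mdate)

# Sort on the narrowed column, same keys as in the benchmark.
sorted = sort(strict, [:mdate, :id])
```

The original `df` is left untouched here; `disallowmissing!(df, :mdate)` would narrow the column in place instead.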
Ah, I forgot to check back. Main is working great. My new timings:
sort date and id: 213.129 ms (65037 allocations: 92.79 MiB)
sort date(with missings) and id: 255.404 ms (65036 allocations: 92.79 MiB)
work around performance: 209.793 ms (65050 allocations: 108.05 MiB)
Love the 30% improvement in the baseline on top of the 10x improvement in the test case.