Fix type instability in sort for few columns case and fix issorted bug #2746

Merged
merged 11 commits into from
May 30, 2021

@bkamins bkamins commented May 2, 2021

Fixes #2745 by loop unrolling.

It is probably possible to do this in a smarter way using metaprogramming (but maybe not), so I am marking it as a draft.

bkamins commented May 2, 2021

We have a bad design of sorting in general:

julia> df = DataFrame(x=rand(10^7), y=rand(10^7));

julia> @time sort(df, [:x, :y]);
  5.751711 seconds (1.03 M allocations: 1021.159 MiB, 4.52% gc time)

julia> @time sort(df, [:x, order(:y, rev=true)]);
 18.962380 seconds (695.47 M allocations: 11.345 GiB, 5.67% gc time)

I will push an update (and then the code will be simplified with recursion)
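
One plausible reading of the timings above, given that the old DFPerm definition accepted an AbstractVector of orderings: a vector that mixes forward and reverse orderings has an abstract element type, while a tuple keeps the concrete type of each ordering. The sketch below uses only Base.Order and is an illustration of that effect, not the DataFrames.jl internals.

```julia
using Base.Order: Ordering, Forward, Reverse

# Same ordering for every column: the element type is concrete.
same_vec = [Forward, Forward]

# Mixed orderings: the vector's element type widens to the abstract Ordering,
# so each comparison needs a runtime lookup of the ordering type.
mixed_vec = [Forward, Reverse]

# A tuple records the concrete type of every element.
mixed_tup = (Forward, Reverse)

println(isconcretetype(eltype(same_vec)))   # true
println(eltype(mixed_vec))                  # the abstract Ordering type
println(isconcretetype(typeof(mixed_tup)))  # true
```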

bkamins commented May 2, 2021

@nalimilan - this should be ready for a look. The only issue is that sorting is, in general, expensive to compile. I will have to think about how to reduce this cost (though it may not be easy).

bkamins commented May 3, 2021

Timings after this PR:

julia> df = DataFrame(x=rand(10^7), y=rand(10^7));

julia> @btime sort($df, [:x, :y]);
  5.501 s (1029745 allocations: 1021.14 MiB)

julia> @btime sort($df, [:x, order(:y, rev=true)]);
  5.426 s (1029758 allocations: 1021.14 MiB)

and

julia> function mwedates()
         #build the sample
         dts = reduce(vcat, [[Date(2011,11,11) + Day(i) for j in 1:10^4] for i in 1:100])
         mdts = dts |> Vector{Union{Date, Missing}}
         id = reduce(vcat, [[j for j in 1:10^4] for i in 1:100])
         df = DataFrame(date=dts, mdate = mdts, id=id)

         #shuffle
         df = df[randperm(10^6), :]

         print("sort date and id: ")
         @btime sort($df, [:date, :id])
         print("sort date(with missings) and id: ")
         @btime sort($df, [:mdate, :id])
         print("work around performance: ")
         @btime begin
           $df.mdateconverted = $df.mdate |> Vector{Date}
           sort($df, [:mdateconverted, :id])
         end
       end
mwedates (generic function with 1 method)

julia> mwedates()
sort date and id:   260.934 ms (64997 allocations: 92.79 MiB)
sort date(with missings) and id:   329.115 ms (64997 allocations: 92.79 MiB)
work around performance:   272.197 ms (65004 allocations: 108.05 MiB)

So all is OK (I think the penalty for missing values has to be accepted, as the work-around relies on knowledge of the data).
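
The work-around in the benchmark converts a Union{Date, Missing} column to a plain Date vector, which only succeeds when the column holds no missing values. A minimal sketch of that idea with a guard, using only Base; `narrow_if_possible` is a hypothetical helper name, not a DataFrames.jl function.

```julia
# Narrow a Union{T, Missing} vector to Vector{T}, but only when it holds no
# missing values; otherwise return the input unchanged.
function narrow_if_possible(v::AbstractVector)
    T = nonmissingtype(eltype(v))
    return any(ismissing, v) ? v : convert(Vector{T}, v)
end

clean = Union{Int, Missing}[3, 1, 2]
dirty = Union{Int, Missing}[3, missing, 2]

narrowed = narrow_if_possible(clean)   # element type becomes Int
untouched = narrow_if_possible(dirty)  # still Union{Int, Missing}
```

The check itself scans the whole column, so it is knowledge the caller has to pay for or already possess.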

@bkamins bkamins marked this pull request as ready for review May 3, 2021 08:58
@clintonTE

so all is OK (the penalty of missing has to be accepted I think as work-around uses knowledge of the data)

Yeah, that little overhead is fantastic. Thank you!

@nalimilan nalimilan left a comment

Thanks! I wonder how this went unnoticed for so long.

Maybe one way to limit the compilation cost would be to sort column-wise rather than row-wise (starting with the last column)? Not sure whether that would be fast. Anyway that would require a deeper refactoring so this PR is useful even if we later change the approach.

#
# If a user only specifies a few columns, the DataFrame
# contained in the DFPerm only contains those columns, and
# the permutation induced by this ordering is used to
# sort the original (presumably larger) DataFrame

-struct DFPerm{O<:Union{Ordering, AbstractVector}, T<:Tuple} <: Ordering
+struct DFPerm{O<:Union{Ordering, Tuple{Vararg{Ordering}}},
Member

Would it make sense to make O always a Tuple{Vararg{Ordering}}? That would avoid the need for ord isa Ordering below.

Member Author

The problem is that if you do sort(df) you want a single Ordering that is reused to avoid excessive compilation. I would assume that the ord isa Ordering check is optimized out by the compiler, so it should carry no performance penalty.

Member Author

I tried simplifying it, but I always ended up with the compiled code allocating.

Member

I don't get it. Why would wrapping the Ordering in a one-element tuple trigger more compilation or allocations?

Member Author

Ah - a one-element Tuple is not a problem, but in that case we would still have to branch on whether the tuple has one element or matches the length of the column vector.

What allocates is having one tuple in which each element is itself a tuple consisting of an order and a column.

Member

OK. Anyway always wrapping Ordering in a tuple sounds simpler conceptually.

Member Author

I will change it if I can make it work fast.

Member

What's the conclusion regarding this?

Member Author

I decided to keep the union (as in the original design) because in the end the logic was simpler (through dispatch).
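
To make the trade-off in this thread concrete, here is a minimal sketch of a union-typed ordering field resolved through dispatch rather than an explicit isa branch. `MiniPerm` and `ordering_at` are hypothetical names for illustration; only the field names ord and cols follow the PR code, and the real DFPerm may differ.

```julia
using Base.Order: Ordering, Forward, Reverse

# Hypothetical stand-in for DFPerm: `ord` is either one Ordering shared by
# all columns or a tuple holding one Ordering per column.
struct MiniPerm{O<:Union{Ordering, Tuple{Vararg{Ordering}}}, T<:Tuple} <: Ordering
    ord::O
    cols::T
end

# Dispatch selects the ordering for column i; no explicit `isa` check needed.
ordering_at(ord::Ordering, i) = ord   # shared ordering, reused for every column
ordering_at(ords::Tuple, i) = ords[i] # per-column ordering

shared = MiniPerm(Forward, ([1, 2], [3.0, 4.0]))
mixed = MiniPerm((Forward, Reverse), ([1, 2], [3.0, 4.0]))
```

With a shared ordering, the type of `ord` does not depend on the number of columns, which matches the point above about reusing a single Ordering for sort(df).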

Co-authored-by: Milan Bouchet-Valat
bkamins commented May 3, 2021

Not sure whether that would be fast.

This would be slower AFAICT, because now we are short-circuiting (so in typical cases mostly only one comparison, on the first column, is needed).

bkamins commented May 3, 2021

We cannot use recursion, as it will break on very wide data frames. I will fix it tomorrow.

@nalimilan

This would be slower AFAICT, because now we are short-circuiting (so in typical cases mostly only one comparison, on the first column, is needed).

Good point. But I guess that depends on the types of sorted columns. When sorting on bitstype columns which can use optimized algorithms, I imagine that sorting one column at a time could be faster. I'm thinking about integer columns with small ranges which use counting sort in Base, floating point columns which use a special quick sort IIRC, or other types which could use radix sort (only via SortingAlgorithms currently). But we would have to be very careful to do that only for cases known to be fast anyway and that's tricky to get right.
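
The column-wise idea discussed here can be sketched with Base alone: a stable sort on the last key followed by a stable sort on the earlier key yields the same permutation as one row-wise lexicographic sort. This is only an illustration of the equivalence on toy data, not a proposal for how DataFrames.jl should implement it.

```julia
x = [2, 1, 1, 2]
y = [1, 2, 1, 0]

# Row-wise: compare (x, y) pairs lexicographically; this can short-circuit
# on x whenever the first key already differs.
rowwise = sortperm(collect(zip(x, y)))

# Column-wise: stable sort by the last key, then stable sort by the first key.
# Each pass touches a single homogeneous column but always runs in full.
p = sortperm(y; alg = MergeSort)
colwise = p[sortperm(x[p]; alg = MergeSort)]

println(rowwise)  # [3, 2, 4, 1]
println(colwise)  # [3, 2, 4, 1]
```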

bkamins commented May 4, 2021

When sorting on bitstype columns which can use optimized algorithms

I agree, but currently we have:

julia> using DataFrames

julia> using BenchmarkTools

julia> df = DataFrame(rand(1:10, 10^6, 2), :auto);

julia> @btime sort($df, :x1);
  82.997 ms (239220 allocations: 182.06 MiB)

julia> @btime sort($df, [:x1, :x2]);
  126.545 ms (239142 allocations: 170.64 MiB)

julia> @btime sort($df.x1);
  2.694 ms (3 allocations: 7.63 MiB)

julia> t = collect(zip(df.x1, df.x2)); @btime sort($t);
  65.424 ms (4 allocations: 22.89 MiB)

So I think the first step should be to make sorting on a single column fast (which we will do at some point; in general, sorting and reshaping are things to work on in the near future, as these are the areas where nothing new has been added for a long time). But for this PR I would concentrate on fixing the type instability issues.

bkamins commented May 4, 2021

Here are the benchmarks after fixing the recursion.
The conclusion: in general we are faster, unless someone sorts a very wide table on all columns (or, more generally, sorts on very many columns). In that case we are slower, although I use essentially the same code as before, so it seems the compiler cannot yet optimize this case well. I could fix it, but I think it is not worth it, as sorting on that many columns is rarely useful.

Additionally I have discovered a bug in issorted that is fixed now.

this PR

julia> using DataFrames, Random, StatsBase, Dates, BenchmarkTools

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> df = DataFrame(x=rand(10^6), y=rand(10^6));

julia> @time sort(df, [:x, :y]);
  1.554144 seconds (3.50 M allocations: 274.667 MiB, 3.56% gc time, 80.51% compilation time)

julia> @btime sort($df, [:x, :y]);
  275.927 ms (64998 allocations: 84.21 MiB)

julia> @time sort(df, [:x, order(:y, rev=true)]);
  0.823464 seconds (769.23 k allocations: 123.305 MiB, 1.69% gc time, 65.93% compilation time)

julia> @btime sort($df, [:x, order(:y, rev=true)]);
  279.058 ms (65005 allocations: 84.21 MiB)

julia> function mwedates()
                #build the sample
                dts = reduce(vcat, [[Date(2011,11,11) + Day(i) for j in 1:10^4] for i in 1:100])
                mdts = dts |> Vector{Union{Date, Missing}}
                id = reduce(vcat, [[j for j in 1:10^4] for i in 1:100])
                df = DataFrame(date=dts, mdate = mdts, id=id)

                #shuffle
                df = df[randperm(10^6), :]

                print("sort date and id: ")
                @btime sort($df, [:date, :id])
                print("sort date(with missings) and id: ")
                @btime sort($df, [:mdate, :id])
                print("work around performance: ")
                @btime begin
                  $df.mdateconverted = $df.mdate |> Vector{Date}
                  sort($df, [:mdateconverted, :id])
                end
              end
mwedates (generic function with 1 method)

julia> mwedates();
sort date and id:   276.888 ms (64997 allocations: 92.79 MiB)
sort date(with missings) and id:   345.657 ms (64997 allocations: 92.79 MiB)
work around performance:   279.926 ms (65005 allocations: 108.05 MiB)

julia> df = DataFrame(ones(10,1000), :auto);

julia> @time sort(df);
  0.603286 seconds (156.87 k allocations: 9.166 MiB, 99.74% compilation time)

julia> @btime sort(df);
  364.711 μs (6400 allocations: 322.86 KiB)

julia> df = DataFrame(ones(Int, 10000, 100), :auto);

julia> @time sort(df);
  0.502271 seconds (403.55 k allocations: 29.421 MiB, 1.04% gc time, 95.94% compilation time)

julia> @btime sort(df);
  15.511 ms (326 allocations: 7.72 MiB)

julia> df = DataFrame(ones(Bool, 100000, 15), :auto);

julia> @time sort(df);
  0.607865 seconds (735.55 k allocations: 43.240 MiB, 99.39% compilation time)

julia> @btime sort(df);
  3.011 ms (71 allocations: 2.20 MiB)

current main

julia> using DataFrames, Random, StatsBase, Dates, BenchmarkTools

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> df = DataFrame(x=rand(10^6), y=rand(10^6));

julia> @time sort(df, [:x, :y]);
  1.624158 seconds (3.52 M allocations: 276.283 MiB, 7.45% gc time, 81.75% compilation time)

julia> @btime sort($df, [:x, :y]);
  272.999 ms (64998 allocations: 84.21 MiB)

julia> @time sort(df, [:x, order(:y, rev=true)]);
  1.499732 seconds (58.89 M allocations: 1008.693 MiB, 3.29% gc time, 33.07% compilation time)

julia> @btime sort($df, [:x, order(:y, rev=true)]);
  1.031 s (58215802 allocations: 971.52 MiB)

julia> function mwedates()
                #build the sample
                dts = reduce(vcat, [[Date(2011,11,11) + Day(i) for j in 1:10^4] for i in 1:100])
                mdts = dts |> Vector{Union{Date, Missing}}
                id = reduce(vcat, [[j for j in 1:10^4] for i in 1:100])
                df = DataFrame(date=dts, mdate = mdts, id=id)

                #shuffle
                df = df[randperm(10^6), :]

                print("sort date and id: ")
                @btime sort($df, [:date, :id])
                print("sort date(with missings) and id: ")
                @btime sort($df, [:mdate, :id])
                print("work around performance: ")
                @btime begin
                  $df.mdateconverted = $df.mdate |> Vector{Date}
                  sort($df, [:mdateconverted, :id])
                end
              end
mwedates (generic function with 1 method)

julia> mwedates();
sort date and id:   379.087 ms (64997 allocations: 92.79 MiB)
sort date(with missings) and id:   2.419 s (105424568 allocations: 1.66 GiB)
work around performance:   385.658 ms (65005 allocations: 108.05 MiB)

julia> df = DataFrame(ones(10,1000), :auto);

julia> @time sort(df);
  0.606793 seconds (152.38 k allocations: 9.070 MiB, 99.87% compilation time)

julia> @btime sort(df);
  258.662 μs (1999 allocations: 254.09 KiB)

julia> df = DataFrame(ones(Int, 10000, 100), :auto);

julia> @time sort(df);
  0.431671 seconds (419.29 k allocations: 30.190 MiB, 1.32% gc time, 98.73% compilation time)

julia> @btime sort(df);
  3.117 ms (326 allocations: 7.72 MiB)

julia> df = DataFrame(ones(Bool, 100000, 15), :auto);

julia> @time sort(df);
  0.452559 seconds (474.43 k allocations: 27.729 MiB, 1.04% gc time, 99.14% compilation time)

julia> @btime sort(df);
  3.281 ms (74 allocations: 2.20 MiB)

@bkamins bkamins changed the title Fix type instability in sort for few columns case Fix type instability in sort for few columns case and fix issorted bug May 4, 2021
@bkamins bkamins added the bug label May 4, 2021
bkamins commented May 16, 2021

@nalimilan - no rush, but it would be good to review and merge this, as it fixes a bug in issorted.

function Sort.lt(o::DFPerm{<:Any, <:Tuple}, a, b)
    ord = o.ord
    cols = o.cols
    length(cols) > 16 && return unstable_lt(ord, cols, a, b)
@nalimilan nalimilan May 16, 2021

Add a comment explaining how the 16 threshold was chosen?

Suggested change
-length(cols) > 16 && return unstable_lt(ord, cols, a, b)
+# if there are too many columns fall back to type unstable mode to avoid high compilation cost
+# it is expected that in practice users sort data frames on only few columns
+length(cols) > 16 && return unstable_lt(ord, cols, a, b)

Member Author

Added. I have not tuned 16 specifically; I just assume that 16 is a safe threshold. We could probably use a higher number here, but I think one normally does not sort on more than about 4 columns.
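
The threshold pattern discussed in this thread can be sketched as follows: a comparison that peels one column at a time off tuples (so each column's concrete type is visible to the compiler) for narrow keys, with a plain loop as the fallback for wide ones. All function names and the `max_stable` keyword are hypothetical, and the bodies are assumptions for illustration; the actual Sort.lt and unstable_lt in the PR may be written differently.

```julia
using Base.Order: Forward, Reverse

# Tuple-peeling form: each call handles the first (ordering, column) pair
# and recurses on the rest, so every column is compared with a known type.
stable_lt(::Tuple{}, ::Tuple{}, a, b) = false
function stable_lt(ords::Tuple, cols::Tuple, a, b)
    o, c = first(ords), first(cols)
    Base.Order.lt(o, c[a], c[b]) && return true   # row a sorts before row b
    Base.Order.lt(o, c[b], c[a]) && return false  # row b sorts before row a
    return stable_lt(Base.tail(ords), Base.tail(cols), a, b)  # tie: next column
end

# Loop form: a single compiled body regardless of width; element types are
# not specialized per column, which keeps compilation cheap for wide keys.
function loop_lt(ords, cols, a, b)
    for (o, c) in zip(ords, cols)
        Base.Order.lt(o, c[a], c[b]) && return true
        Base.Order.lt(o, c[b], c[a]) && return false
    end
    return false
end

# Pick the strategy by the number of key columns (16 mirrors the PR's value).
row_lt(ords, cols, a, b; max_stable = 16) =
    length(cols) > max_stable ? loop_lt(ords, cols, a, b) : stable_lt(ords, cols, a, b)

x = [2, 1, 1]
y = [2.0, 1.0, 3.0]
# Sort row indices by x ascending, then y descending.
perm = sort!(collect(1:3); lt = (a, b) -> row_lt((Forward, Reverse), (x, y), a, b))
println(perm)  # [3, 2, 1]
```

Both forms short-circuit on the first column that differs, so the choice affects compilation cost and type stability rather than the result.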

Co-authored-by: Milan Bouchet-Valat
@bkamins bkamins merged commit 4389c04 into main May 30, 2021
@bkamins bkamins deleted the bkamins-patch-1-1 branch May 30, 2021 21:36
bkamins commented May 30, 2021

Thank you!

Successfully merging this pull request may close these issues.

Slow sorts in columns with Union{<:Any, missing} even if no missing values in the column