Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sortperm! with initialized=true #47977

Closed
projekter opened this issue Dec 23, 2022 · 1 comment · Fixed by #47979
Closed

sortperm! with initialized=true #47977

projekter opened this issue Dec 23, 2022 · 1 comment · Fixed by #47979
Labels
docs This change adds or pertains to documentation sorting Put things in order

Comments

@projekter
Copy link

The documentation of sortperm! says:

Like sortperm, but accepts a preallocated index vector ix. If initialized is false (the default), ix is initialized to contain the values 1:length(v).

Maybe the innocent word "index" is already supposed to tell the reader exactly what to expect from this function. But it did not for me. I would even say that an index vector ix is any vector that contains nonnegative integers, since for any such ix we can find another vector x such that x[ix] is a valid indexing operation.

So here's my expectation: v can be put to a certain order, and usually this is what v[ix] should do. However, the fact that we can switch off initialization at all suggests that ix = 1:length(v) is a prerequisite for sortperm! to behave like this. And if ix contains other values, then these other values are put in the same order as 1:length(v) would have been put. Example:

julia> sortperm!([5,6,7,8], [4,3,2,1], initialized=true)
4-element Vector{Int64}:
 8
 7
 6
 5

The vector [4,3,2,1] can be sorted by reversing the order. So usually, sortperm! would reverse 1:4, but now it has to reverse 5:8.
If you check what the example does, you may indeed find that my expectations are fulfilled. However, just running it multiple times will show that you can really get anything:

julia> import StatsBase
julia> sort!(collect(StatsBase.countmap([sortperm!([5,6,7,8], [4,3,2,1], initialized=true) for _ in 1:100000])))
24-element Vector{Pair{Vector{Int64}, Int64}}:
 [5, 6, 7, 8] => 93038
 [5, 6, 8, 7] => 103
 [5, 7, 6, 8] => 940
 [5, 7, 8, 6] => 77
 [5, 8, 6, 7] => 61
 [5, 8, 7, 6] => 2447
 [6, 5, 7, 8] => 7
 [6, 5, 8, 7] => 17
 [6, 7, 5, 8] => 1
 [6, 7, 8, 5] => 30
 [6, 8, 5, 7] => 17
 [6, 8, 7, 5] => 2
 [7, 5, 6, 8] => 432
 [7, 5, 8, 6] => 14
 [7, 6, 5, 8] => 332
 [7, 6, 8, 5] => 13
 [7, 8, 5, 6] => 14
 [7, 8, 6, 5] => 5
 [8, 5, 6, 7] => 22
 [8, 5, 7, 6] => 1314
 [8, 6, 5, 7] => 9
 [8, 6, 7, 5] => 12
 [8, 7, 5, 6] => 184
 [8, 7, 6, 5] => 909

Indeed, there are 24 permutations of four distinct elements, and you can get all of them. Some are more likely than others... Well, quicksort is unstable, but for unique input vectors, I'd still expect unique outputs.

So, first lesson that should either go into the documentation or into a rewrite of the function: ix must be a subset of 1:length(v), else the results are nondeterministic.

Second question: What happens if ix is such a subset, but it is unordered?

julia> while true
    tmp = sortperm!([2,3,1,4], [4,3,2,1], initialized=true)
    tmp == [4,3,2,1] || break
end

This looks like it never terminates. So actually, the order in which the indices are initially stored is completely irrelevant - this should also go into the documentation, as it tells the user when initialization is actually not necessary.

So then the final usage instructions for initialized are: Use false whenever your ix is not a subset of 1:length(v), else the results will be nondeterministic. Use true whenever they are such a subset, in whatever order. In particular, sortperm! with initialized=true can be called multiple times in a row with different vectors (of the same size), but ix does not need to be re-initialized.

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
  Threads: 8 on 16 virtual cores
@projekter projekter changed the title Documentation: sortperm! with initialized=false Documentation: sortperm! with initialized=true Dec 23, 2022
@LilithHafner LilithHafner changed the title Documentation: sortperm! with initialized=true sortperm! with initialized=true Dec 23, 2022
@LilithHafner LilithHafner added docs This change adds or pertains to documentation sorting Put things in order labels Dec 23, 2022
@LilithHafner
Copy link
Member

You can also get a segfault from this

sortperm!(rand(Int, 100), rand(Int, 100), initialized=true);

The current docstring is

Like sortperm, but accepts a preallocated index vector or array ix with the same axes as A. If initialized is false (the default), ix is initialized to contain the values LinearIndices(A).

Which is a bit better but still does not fully explain initialized and its risks.

Perhaps we should remove the initialized option altogether (ignore it and always initialize). Behavior when initialized=true and ix is not a permutation of LinearIndices(v) is currently undefined (it is not documented anywhere and therefore not defined) and I don't think anyone depends it. The reason people use initialized=true is for performance, but the performance savings are negligible:

julia> for i in 1:12
           n = 2^i
           v = rand(n)
           ix = collect(eachindex(v))
           total_time = @belapsed sortperm!(copyto!($ix, eachindex($v)), $v, initialized=true) setup=(rand!($v))
           copy_time = @belapsed copyto!($ix, eachindex($v))
           @printf "%4i | %7.2g %7.2g %.1f%%\n" n total_time copy_time copy_time/(total_time-copy_time)*100
       end
 len | sort    init    possible savings from setting `initialized=true`
   2 | 5.3e-07 4.9e-09 0.9%
   4 | 5.4e-07 5.9e-09 1.1%
   8 | 5.7e-07 6.7e-09 1.2%
  16 | 8.1e-07 6.3e-09 0.8%
  32 | 1.2e-06 8.4e-09 0.7%
  64 | 2.1e-06 8.8e-09 0.4%
 128 | 3.9e-06 1.4e-08 0.4%
 256 | 7.8e-06 2.2e-08 0.3%
 512 | 1.6e-05 4.2e-08 0.3%
1024 | 3.3e-05 7.8e-08 0.2%
2048 | 7.1e-05 1.5e-07 0.2%
4096 | 0.00015 3.7e-07 0.2%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs This change adds or pertains to documentation sorting Put things in order
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants