
Clean up interpolation codes using EllipsisNotation #214

Open
ChrisRackauckas opened this issue Oct 20, 2017 · 23 comments

Comments

@ChrisRackauckas
Member

Right now the default for idxs is nothing. Then in all of the interpolation codes, there needs to be a switch between A[idxs] and A. But if we use EllipsisNotation.jl's A[..], that is the identity so it compiles to A, which means we can just default to idxs = .. and get rid of a ton of the interpolation code.
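A minimal sketch of the idea, using a stand-in type instead of the package so it is self-contained (the Ellipsis/ellipsis names and the identity method are illustrative, not EllipsisNotation.jl's shipped API):

```julia
# Stand-in for EllipsisNotation.jl's `..`; names are hypothetical.
struct Ellipsis end
const ellipsis = Ellipsis()

# The proposed fast path: indexing with the ellipsis is the identity,
# so with a default of `idxs = ellipsis` the interpolants can always
# write `A[idxs]` and drop the `idxs === nothing` branches.
Base.getindex(A::AbstractArray, ::Ellipsis) = A

A = rand(3, 3)
A[ellipsis] === A  # the very same array: no copy, no branch
```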

@devmotion
Member

Afterwards we can apply the same change in DelayDiffEq to the history function.

@ranocha
Member

ranocha commented Oct 20, 2017

I think it is a good idea to clean up the interpolations and use EllipsisNotation.jl's `..`. However, we should think about the performance if broadcasting is used.

using EllipsisNotation, BenchmarkTools

function foo!(A, B, C, D, E, F)
  @. A = B + 2C + 5D + E^2 + F^3
end

function bar!(A, B, C, D, E, F)
  @. A[..] = B[..] + 2C[..] + 5D[..] + E[..]^2 + F[..]^3
end

function foobar!(A, B, C, D, E, F)
  @. @views A[..] = B[..] + 2C[..] + 5D[..] + E[..]^2 + F[..]^3
end

N = 10^2; A=rand(N,N); B=rand(N,N); C=rand(N,N); D=rand(N,N); E=rand(N,N); F = rand(N,N);

gives on my machine

julia> @benchmark foo!(A, B, C, D, E, F)
BenchmarkTools.Trial: 
  memory estimate:  224 bytes
  allocs estimate:  4
  --------------
  minimum time:     38.356 μs (0.00% GC)
  median time:      40.373 μs (0.00% GC)
  mean time:        40.360 μs (0.00% GC)
  maximum time:     58.256 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark bar!(A, B, C, D, E, F)
BenchmarkTools.Trial: 
  memory estimate:  391.22 KiB
  allocs estimate:  21
  --------------
  minimum time:     83.906 μs (0.00% GC)
  median time:      94.716 μs (0.00% GC)
  mean time:        105.742 μs (9.49% GC)
  maximum time:     860.058 μs (83.45% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark foobar!(A, B, C, D, E, F)
BenchmarkTools.Trial: 
  memory estimate:  240 bytes
  allocs estimate:  5
  --------------
  minimum time:     79.951 μs (0.00% GC)
  median time:      80.943 μs (0.00% GC)
  mean time:        81.167 μs (0.00% GC)
  maximum time:     139.661 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Just another question: Do we really need all the checks if out == nothing in the interpolants? Or should the possibility out == nothing be excluded?

@ChrisRackauckas
Member Author

We should just add a fast path that avoids the to_indices method then. Try adding:

Base.getindex(A,::Val{:..}) = A

and see if that makes it free. (To make to_indices really free, we may need to make it @generated.)

@devmotion
Member

Regarding out == nothing: currently it cannot be excluded; with DelayDiffEq I discovered errors for interpolants which excluded that case. However, it is indeed unpleasant to always consider all these different cases.

I just had a look at these interpolants again, and I remember that errors were caused by
https://github.com/JuliaDiffEq/OrdinaryDiffEq.jl/blob/master/src/dense/generic_dense.jl#L326
and its call to ode_interpolant!. However, I just thought maybe it is better to call ode_interpolant instead (at least it seems more reasonable) - do any of you know why it was not that way in the first place? Is there any advantage of calling ode_interpolant! instead of ode_interpolant? I assume this change would prevent most of the calls of ode_interpolant! with out == nothing but I do not know whether it catches all of them.

@ChrisRackauckas
Member Author

I just had a look at these interpolants again, and I remember that errors were caused by
https://github.com/JuliaDiffEq/OrdinaryDiffEq.jl/blob/master/src/dense/generic_dense.jl#L326
and its call to ode_interpolant!. However, I just thought maybe it is better to call ode_interpolant instead (at least it seems more reasonable) - do any of you know why it was not that way in the first place? Is there any advantage of calling ode_interpolant! instead of ode_interpolant? I assume this change would prevent most of the calls of ode_interpolant! with out == nothing but I do not know whether it catches all of them.

That sounds like it should work.

@ranocha
Member

ranocha commented Oct 20, 2017

I've tried adding @inline Base.getindex(A, ::Val{:..}) = A to EllipsisNotation.jl and ran the following code

function f_loop!(A, B, C, D, E, F)
    @boundscheck begin
        size(A) == size(B) == size(C) == size(D) == size(E) == size(F)
    end
    @inbounds @simd for idx in eachindex(A)
        A[idx] = B[idx] + 2C[idx] + 5D[idx] + E[idx]^2 + F[idx]^3
    end
    nothing
end

function f_loop!(A, B, C, D, E)
    @boundscheck begin
        size(A) == size(B) == size(C) == size(D) == size(E)
    end
    @inbounds @simd for idx in eachindex(A)
        A[idx] = B[idx] + 2C[idx] + 5D[idx] + E[idx]^2
    end
    nothing
end

function f_noindex!(A, B, C, D, E, F)
    @. A = B + 2C + 5D + E^2 + F^3
    nothing
end

function f_noindex!(A, B, C, D, E)
    @. A = B + 2C + 5D + E^2
    nothing
end

function f_index!(A, B, C, D, E, F)
    @. A[..] = B[..] + 2C[..] + 5D[..] + E[..]^2 + F[..]^3
    nothing
end

function f_index!(A, B, C, D, E)
    @. A[..] = B[..] + 2C[..] + 5D[..] + E[..]^2
    nothing
end

function f_semiindex!(A, B, C, D, E, F)
    @. A = B[..] + 2C[..] + 5D[..] + E[..]^2 + F[..]^3
    nothing
end

function f_semiindex!(A, B, C, D, E)
    @. A = B[..] + 2C[..] + 5D[..] + E[..]^2
    nothing
end

function f_view!(A, B, C, D, E, F)
    @. @views A[..] = B[..] + 2C[..] + 5D[..] + E[..]^2 + F[..]^3
    nothing
end

function f_view!(A, B, C, D, E)
    @. @views A[..] = B[..] + 2C[..] + 5D[..] + E[..]^2
    nothing
end

N = 10^2; A=rand(N,N); B=rand(N,N); C=rand(N,N); D=rand(N,N); E=rand(N,N); F = rand(N,N);

@benchmark f_loop!(A, B, C, D, E, F)
@benchmark f_noindex!(A, B, C, D, E, F)
@benchmark f_index!(A, B, C, D, E, F)
@benchmark f_semiindex!(A, B, C, D, E, F)
@benchmark f_view!(A, B, C, D, E, F)

@benchmark f_loop!(A, B, C, D, E)
@benchmark f_noindex!(A, B, C, D, E)
@benchmark f_index!(A, B, C, D, E)
@benchmark f_semiindex!(A, B, C, D, E)
@benchmark f_view!(A, B, C, D, E)

This gives

julia> @benchmark f_loop!(A, B, C, D, E, F)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.758 μs (0.00% GC)
  median time:      9.852 μs (0.00% GC)
  mean time:        9.889 μs (0.00% GC)
  maximum time:     41.211 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark f_noindex!(A, B, C, D, E, F)
BenchmarkTools.Trial: 
  memory estimate:  224 bytes
  allocs estimate:  4
  --------------
  minimum time:     38.215 μs (0.00% GC)
  median time:      40.363 μs (0.00% GC)
  mean time:        40.367 μs (0.00% GC)
  maximum time:     55.525 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark f_index!(A, B, C, D, E, F)
BenchmarkTools.Trial: 
  memory estimate:  752 bytes
  allocs estimate:  30
  --------------
  minimum time:     50.090 μs (0.00% GC)
  median time:      50.295 μs (0.00% GC)
  mean time:        50.422 μs (0.00% GC)
  maximum time:     90.744 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark f_semiindex!(A, B, C, D, E, F)
BenchmarkTools.Trial: 
  memory estimate:  224 bytes
  allocs estimate:  4
  --------------
  minimum time:     38.280 μs (0.00% GC)
  median time:      40.192 μs (0.00% GC)
  mean time:        40.258 μs (0.00% GC)
  maximum time:     67.977 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark f_view!(A, B, C, D, E, F)
BenchmarkTools.Trial: 
  memory estimate:  192 bytes
  allocs estimate:  4
  --------------
  minimum time:     68.416 μs (0.00% GC)
  median time:      69.723 μs (0.00% GC)
  mean time:        70.716 μs (0.00% GC)
  maximum time:     124.193 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

and

julia> @benchmark f_loop!(A, B, C, D, E)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.585 μs (0.00% GC)
  median time:      7.698 μs (0.00% GC)
  mean time:        7.717 μs (0.00% GC)
  maximum time:     15.267 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> @benchmark f_noindex!(A, B, C, D, E)
BenchmarkTools.Trial: 
  memory estimate:  176 bytes
  allocs estimate:  4
  --------------
  minimum time:     9.125 μs (0.00% GC)
  median time:      9.289 μs (0.00% GC)
  mean time:        9.322 μs (0.00% GC)
  maximum time:     41.743 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark f_index!(A, B, C, D, E)
BenchmarkTools.Trial: 
  memory estimate:  704 bytes
  allocs estimate:  30
  --------------
  minimum time:     26.109 μs (0.00% GC)
  median time:      26.317 μs (0.00% GC)
  mean time:        26.707 μs (0.00% GC)
  maximum time:     64.859 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark f_semiindex!(A, B, C, D, E)
BenchmarkTools.Trial: 
  memory estimate:  176 bytes
  allocs estimate:  4
  --------------
  minimum time:     9.153 μs (0.00% GC)
  median time:      9.338 μs (0.00% GC)
  mean time:        9.416 μs (0.00% GC)
  maximum time:     41.334 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark f_view!(A, B, C, D, E)
BenchmarkTools.Trial: 
  memory estimate:  144 bytes
  allocs estimate:  3
  --------------
  minimum time:     33.440 μs (0.00% GC)
  median time:      33.649 μs (0.00% GC)
  mean time:        33.803 μs (0.00% GC)
  maximum time:     68.334 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

1. f_semiindex! has the same performance as f_noindex!. What has to be defined in order to make f_index! the same as f_semiindex!?
2. f_loop! outperforms all other versions if at least six arguments are used. @ChrisRackauckas: You opened an issue for that, didn't you?

@ChrisRackauckas
Member Author

f_semiindex! has the same performance as f_noindex!. What has to be defined in order to make f_index! the same as f_semiindex!?

I think:

Base.setindex!(A,v,::Val{:..}) = (A = v)

f_loop! outperforms all other versions if at least six arguments are used. @ChrisRackauckas : You opened an issue for that, didn't you?

Yeah, this seems to be another case of JuliaLang/julia#22255

@ChrisRackauckas
Member Author

ChrisRackauckas commented Oct 20, 2017

I see what's going on with the to_indices code. Julia by default slices with [:,:,:], so that creates an extra copy. The override I posted avoids that, so technically it's a view which is itself. So it doesn't follow the normal semantics in this case, but that's okay with me. I can't think of another use for it.
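To make the copy-vs-identity distinction concrete (plain Julia, no package needed):

```julia
A = rand(4, 4)

# Default colon slicing materializes a copy with the same values...
B = A[:, :]
B == A    # equal contents
B === A   # but a different array

# ...whereas an identity override for `..` would return `A` itself,
# acting as a zero-cost "view" of the whole array rather than a copy.
```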

@ranocha
Member

ranocha commented Oct 20, 2017

Defining @inline Base.setindex!(A::AbstractArray, v, ::Val{:..}) = setindex!(A, v, :) does not help; the benchmarks are approximately the same. How can we avoid the extra copy induced by indexing on the left-hand side of =?

@ranocha
Member

ranocha commented Oct 20, 2017

In the end, the copy for elements on the LHS does not matter for the interpolations, since out will be on the LHS without any indices. But I'm really interested in what's going on and how the performance can be increased.

@ChrisRackauckas
Member Author

I think we need @mbauman's A.[..] change for it? I am not entirely sure.

@ranocha
Member

ranocha commented Oct 20, 2017

I just ran the code again and got an error - I don't know why I didn't see it before. Changing the EllipsisNotation.jl code to

@inline Base.getindex(A::AbstractArray, ::Val{:..}) = A
@inline Base.setindex!(A::AbstractArray, v, ::Val{:..}) = setindex!(A, v, :)

results in the following timings

julia> @benchmark f_loop!(A, B, C, D, E)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.284 μs (0.00% GC)
  median time:      8.343 μs (0.00% GC)
  mean time:        8.350 μs (0.00% GC)
  maximum time:     17.075 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

julia> @benchmark f_noindex!(A, B, C, D, E)
BenchmarkTools.Trial: 
  memory estimate:  176 bytes
  allocs estimate:  4
  --------------
  minimum time:     9.212 μs (0.00% GC)
  median time:      9.352 μs (0.00% GC)
  mean time:        9.364 μs (0.00% GC)
  maximum time:     40.757 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark f_index!(A, B, C, D, E)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.467 μs (0.00% GC)
  median time:      17.517 μs (0.00% GC)
  mean time:        17.556 μs (0.00% GC)
  maximum time:     29.877 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark f_semiindex!(A, B, C, D, E)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.452 μs (0.00% GC)
  median time:      17.506 μs (0.00% GC)
  mean time:        17.540 μs (0.00% GC)
  maximum time:     40.670 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark f_view!(A, B, C, D, E)
BenchmarkTools.Trial: 
  memory estimate:  144 bytes
  allocs estimate:  3
  --------------
  minimum time:     33.303 μs (0.00% GC)
  median time:      33.513 μs (0.00% GC)
  mean time:        33.640 μs (0.00% GC)
  maximum time:     114.704 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Thus, f_index! and f_semiindex! perform the same, but f_noindex! is really better.

@ranocha
Member

ranocha commented Oct 20, 2017

The timings for f_noindex! depend on the way the benchmark is performed. The above timings are in global scope without const declarations or use of $ for @benchmark. Using one or both of them makes the timings for f_noindex! the same as the other broadcasting ones.

What kind of benchmarking should be used? Does broadcasting really take more than twice the time as the loop? Does broadcasting use SIMD?

@ChrisRackauckas
Member Author

What kind of benchmarking should be used?

This is good, with $ and @benchmark.

Does broadcasting use SIMD?

It should automatically use SIMD when it's necessary and Julia is set to -O3.

Does broadcasting really take more than twice the time as the loop?

No, @devmotion's tests from before show that's not the case, but we're probably hitting JuliaLang/julia#22255 here. The indexing might count as operations here, lowering the number that's allowed even more...

@ranocha
Member

ranocha commented Oct 20, 2017

Okay. Then I'll open a PR to EllipsisNotation.jl with @inline Base.getindex(A::AbstractArray, ::Val{:..}) = A. This is the only operation that is really necessary; defining setindex! does not have any influence on the timings with $.

@ranocha
Member

ranocha commented Oct 20, 2017

What should we do with code such as for (j,i) in enumerate(idxs)? Just leave it there for the moment and check for idxs == .. instead of idxs == nothing or change to broadcasting?
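For concreteness, the two options look roughly like this (a sketch with simplified signatures; the doubling formula stands in for the actual interpolant polynomial):

```julia
# Option 1: keep the explicit loop over enumerated indices
function interp_loop!(out, y, idxs)
    for (j, i) in enumerate(idxs)
        out[j] = 2 * y[i]          # placeholder for the interpolant formula
    end
    out
end

# Option 2: broadcast over a view of the selected indices
function interp_bc!(out, y, idxs)
    @views out .= 2 .* y[idxs]
    out
end
```

Both fill out from the entries of y selected by idxs; the broadcast form is shorter, but as the benchmarks above suggest, the view may cost some performance.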

@ranocha
Member

ranocha commented Oct 20, 2017

Another important question: If idxs == .., not using views yields higher performance (see above). If idxs != .., views can yield more performance. What should we do?

@ChrisRackauckas
Member Author

This case is a little weird: if you give a preallocated vector of length 5 to get idxs = 1:2:11, you'll get an error because it'll be indexing to the higher values, so on the left you need to do something like 1:length(idxs) for the broadcast.
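A sketch of that bounds problem and the 1:length(idxs) workaround (the array contents and the doubling formula are placeholders):

```julia
y = collect(1.0:11.0)
idxs = 1:2:11                  # 6 indices into y
out = zeros(length(idxs))      # preallocated output of matching length

# This throws a BoundsError: `idxs` reaches past the end of `out`.
# @. out[idxs] = 2 * y[idxs]

# Indexing the left-hand side by position instead works:
n = length(idxs)
@views out[1:n] .= 2 .* y[idxs]
```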

If idxs == .., not using views yields higher performance (see above). If idxs != .., views can yield more performance. What should we do?

For this reason and because of the broadcast limit, we may want to wait for some compiler changes in Base before changing this.

@ranocha
Member

ranocha commented Oct 20, 2017

So in the end, we shouldn't change the broadcasting and indexing with .. now, but it may be useful to have a look at ode_interpolant!(nothing, ...) and remove some of the branches, correct?

@ChrisRackauckas
Member Author

Yeah, that should be able to remove some branches. Then when views are stack-allocated and the broadcast limit is removed we can collapse the others.

@ranocha
Member

ranocha commented Oct 20, 2017

As @devmotion said,
https://github.com/JuliaDiffEq/OrdinaryDiffEq.jl/blob/19c347c7e8070a796768b42e507e3bf95ed227d4/src/dense/generic_dense.jl#L325-L326
seems to be the problematic part requiring the check if out == nothing. We can't simply replace ode_interpolant!(nothing, ...) by ode_interpolant(...), since the function will just call itself in this case.

Is ode_interpolant called only from one common place? If not, I can't think of anything simpler than inserting another function into the call chain; something like: ode_interpolant does the checks (for mutable and immutable caches) and calls ode_interpolant_implementation or ode_interpolant_implementation! afterwards. This does not seem optimal... Any better ideas?
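The insert-a-function idea could be sketched like this (all names, including the _impl suffix, are placeholders; the real OrdinaryDiffEq functions take many more arguments):

```julia
# Entry points: handle the `nothing` cases once, then dispatch to kernels.
ode_interp(y, idxs) = ode_interp_impl(y, idxs === nothing ? eachindex(y) : idxs)

function ode_interp!(out, y, idxs)
    out === nothing && return ode_interp(y, idxs)  # out-of-place fallback
    ode_interp_impl!(out, y, idxs === nothing ? eachindex(y) : idxs)
end

# Kernels: no `nothing` checks below this point.
ode_interp_impl(y, idxs) = [2 * y[i] for i in idxs]   # placeholder formula
function ode_interp_impl!(out, y, idxs)
    for (j, i) in enumerate(idxs)
        out[j] = 2 * y[i]
    end
    out
end
```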

@ChrisRackauckas
Member Author

I think something like this has to be done. I'm not sure about the naming though.

@devmotion
Member

I conducted some benchmarks on functions of the form of our interpolation functions with different number of polynomial coefficients (neglecting the constant amount of time needed for calculation of those coefficients): https://gist.github.com/devmotion/0866639f85ffb305e0485f00ae7f4025

Benchmarking was done on 0.7-rc2 and compared implementations with loops and broadcasts as well as no indexing and indexing with arrays and ellipsis notation (using the master branch of EllipsisNotation).

I haven't checked all results in detail but it seems without indexing the broadcast implementation is only a bit slower than the implementation with loops even for 15 coefficients. As expected, results with ellipsis notation are similar. Indexing with views adds a significant overhead of approximately a factor of 2 compared to an implementation with loops, it seems.
