-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster implementations #1
Comments
What does
Thank you for noting down all these tips! |
Most processors today are able to perform operations on a chunk of data, often 8 units (4 float64s), and this macro mainly tells the compiler that it is all right to convert the code to use these "Single Instruction Multiple Data" instructions. I am not sure how experimental this actually is, and it's easy to remove if they for some reason change the syntax. It was actually also possible to get a substantial speedup without function prox3!(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
gf = gamma*f.lambda
@inbounds for i in 1:4:(floor(Int64,length(x)/4)*4)
x[i] += (x[i] <= -gf ? gf : (x[i] >= gf ? -gf : -x[i]))
x[i+1] += (x[i+1] <= -gf ? gf : (x[i+1] >= gf ? -gf : -x[i+1]))
x[i+2] += (x[i+2] <= -gf ? gf : (x[i+2] >= gf ? -gf : -x[i+2]))
x[i+3] += (x[i+3] <= -gf ? gf : (x[i+3] >= gf ? -gf : -x[i+3]))
end
@inbounds for i in (floor(Int64,length(x)/4)*4+1):length(x)
x[i] += (x[i] <= -gf ? gf : (x[i] >= gf ? -gf : -x[i]))
end
return x
end but I think |
After reading some more about simd (https://software.intel.com/en-us/articles/vectorization-in-julia) I was able to speed it up a bit more and added the example to the original post. |
As we discussed, there are several ways to speed up the implementations of the prox operators. I did some tests that may be useful in rewriting them.
These are the results for the NormL1Prox, for 100 prox operations on a vector of size 1000000.
Standard implementation
In-place update to avoid big allocations:
Avoid bounds-checks and use simd:
Avoid function calls and all but one multiplication:
Use ifelse to ensure that the compiler is not worried about taking both branches
Results:
prox: 1.557599 seconds (1.70 k allocations: 3.725 GB, 8.86% gc time)
prox1!: 0.235333 seconds (100 allocations: 1.563 KB) (6.5x speedup)
prox2!: 0.085790 seconds (100 allocations: 1.563 KB) (2.7 x speedup)
prox3!: 0.080576 seconds (100 allocations: 1.563 KB) (1.05x speedup) (total 19x speedup)
prox4!: 0.074251 seconds (100 allocations: 1.563 KB) (1.08x speedup) (total 21x speedup)
The text was updated successfully, but these errors were encountered: