Make sure LLVM knows our allocation alignment #48139
I just tried this to manually align to 32 bytes for AVX2 (is reinterpret + view too difficult for the compiler?):

```julia
function aligned_32_simd_loop_avx2(n)
    # over-allocate so a 32-byte boundary is always reachable
    xs = zeros(UInt8, sizeof(Float64) * n + 31)
    p = UInt(pointer(xs))
    # byte offset from the start of xs to the next 32-byte boundary
    offset = ((p + 31) & -32) - p
    ys = reinterpret(Float64, view(xs, 1+offset:(offset + n * sizeof(Float64))))
    @simd for i in Base.OneTo(n)
        @inbounds ys[i] += 2.0 # uses vmovupd not vmovapd
    end
    return ys
end
```
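For what it's worth, a standard way to double-check which move instructions the loop actually compiles to (plain introspection, not anything specific to this issue) is:

```julia
using InteractiveUtils  # for @code_native outside the REPL

# Look for vmovapd (aligned) vs vmovupd (unaligned) in the vector body.
@code_native aligned_32_simd_loop_avx2(1024)
```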
There's not even a need for

```asm
.LBB0_8:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
; ││ @ simdloop.jl:77 within `macro expansion` @ REPL[11]:8
; ││┌ @ int.jl:87 within `+`
	vpaddb	zmm1, zmm0, zmmword ptr [rdx + rsi - 192]
	vpaddb	zmm2, zmm0, zmmword ptr [rdx + rsi - 128]
	vpaddb	zmm3, zmm0, zmmword ptr [rdx + rsi - 64]
	vpaddb	zmm4, zmm0, zmmword ptr [rdx + rsi]
; ││└
; ││┌ @ subarray.jl:351 within `setindex!` @ array.jl:969
	vmovdqu64	zmmword ptr [rdx + rsi - 192], zmm1
	vmovdqu64	zmmword ptr [rdx + rsi - 128], zmm2
	vmovdqu64	zmmword ptr [rdx + rsi - 64], zmm3
	vmovdqu64	zmmword ptr [rdx + rsi], zmm4
; ││└
; ││ @ simdloop.jl:78 within `macro expansion`
; ││┌ @ int.jl:87 within `+`
	add	rsi, 256
	cmp	rdi, rsi
	jne	.LBB0_8
```
```julia
julia> x = rand(Float64, 120);

julia> pushfirst!(x, 0.0);

julia> Int(pointer(x)) % 16
8

julia> reshape(x, (11,11)) |> typeof
Matrix{Float64} (alias for Array{Float64, 2})

julia> Int(pointer(reshape(x, (11,11)))) % 16
8

julia> x = rand(Float64, 122);

julia> popfirst!(x);

julia> Int(pointer(x)) % 16
8

julia> reshape(x, (11,11)) |> typeof
Matrix{Float64} (alias for Array{Float64, 2})

julia> Int(pointer(reshape(x, (11,11)))) % 16
8
```

So we can't assume 16-byte alignment once the data pointer has been shifted by `pushfirst!`/`popfirst!`. Aligned loads/stores may be faster, but on recent architectures the aligned and unaligned instructions perform the same when the address is in fact aligned.
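A minimal sketch of the same observation, using a made-up `ptr_alignment` helper (not part of Base), just to make the pointer shift visible:

```julia
# Hypothetical helper: largest power-of-two alignment of an array's data pointer.
ptr_alignment(x::Array) = 1 << trailing_zeros(UInt(pointer(x)))

x = rand(Float64, 120)
ptr_alignment(x)    # a fresh allocation is typically 16-byte aligned or better
pushfirst!(x, 0.0)
ptr_alignment(x)    # the data pointer shifts by 8 bytes, so this can drop to 8
```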
Arrays are not the only objects with alignment constraints.
From a discussion on Slack with @gbaraldi and @haampie, we think that LLVM is not aware of us aligning arrays to 16 bytes. This results in less efficient code in SIMDable loops over such arrays, using (in the case I observed) lots of unaligned loads & stores, even for properly SIMD-aligned views into such arrays.

I don't have a small MWE at hand, sadly enough, but I'll try to get one. @gbaraldi mentioned that we should be able to teach LLVM about our alignment, hence this issue. What spurred the discussion was me trying to figure out why there were lots of `vinsertps` instructions in the code, indicating lots of misaligned accesses.