Make sure LLVM knows our allocation alignment #48139

Open
Seelengrab opened this issue Jan 5, 2023 · 6 comments

@Seelengrab
Contributor

From a discussion on Slack with @gbaraldi and @haampie, we think that LLVM is not aware that we align arrays to 16 bytes. This results in less efficient code in SIMDable loops over such arrays, using (in the case I observed) lots of unaligned loads and stores, even for properly SIMD-aligned views into such arrays.

I don't have a small MWE at hand, sadly enough, but I'll try to get one. @gbaraldi mentioned that we should be able to teach LLVM about our alignment, hence this issue. What spurred the discussion was me trying to figure out why there were lots of vinsertps instructions in the code, which suggests a lot of misaligned accesses.

@haampie
Contributor

haampie commented Jan 5, 2023

I just tried this to manually align to 32 bytes for AVX2 (are reinterpret + view too difficult for the compiler?)

function aligned_32_simd_loop_avx2(n)
  # Over-allocate by 31 bytes so that a 32-byte-aligned start always fits.
  xs = zeros(UInt8, sizeof(Float64) * n + 31)
  p = UInt(pointer(xs))
  # Byte offset (0 to 31) from the start of xs to the next 32-byte boundary.
  offset = Int(((p + 31) & -32) - p)
  ys = reinterpret(Float64, view(xs, 1+offset:(offset + n * sizeof(Float64))))

  @simd for i in Base.OneTo(n)
    @inbounds ys[i] += 2.0 # uses vmovupd not vmovapd
  end

  return ys
end

@Seelengrab
Contributor Author

There's not even a need for reinterpret - having just the view and adding 0x2 also produces unaligned instructions:

.LBB0_8:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
; ││ @ simdloop.jl:77 within `macro expansion` @ REPL[11]:8
; ││┌ @ int.jl:87 within `+`
	vpaddb	zmm1, zmm0, zmmword ptr [rdx + rsi - 192]
	vpaddb	zmm2, zmm0, zmmword ptr [rdx + rsi - 128]
	vpaddb	zmm3, zmm0, zmmword ptr [rdx + rsi - 64]
	vpaddb	zmm4, zmm0, zmmword ptr [rdx + rsi]
; ││└
; ││┌ @ subarray.jl:351 within `setindex!` @ array.jl:969
	vmovdqu64	zmmword ptr [rdx + rsi - 192], zmm1
	vmovdqu64	zmmword ptr [rdx + rsi - 128], zmm2
	vmovdqu64	zmmword ptr [rdx + rsi - 64], zmm3
	vmovdqu64	zmmword ptr [rdx + rsi], zmm4
; ││└
; ││ @ simdloop.jl:78 within `macro expansion`
; ││┌ @ int.jl:87 within `+`
	add	rsi, 256
	cmp	rdi, rsi
	jne	.LBB0_8

(vmovdqu64 instead of vmovdqa64)
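For anyone who wants to try this locally, here is a minimal sketch of a loop of that shape. The function name add_two!, the array size, and the view offset are hypothetical and not the code from the REPL session above; the exact instructions emitted will depend on the CPU and the Julia version.

```julia
using InteractiveUtils  # provides code_native outside the REPL

# Hypothetical reproducer: add 0x2 to every element of a view.
function add_two!(v)
    @simd for i in eachindex(v)
        @inbounds v[i] += 0x2
    end
    return v
end

xs = zeros(UInt8, 1024)
# Start the view 64 bytes into the parent array. If the parent data happens
# to be 64-byte aligned, the view is too, but the compiler cannot see that.
v = view(xs, 65:1024)

# Print the generated assembly for this argument type.
code_native(add_two!, (typeof(v),); debuginfo=:none)
```

Looking for vmovdqu versus vmovdqa in the printed vector.body loop shows whether unaligned or aligned moves were chosen.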

@gbaraldi
Member

gbaraldi commented Jan 5, 2023

Previous attempts at this: #21959 and #22649

@chriselrod
Contributor

chriselrod commented Apr 26, 2023

julia> x = rand(Float64, 120);

julia> pushfirst!(x,0.0);

julia> Int(pointer(x))%16
8

julia> reshape(x, (11,11)) |> typeof
Matrix{Float64} (alias for Array{Float64, 2})

julia> Int(pointer(reshape(x, (11,11))))%16
8

julia> x = rand(Float64, 122);

julia> popfirst!(x);

julia> Int(pointer(x))%16
8

julia> reshape(x, (11,11)) |> typeof
Matrix{Float64} (alias for Array{Float64, 2})

julia> Int(pointer(reshape(x, (11,11))))%16
8

We can't assume Arrays are 16-byte aligned.
pushfirst! and popfirst! both shift the base pointer.

Aligned loads/stores may be faster, but on recent architectures a vmovupd on an aligned address isn't any slower than a vmovapd. The difference shows up on unaligned addresses: there vmovupd is slower, while vmovapd crashes your program.

@Seelengrab
Contributor Author

Arrays are not the only objects with alignment. The shift on pushfirst!/popfirst! also depends on the eltype, and emitting unaligned loads/stores only for the cases that actually need them seems better. Do you have a benchmark for vmovupd vs. vmovapd at hand?
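To see how the base pointer moves for different element types, something like the following sketch can be used. The helper name alignment_mod16 is made up for illustration, and the printed values depend on the allocator and the Julia version, so no particular output is assumed.

```julia
# Hypothetical helper: the array's data pointer modulo 16.
alignment_mod16(x) = Int(UInt(pointer(x)) % 16)

for T in (UInt8, Float32, Float64)
    x = zeros(T, 120)
    before = alignment_mod16(x)
    pushfirst!(x, zero(T))   # may shift the base pointer
    after = alignment_mod16(x)
    println(T, ": ", before, " -> ", after)
end
```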

@chriselrod
Contributor

chriselrod commented Apr 26, 2023

zmm vmovupd load
zmm vmovapd load
zmm vmovupd store
zmm vmovapd store
You can look up instructions here: uops.info table


4 participants