Make sure LLVM knows our allocation alignment #48139
I just tried this to manually align to 32 bytes for AVX2 (is reinterpret + view too difficult for the compiler?):

```julia
function aligned_32_simd_loop_avx2(n)
    # over-allocate so a 32-byte boundary is always reachable
    xs = zeros(UInt8, sizeof(Float64) * n + 31)
    p = UInt(pointer(xs))
    # byte offset from the start of xs to the next 32-byte boundary
    offset = ((p + 31) & -32) - p
    ys = reinterpret(Float64, view(xs, 1+offset:(offset + n * sizeof(Float64))))
    @simd for i in Base.OneTo(n)
        @inbounds ys[i] += 2.0 # uses vmovupd not vmovapd
    end
    return ys
end
```
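For what it's worth, a standard way to double-check which move instructions the loop actually compiles to (plain introspection, not anything specific to this issue) is:

```julia
using InteractiveUtils  # for @code_native outside the REPL

# Look for vmovapd (aligned) vs vmovupd (unaligned) in the vector body.
@code_native aligned_32_simd_loop_avx2(1024)
```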
There's not even a need for

```asm
.LBB0_8:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
; ││ @ simdloop.jl:77 within `macro expansion` @ REPL[11]:8
; ││┌ @ int.jl:87 within `+`
	vpaddb	zmm1, zmm0, zmmword ptr [rdx + rsi - 192]
	vpaddb	zmm2, zmm0, zmmword ptr [rdx + rsi - 128]
	vpaddb	zmm3, zmm0, zmmword ptr [rdx + rsi - 64]
	vpaddb	zmm4, zmm0, zmmword ptr [rdx + rsi]
; ││└
; ││┌ @ subarray.jl:351 within `setindex!` @ array.jl:969
	vmovdqu64	zmmword ptr [rdx + rsi - 192], zmm1
	vmovdqu64	zmmword ptr [rdx + rsi - 128], zmm2
	vmovdqu64	zmmword ptr [rdx + rsi - 64], zmm3
	vmovdqu64	zmmword ptr [rdx + rsi], zmm4
; ││└
; ││ @ simdloop.jl:78 within `macro expansion`
; ││┌ @ int.jl:87 within `+`
	add	rsi, 256
	cmp	rdi, rsi
	jne	.LBB0_8
```
```julia
julia> x = rand(Float64, 120);

julia> pushfirst!(x, 0.0);

julia> Int(pointer(x)) % 16
8

julia> reshape(x, (11,11)) |> typeof
Matrix{Float64} (alias for Array{Float64, 2})

julia> Int(pointer(reshape(x, (11,11)))) % 16
8

julia> x = rand(Float64, 122);

julia> popfirst!(x);

julia> Int(pointer(x)) % 16
8

julia> reshape(x, (11,11)) |> typeof
Matrix{Float64} (alias for Array{Float64, 2})

julia> Int(pointer(reshape(x, (11,11)))) % 16
8
```

So we can't assume 16-byte alignment once the data pointer has been shifted by `pushfirst!`/`popfirst!`. Aligned loads/stores may be faster, but on recent architectures the aligned and unaligned instructions perform the same when the address is in fact aligned.
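A minimal sketch of the same observation, using a made-up `ptr_alignment` helper (not part of Base), just to make the pointer shift visible:

```julia
# Hypothetical helper: largest power-of-two alignment of an array's data pointer.
ptr_alignment(x::Array) = 1 << trailing_zeros(UInt(pointer(x)))

x = rand(Float64, 120)
ptr_alignment(x)    # a fresh allocation is typically 16-byte aligned or better
pushfirst!(x, 0.0)
ptr_alignment(x)    # the data pointer shifts by 8 bytes, so this can drop to 8
```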
Arrays are not the only objects with alignment constraints.
From a discussion on Slack with @gbaraldi and @haampie, we think that LLVM is not aware of us aligning arrays to 16 bytes. This results in less efficient code in SIMDable loops over such arrays, using (in the case I observed) lots of unaligned loads & stores, even for properly SIMD-aligned views into such arrays.

I don't have a small MWE at hand, sadly enough, but I'll try to get one. @gbaraldi mentioned that we should be able to teach LLVM about our alignment, hence this issue. What spurred the discussion was me trying to figure out why there were lots of `vinsertps` instructions in the code, indicating lots of misaligned accesses.