Incorporate device/strides/offsets into indexing pipeline #94
Conversation
src/indexing.jl
Outdated
function unsafe_get_collection(::CPUPointer, src, inds; kwargs...)
    sz = static_length.(inds)
    return unsafe_get_collection_by_pointer( # FIXME not an actual function
        allocate_memory(src, sz), # FIXME not an actual function
        OptionallyStaticUnitRange(One(), prod(sz)),
        pointer(src),
        @inbounds(view(LinearIndices(src), inds...)) # FIXME not ideal iterator
    )
end
Should we simply call unsafe_get_collection_by_pointer(pointer(src), src, inds) here? I'd like to get as much of the indexing defined generically here as possible, and then maybe LoopVectorization could specialize based on pointer type. It would be nice to at least get this working with Array, but perhaps optimizing that is too much for this package?
You mean should the yet-to-be-implemented unsafe_get_collection_by_pointer function just take those as arguments? That's fair, it could call static_length and allocate memory itself. If sz isa StaticInt, it'd be convenient to know the size, so we may want to pass it as an arg at some point (i.e., define another function to call) to get the parameters of the OptionallyStaticUnitRange.
Also, I'd always match a pointer(src) with a GC.@preserve src.
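As a minimal sketch of that pairing (a standalone toy example, not code from this PR): any use of pointer(src) sits inside a GC.@preserve src block so the array cannot be garbage collected while the raw pointer is in use.
function first_element_via_pointer(src::Vector{Float64})
    GC.@preserve src begin
        p = pointer(src)   # only valid while `src` is preserved
        unsafe_load(p, 1)  # read the first element through the raw pointer
    end
end

first_element_via_pointer([1.0, 2.0, 3.0])  # returns 1.0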
My plan is to have LoopVectorization automate the optimization on this stuff, so PaddedMatrices (for example) can use simpler optimized definitions using LoopVectorization.
Also, I'd always match a pointer(src) with a GC.@preserve src.
Good point
My plan is to have LoopVectorization automate the optimization on this stuff, so PaddedMatrices (for example) can use simpler optimized definitions using LoopVectorization.
That sounds great! I'm just trying to tie together these different components of the interface into a cohesive product. Hopefully that will make it easier to get other packages integrated with this interface.
Perhaps a better approach would be something like:
function unsafe_get_collection(::CPUPointer, A, inds)
    axs = to_axes(A, inds)
    src = pointer(A)
    len = prod(map(static_length, inds))
    dst = similar_pointer(src, len)
    # keep A rooted while we work through raw pointers
    GC.@preserve A unsafe_copyto_index!(dst, One():len, src, tdot(A, inds))
    return unsafe_reconstruct(A, dst; axes=axs)
end

@inline function unsafe_copyto_index!(dst, dst_itr, src, src_itr)
    dst_i = iterate(dst_itr)
    src_i = iterate(src_itr)
    while dst_i !== nothing
        unsafe_store!(dst, unsafe_load(src, first(src_i)), first(dst_i))
        dst_i = iterate(dst_itr, last(dst_i))
        src_i = iterate(src_itr, last(src_i))
    end
end

function similar_pointer(x, sz)
    # allocate an uninitialized buffer of `sz` elements matching eltype(x);
    # the caller owns (and eventually frees or wraps) this memory
    return Base.unsafe_convert(Ptr{eltype(x)}, Libc.malloc(sizeof(eltype(x)) * Int(sz)))
end
This uses VectorizationBase.tdot to tie this all together. It looks like there's a good amount of code necessary to support it, but it seems like the best way to provide support for array types in Base that return CPUPointer. I'd also need to figure out how to support arrays that have non-bits-type elements (e.g., Array{Any}), but then LoopVectorization could use LLVM magic in its own unsafe_copyto_index! method for strided pointers.
Alternatively, we could push off supporting any CPUPointer indexing to the vectorization packages. I think there's a huge advantage to having something that ties these traits together right out of the gate. However, if this minor convenience makes it harder for you to maintain LoopVectorization and friends, then I'm totally fine with focusing on generic CPUPointer support elsewhere.
VectorizationBase.tdot is much more complicated than you need here.
I'm sure with some work it'd be possible to simplify that implementation while keeping its functionality intact, but we don't need most of what it's meant to handle here. That complexity stems from trying to make sure LLVM generates good pointer-addressing code.
Basically, this is an AVX512 matmul kernel LoopVectorization generates:
L928:
vmovupd zmm29, zmmword ptr [rbp]
vmovupd zmm28, zmmword ptr [rbp + 64]
vmovupd zmm27, zmmword ptr [rbp + 128]
prefetcht0 byte ptr [rbp + r12 - 128]
vbroadcastsd zmm30, qword ptr [rcx - 8]
prefetcht0 byte ptr [rbp + r12 - 64]
prefetcht0 byte ptr [rbp + r12]
vfmadd231pd zmm0, zmm29, zmm30 # zmm0 = (zmm29 * zmm30) + zmm0
vfmadd231pd zmm1, zmm28, zmm30 # zmm1 = (zmm28 * zmm30) + zmm1
vbroadcastsd zmm31, qword ptr [rbx + rsi]
vfmadd231pd zmm2, zmm27, zmm30 # zmm2 = (zmm27 * zmm30) + zmm2
vfmadd231pd zmm3, zmm29, zmm31 # zmm3 = (zmm29 * zmm31) + zmm3
vfmadd231pd zmm4, zmm28, zmm31 # zmm4 = (zmm28 * zmm31) + zmm4
vfmadd231pd zmm5, zmm27, zmm31 # zmm5 = (zmm27 * zmm31) + zmm5
vbroadcastsd zmm30, qword ptr [rbx + 2*rsi]
vfmadd231pd zmm6, zmm29, zmm30 # zmm6 = (zmm29 * zmm30) + zmm6
vfmadd231pd zmm7, zmm28, zmm30 # zmm7 = (zmm28 * zmm30) + zmm7
vfmadd231pd zmm8, zmm27, zmm30 # zmm8 = (zmm27 * zmm30) + zmm8
vbroadcastsd zmm30, qword ptr [rbx + r15]
vfmadd231pd zmm9, zmm29, zmm30 # zmm9 = (zmm29 * zmm30) + zmm9
vfmadd231pd zmm10, zmm28, zmm30 # zmm10 = (zmm28 * zmm30) + zmm10
vfmadd231pd zmm11, zmm27, zmm30 # zmm11 = (zmm27 * zmm30) + zmm11
vbroadcastsd zmm30, qword ptr [rbx + 4*rsi]
vfmadd231pd zmm12, zmm29, zmm30 # zmm12 = (zmm29 * zmm30) + zmm12
vfmadd231pd zmm13, zmm28, zmm30 # zmm13 = (zmm28 * zmm30) + zmm13
vfmadd231pd zmm14, zmm27, zmm30 # zmm14 = (zmm27 * zmm30) + zmm14
vbroadcastsd zmm30, qword ptr [rbx + r14]
vfmadd231pd zmm15, zmm29, zmm30 # zmm15 = (zmm29 * zmm30) + zmm15
vfmadd231pd zmm16, zmm28, zmm30 # zmm16 = (zmm28 * zmm30) + zmm16
vfmadd231pd zmm17, zmm27, zmm30 # zmm17 = (zmm27 * zmm30) + zmm17
vbroadcastsd zmm30, qword ptr [rbx + 2*r15]
vfmadd231pd zmm18, zmm29, zmm30 # zmm18 = (zmm29 * zmm30) + zmm18
vfmadd231pd zmm19, zmm28, zmm30 # zmm19 = (zmm28 * zmm30) + zmm19
vbroadcastsd zmm31, qword ptr [rbx + rax]
vfmadd231pd zmm20, zmm27, zmm30 # zmm20 = (zmm27 * zmm30) + zmm20
vfmadd231pd zmm21, zmm29, zmm31 # zmm21 = (zmm29 * zmm31) + zmm21
vfmadd231pd zmm22, zmm28, zmm31 # zmm22 = (zmm28 * zmm31) + zmm22
vfmadd231pd zmm23, zmm27, zmm31 # zmm23 = (zmm27 * zmm31) + zmm23
vbroadcastsd zmm30, qword ptr [rcx + 8*rsi - 8]
vfmadd231pd zmm24, zmm30, zmm29 # zmm24 = (zmm30 * zmm29) + zmm24
vfmadd231pd zmm25, zmm30, zmm28 # zmm25 = (zmm30 * zmm28) + zmm25
vfmadd231pd zmm26, zmm27, zmm30 # zmm26 = (zmm27 * zmm30) + zmm26
add rbp, r13
mov rbx, rcx
add rcx, 8
cmp rbp, rdx
jbe L928
It is not perfect, but much better than it was half a year ago. Focusing on just the broadcasts:
vbroadcastsd zmm30, qword ptr [rcx - 8]
vbroadcastsd zmm31, qword ptr [rbx + rsi]
vbroadcastsd zmm30, qword ptr [rbx + 2*rsi]
vbroadcastsd zmm30, qword ptr [rbx + r15]
vbroadcastsd zmm30, qword ptr [rbx + 4*rsi]
vbroadcastsd zmm30, qword ptr [rbx + r14]
vbroadcastsd zmm30, qword ptr [rbx + 2*r15]
vbroadcastsd zmm31, qword ptr [rbx + rax]
vbroadcastsd zmm30, qword ptr [rcx + 8*rsi - 8]
These load one scalar from the memory location in brackets, e.g. from the pointer rcx minus 8 ([rcx - 8]), and copy it across a SIMD register.
The memory addressing lets us do a fairly complicated expression: a * x + c + y, where a and c have to be constants (literal integers), while x and y can be integers/pointers. a is only allowed to be 0, 1, 2, 4, or 8.
These loads step across strides of an array, i.e. B[i,j], B[i,j+1], B[i,j+2], B[i,j+3], ...
So basically, y is a pointer to B[i,j], and x is the stride in bytes. This allows us to load B[i,j+1] (a = 1), B[i,j+2] (a = 2), B[i,j+4] (a = 4), and B[i,j+8] (a = 8) directly. Then I also have to precalculate values equal to 3x the stride, 5x the stride, and 7x the stride, getting B[i,j+3] (3x stride, a = 1), B[i,j+6] (3x stride, a = 2), and then the +5 and +7 loads with a = 1.
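A small numeric illustration of that mapping (example values only, not LoopVectorization output): each column offset k * stride is expressed in the base + scale*index + displacement form with the scale restricted to 1, 2, 4, or 8, using the precomputed odd multiples of the stride where needed.
stride_bytes = 800          # hypothetical byte stride between columns of B
stride3 = 3 * stride_bytes  # precomputed odd multiples of the stride
stride5 = 5 * stride_bytes
stride7 = 7 * stride_bytes

offsets = Dict(  # column k => offset expressed as scale * index
    1 => 1 * stride_bytes, 2 => 2 * stride_bytes, 4 => 4 * stride_bytes, 8 => 8 * stride_bytes,
    3 => 1 * stride3, 6 => 2 * stride3, 5 => 1 * stride5, 7 => 1 * stride7,
)

all(offsets[k] == k * stride_bytes for k in keys(offsets))  # true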
LLVM will not automatically do anything like this, so the code generation has to be heavy-handed in coercing it.
Also, the code is imperfect, because for some reason there's both rcx and rbx, where rcx == rbx + 8, so it subtracts 8 every time it uses rcx instead of rbx. I have no idea why the compiler would think that's a good idea, but it means we end up with
mov rbx, rcx
add rcx, 8
when all we should have is
add rbx, 8
and replace all the rcx - 8s with rbx. But until I figure out how to convince LLVM to do that, I'll live with it.
Also, something else it has to do is avoid promoting the MM type (which corresponds to vector loads/stores) into a Vec type (which corresponds to slow gather/scatters), unless that is what the indexing requires.
But all of this, which is where the complexity comes from, isn't really necessary here.
I think there's a huge advantage to having something that ties these traits together right out of the gate.
I agree. I don't think people should have to depend on LoopVectorization for support (my goal is just to make people want to ;) ).
I think it'd be good to provide an efficient default, but make it possible for people to opt into an alternative method.
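One way that opt-in could look (a self-contained toy with hypothetical names, not the package's actual API): the interface dispatches on a device-style trait, ships a generic default, and a downstream package overrides the method for its own trait type.
abstract type HypotheticalDevice end
struct GenericCPU <: HypotheticalDevice end
struct VectorizedCPU <: HypotheticalDevice end

hypothetical_device(::AbstractArray) = GenericCPU()

# efficient-enough default that every array type gets for free
hypothetical_collect(::GenericCPU, A, inds) = [A[i] for i in inds]

# a vectorization package could opt in by returning its own trait for its
# arrays and specializing this method (stand-in body shown here)
hypothetical_collect(::VectorizedCPU, A, inds) = [A[i] for i in inds]

collect_inds(A, inds) = hypothetical_collect(hypothetical_device(A), A, inds)

collect_inds([10, 20, 30], 1:2)  # == [10, 20]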
# TODO
function unsafe_get_collection_by_index(::IndexLinear, src, inds; kwargs...)
    dst = similar(src, Base.index_shape(inds...))
    src_iter = @inbounds(view(LinearIndices(src), inds...)) # FIXME not ideal iterator
I assume that the default Base._generate_unsafe_getindex!_body isn't as efficient as it could be because it uses Cartesian, APL-style indexing. I figure we could take advantage of arrays where we know the indexing style is linear. However, I'm not aware of an iterator that is ideal for this right now.
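A rough sketch of the kind of linear-index fast path being discussed (the helper name is made up for illustration and is not defined in this PR): translate the indexing arguments into linear indices once via LinearIndices, then fill the destination with a single loop.
function getindex_linear_sketch(src::AbstractArray, inds...)
    lin = view(LinearIndices(src), inds...)  # linear index of each selected element
    dst = similar(src, size(lin))
    i = 0
    @inbounds for l in lin                   # single loop over linear indices
        dst[i += 1] = src[l]
    end
    return dst
end

A = reshape(collect(1:12), 3, 4)
getindex_linear_sketch(A, 1:2, 2:3) == A[1:2, 2:3]  # true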
It's probably pretty good.
Normally, loads are a lot more important for performance than stores. That's because lots of subsequent operations tend to be dependent on loads, but that wouldn't be the case for a store unless it's immediately reloaded. Meaning an out of order CPU normally won't have to wait on a store before executing future instructions.
But here, basically all the CPU is doing is loading and storing, so there isn't any other work the CPU can do instead.
And most CPUs can perform twice as many loads per cycle as stores.
Meaning the stores are probably more important here.
And presumably dest is getting allocated column major, so this order is pretty good for it. And if we have basically anything other than UnitRanges in I::Vararg{Union{Real, AbstractArray}, N}, the access pattern on src is going to require slow gather instructions anyway.
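For a concrete illustration of that last point (ordinary Base indexing, assuming a column-major A):
A = reshape(collect(1.0:12.0), 3, 4)
A[:, 2:3]         # whole contiguous columns: unit-stride loads
A[[3, 1, 2], 2:3] # permuted rows: non-unit access pattern, vectorizable only via gathers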
So the base implementation is probably as good as we can do for CPUIndex?
Yeah, it's definitely most of the way there.
It'd also be good to get the …
Also, I want to make sure I'm using … For example, if I was simply attaching some random information to an array I would probably want to have …
Yes.
Yes, you'd probably want it to return … I'm not entirely sure what the point of … I think we should probably remove …
The distinction I was trying to create was that if it is …
struct FriendlyArray{T,N,P} <: AbstractArray{T,N}
    greeting::String
    parent::P
end

unsafe_reconstruct(x::FriendlyArray, data) = FriendlyArray(x.greeting, data)
this would preserve the non-array-specific info, as opposed to … But now that I've typed that all out I think it would be more straightforward to just let people define their own …
I don't think I follow here.
Yes. Because if we simply went to the parent of … I was just trying to reason about …
…l array internal
This allows using the internal `_contiguous_batch_size` method for indexing without actually creating a SubArray.
The goal here was to create a minimally optimized representation of array indexing that is agnostic to the hardware it's on. Once we get to `unsafe_get_collection(::CPUPointer, ...)` we:
1. Create an `IndexMap` for each indexing argument.
2. Identify slice indexing args that map contiguous strides and combine them (eliminating unnecessary iterations across loops and stride calculations).
3. Combine indexing arguments that result in dropped dimensions (e.g., the last argument of `A[:,1]` drops a dimension). These are computed at the head of each indexing call so that the `((index - offset) * stride)` term isn't computed with each iteration.
4. Parse into an `Expr` and construct the loop.
Right now this is slower than Base, requires a bunch of extra code for parsing into `Expr`, and is buggy. So it's more of a concept than a full-featured implementation right now.
My latest commit message describes what I'm trying to do here but doesn't provide any examples. So here are some examples of code generated with this:
quote
var"##inds_(1, 2)#266" = 0:length(inds[1]) * length(inds[1]) - 1
for var"##267" = var"##inds_(1, 2)#266"
var"##strided_1#268" = var"##267"
dst_i === nothing && break
unsafe_copyto!(dst + first(dst_i) * 8, src + var"##strided_1#268" * 8, 1)
dst_i = iterate(dst_itr, last(dst_i))
end
end
quote
var"##inds_(1,)#269" = 0:length(inds[1]) - 1
var"##strided_2#273" = (inds[2] - ArrayInterface.StaticInt{1}()) * s[2]
for var"##270" = var"##inds_(1,)#269"
var"##strided_1#272" = var"##270" + var"##strided_2#273"
dst_i === nothing && break
unsafe_copyto!(dst + first(dst_i) * 8, src + var"##strided_1#272" * 8, 1)
dst_i = iterate(dst_itr, last(dst_i))
end
end
I lurked around the "vectorization" package family quite a bit over the past week and I'm fully aware this isn't anywhere near fully optimized. I mostly thought it was interesting to see how far I could take …
Closing in favor of #141
The goal of this is to finally integrate the indexing pipeline with a lot of the contributions from @chriselrod to this package. This is far from ready but I could use some input before moving forward with some things. I'm mainly concerned about how to optimize iterating over linear indices and utilize CPUPointer in a generic way. I'll clarify by commenting on the code in this PR.