
Incorporate device/strides/offsets into indexing pipeline #94

Closed
wants to merge 4 commits into from

Conversation


@Tokazama Tokazama commented Dec 5, 2020

The goal of this is to finally integrate the indexing pipeline with a lot of the contributions from @chriselrod to this package. This is far from ready but I could use some input before moving forward with some things. I'm mainly concerned about how to optimize iterating over linear indices and utilize CPUPointer in a generic way. I'll clarify by commenting on the code in this PR.

src/indexing.jl Outdated
Comment on lines 543 to 551
function unsafe_get_collection(::CPUPointer, src, inds; kwargs...)
sz = static_length.(inds)
return unsafe_get_collection_by_pointer( # FIXME not an actual function
allocate_memory(src, sz), # FIXME not an actual function
OptionallyStaticUnitRange(One(), prod(sz)),
pointer(src),
@inbounds(view(LinearIndices(src), inds...)) # FIXME not ideal iterator
)
end
Member Author

Should we simply call unsafe_get_collection_by_pointer(pointer(src), src, inds) here? I'd like to get as much of the indexing defined generically here as possible, and then maybe LoopVectorization could specialize based on pointer type. It would be nice to at least get this working with Array, but perhaps optimizing that is too much for this package?

Collaborator

You mean should the yet-to-be-implemented unsafe_get_collection_by_pointer function just take those as arguments? That's fair, it could call static_length and allocate the memory itself. If sz isa StaticInt, it'd be convenient to know the size statically, so we may still want to pass it as an argument at some point (i.e., define another function to call) to get the parameters of the OptionallyStaticUnitRange.

Also, I'd always match a pointer(src) with a GC.@preserve src.
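
Concretely, something like this (just a sketch; unsafe_get_collection_by_pointer is still a name to be defined):

function unsafe_get_collection(::CPUPointer, src, inds; kwargs...)
    # Keep `src` rooted for as long as its raw pointer is in use.
    GC.@preserve src begin
        return unsafe_get_collection_by_pointer(pointer(src), src, inds)
    end
end

That way the callee can call static_length and allocate the memory itself, and the pointer never outlives the GC.@preserve block.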

My plan is to have LoopVectorization automate the optimization on this stuff, so PaddedMatrices (for example) can use simpler optimized definitions using LoopVectorization.

Member Author

Also, I'd always match a pointer(src) with a GC.@preserve src.

Good point

My plan is to have LoopVectorization automate the optimization on this stuff, so PaddedMatrices (for example) can use simpler optimized definitions using LoopVectorization.

That sounds great! I'm just trying to tie together these different components of the interface into a cohesive product. Hopefully that will make it easier to get other packages integrated with this interface.

Perhaps a better approach would be something like:

function unsafe_get_collection(::CPUPointer, A, inds)
    axs = to_axes(A, inds)
    src = pointer(A)
    len = prod(map(static_length, inds))
    dst = similar_pointer(src, len)
    # Keep `A` rooted while we work through raw pointers.
    GC.@preserve A unsafe_copyto_index!(dst, One():len, src, tdot(A, inds))
    return unsafe_reconstruct(A, dst; axes=axs)
end

@inline function unsafe_copyto_index!(dst, dst_itr, src, src_itr)
    dst_i = iterate(dst_itr)
    src_i = iterate(src_itr)
    while dst_i !== nothing
        unsafe_store!(dst, unsafe_load(src, first(src_i)), first(dst_i))
        dst_i = iterate(dst_itr, last(dst_i))
        src_i = iterate(src_itr, last(src_i))
    end
    return nothing
end

function similar_pointer(x, sz)
    # sizeof(eltype(x)) is already in bytes, so no extra factor is needed.
    return Base.unsafe_convert(Ptr{eltype(x)}, Libc.malloc(sizeof(eltype(x)) * Int(sz)))
end

This uses VectorizationBase.tdot to tie this all together. It looks like there's a good amount of code necessary to support it, but it seems like the best way to provide support for array types in Base that return CPUPointer. I'd also need to figure out how to support arrays that have non-bits-type elements (e.g., Array{Any}), but then LoopVectorization could use LLVM magic in its own unsafe_copyto_index! method for strided pointers.
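
For the non-bits case, one option (sketch only; pointer_get_collection is a placeholder for the pointer-based body above) would be to guard the raw-pointer path and fall back to the CPUIndex-style method otherwise:

function unsafe_get_collection(::CPUPointer, A, inds)
    # unsafe_load/unsafe_store! only make sense for bits-type elements, so
    # anything else (e.g. Array{Any}) falls back to the generic index path.
    if isbitstype(eltype(A))
        return pointer_get_collection(A, inds)              # placeholder for the pointer-based body above
    else
        return unsafe_get_collection(CPUIndex(), A, inds)
    end
end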

Alternatively, we could push off supporting any CPUPointer indexing to the vectorization packages. I think there's a huge advantage to having something that ties these traits together right out of the gate. However, if this minor convenience makes it harder for you to maintain LoopVectorization and friends then I'm totally fine with focusing on generic CPUPointer support elsewhere.

Collaborator

VectorizationBase.tdot is much more complicated than you need here.
I'm sure with some work it'd be possible to simplify that implementation while keeping its functionality intact, but we don't need most of what it's meant to handle here.
That complexity stems from trying to make sure LLVM generates good pointer addressing code.
Basically, this is an AVX512 matmul kernel LoopVectorization generates:

L928:
        vmovupd zmm29, zmmword ptr [rbp]
        vmovupd zmm28, zmmword ptr [rbp + 64]
        vmovupd zmm27, zmmword ptr [rbp + 128]
        prefetcht0      byte ptr [rbp + r12 - 128]
        vbroadcastsd    zmm30, qword ptr [rcx - 8]
        prefetcht0      byte ptr [rbp + r12 - 64]
        prefetcht0      byte ptr [rbp + r12]
        vfmadd231pd     zmm0, zmm29, zmm30      # zmm0 = (zmm29 * zmm30) + zmm0
        vfmadd231pd     zmm1, zmm28, zmm30      # zmm1 = (zmm28 * zmm30) + zmm1
        vbroadcastsd    zmm31, qword ptr [rbx + rsi]
        vfmadd231pd     zmm2, zmm27, zmm30      # zmm2 = (zmm27 * zmm30) + zmm2
        vfmadd231pd     zmm3, zmm29, zmm31      # zmm3 = (zmm29 * zmm31) + zmm3
        vfmadd231pd     zmm4, zmm28, zmm31      # zmm4 = (zmm28 * zmm31) + zmm4
        vfmadd231pd     zmm5, zmm27, zmm31      # zmm5 = (zmm27 * zmm31) + zmm5
        vbroadcastsd    zmm30, qword ptr [rbx + 2*rsi]
        vfmadd231pd     zmm6, zmm29, zmm30      # zmm6 = (zmm29 * zmm30) + zmm6
        vfmadd231pd     zmm7, zmm28, zmm30      # zmm7 = (zmm28 * zmm30) + zmm7
        vfmadd231pd     zmm8, zmm27, zmm30      # zmm8 = (zmm27 * zmm30) + zmm8
        vbroadcastsd    zmm30, qword ptr [rbx + r15]
        vfmadd231pd     zmm9, zmm29, zmm30      # zmm9 = (zmm29 * zmm30) + zmm9
        vfmadd231pd     zmm10, zmm28, zmm30     # zmm10 = (zmm28 * zmm30) + zmm10
        vfmadd231pd     zmm11, zmm27, zmm30     # zmm11 = (zmm27 * zmm30) + zmm11
        vbroadcastsd    zmm30, qword ptr [rbx + 4*rsi]
        vfmadd231pd     zmm12, zmm29, zmm30     # zmm12 = (zmm29 * zmm30) + zmm12
        vfmadd231pd     zmm13, zmm28, zmm30     # zmm13 = (zmm28 * zmm30) + zmm13
        vfmadd231pd     zmm14, zmm27, zmm30     # zmm14 = (zmm27 * zmm30) + zmm14
        vbroadcastsd    zmm30, qword ptr [rbx + r14]
        vfmadd231pd     zmm15, zmm29, zmm30     # zmm15 = (zmm29 * zmm30) + zmm15
        vfmadd231pd     zmm16, zmm28, zmm30     # zmm16 = (zmm28 * zmm30) + zmm16
        vfmadd231pd     zmm17, zmm27, zmm30     # zmm17 = (zmm27 * zmm30) + zmm17
        vbroadcastsd    zmm30, qword ptr [rbx + 2*r15]
        vfmadd231pd     zmm18, zmm29, zmm30     # zmm18 = (zmm29 * zmm30) + zmm18
        vfmadd231pd     zmm19, zmm28, zmm30     # zmm19 = (zmm28 * zmm30) + zmm19
        vbroadcastsd    zmm31, qword ptr [rbx + rax]
        vfmadd231pd     zmm20, zmm27, zmm30     # zmm20 = (zmm27 * zmm30) + zmm20
        vfmadd231pd     zmm21, zmm29, zmm31     # zmm21 = (zmm29 * zmm31) + zmm21
        vfmadd231pd     zmm22, zmm28, zmm31     # zmm22 = (zmm28 * zmm31) + zmm22
        vfmadd231pd     zmm23, zmm27, zmm31     # zmm23 = (zmm27 * zmm31) + zmm23
        vbroadcastsd    zmm30, qword ptr [rcx + 8*rsi - 8]
        vfmadd231pd     zmm24, zmm30, zmm29     # zmm24 = (zmm30 * zmm29) + zmm24
        vfmadd231pd     zmm25, zmm30, zmm28     # zmm25 = (zmm30 * zmm28) + zmm25
        vfmadd231pd     zmm26, zmm27, zmm30     # zmm26 = (zmm27 * zmm30) + zmm26
        add     rbp, r13
        mov     rbx, rcx
        add     rcx, 8
        cmp     rbp, rdx
        jbe     L928

It is not perfect, but much better than it was half a year ago. Focusing on just the broadcasts:

        vbroadcastsd    zmm30, qword ptr [rcx - 8]
        vbroadcastsd    zmm31, qword ptr [rbx + rsi]
        vbroadcastsd    zmm30, qword ptr [rbx + 2*rsi]
        vbroadcastsd    zmm30, qword ptr [rbx + r15]
        vbroadcastsd    zmm30, qword ptr [rbx + 4*rsi]
        vbroadcastsd    zmm30, qword ptr [rbx + r14]
        vbroadcastsd    zmm30, qword ptr [rbx + 2*r15]
        vbroadcastsd    zmm31, qword ptr [rbx + rax]
        vbroadcastsd    zmm30, qword ptr [rcx + 8*rsi - 8]

These load 1 scalar from the memory location in brackets, e.g. from the pointer rcx minus 8 ([rcx - 8]), and copy it across a SIMD register.
The memory addressing lets us do a fairly complicated expression:
a * x + c + y, where a and c have to be constants (literal integers), while x and y can be integers/pointers. a is only allowed to be 0, 1, 2, 4, or 8.
These loads are across strides of an array, i.e.
B[i,j], B[i,j+1], B[i,j+2], B[i,j+3], ...

So basically, y is a pointer to B[i,j], and x is the stride in bytes.
This lets us load B[i,j+1] (a = 1), B[i,j+2] (a = 2), B[i,j+4] (a = 4), and B[i,j+8] (a = 8).
Then I also have to precalculate values equal to 3x, 5x, and 7x the stride, getting
B[i,j+3] (3x stride, a = 1), B[i,j+6] (3x stride, a = 2), and then the +5 and +7 with a = 1.
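
To make that concrete (illustration only, with a made-up byte stride):

# With a base pointer for B[i,j] and byte stride s between columns, every
# offset from column j to j+8 can be written as scale*index with scale in
# (1, 2, 4, 8), as long as 3s, 5s, and 7s are precomputed into registers.
s = 800                          # hypothetical: 100 Float64 rows per column
s3, s5, s7 = 3s, 5s, 7s          # the precomputed multiples
offsets = (0, 1*s, 2*s, 1*s3, 4*s, 1*s5, 2*s3, 1*s7, 8*s)  # columns j .. j+8
@assert collect(offsets) == [(k - 1) * s for k in 1:9]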

LLVM will not automatically do anything like this, so the code gen needs to be heavy-handed in coercing it into this form.
Also, the code is imperfect, because for some reason there's both rcx and rbx, where rcx == rbx + 8, so it subtracts 8 every time it uses rcx instead of rbx. I have no idea why the compiler would think that's a good idea, but it means we end up with

        mov     rbx, rcx
        add     rcx, 8

when all we should have is

        add     rbx, 8

and replace all the rcx - 8s with rbx. But until I figure out how to convince LLVM to do that, I'll live with it.

Also, something else it has to do is avoid promoting the MM type (which corresponds to vector loads/stores) into a Vec type (which corresponds to slow gather/scatters), unless that is what the indexing requires.

But all this stuff, which is where the complexity comes from, isn't really necessary here.

I think there's a huge advantage to having something that ties these traits together right out of the gate.

I agree. I don't think people should have to depend on LoopVectorization for support (my goal is just to make people want to ;) ).
I think it'd be good to provide an efficient default, but make it possible for people to opt into an alternative method.

# TODO
function unsafe_get_collection_by_index(::IndexLinear, src, inds; kwargs...)
dst = similar(src, Base.index_shape(inds...))
src_iter = @inbounds(view(LinearIndices(src), inds...)) # FIXME not ideal iterator
Member Author
@Tokazama Tokazama Dec 5, 2020

I assume that the default Base._generate_unsafe_getindex!_body isn't as efficient as it could be because it uses Cartesian, APL-style indexing. I figure we could take advantage of arrays where we know the indexing style is linear. However, I'm not aware of an iterator that is ideal for this right now.

Collaborator

It's probably pretty good.
Normally, loads are a lot more important for performance than stores. That's because lots of subsequent operations tend to be dependent on loads, but that wouldn't be the case for a store unless it's immediately reloaded. Meaning an out-of-order CPU normally won't have to wait on a store before executing future instructions.

But here, basically all the CPU is doing is loading and storing, so there isn't any other work the CPU can do instead.
And most CPUs can perform twice as many loads per cycle as stores.
Meaning the stores are probably more important here.

And presumably dst is getting allocated column-major, so this order is pretty good for it. And if we have basically anything other than UnitRanges in I::Vararg{Union{Real, AbstractArray}, N}, the access pattern on src is going to require slow gather instructions anyway.

Member Author

So the base implementation is probably as good as we can do for CPUIndex?

Collaborator

Yeah, it's definitely most of the way there.

@chriselrod
Collaborator

It'd also be good to get the batch integrated as well.

@Tokazama
Member Author

Tokazama commented Dec 5, 2020

Also, I want to make sure I'm using CheckParent correctly. Is it intended as an indicator that data is stored in a parent type and that information about retrieving said data (indexing, pointers, etc.) is entirely dependent on the parent type?

For example, if I were simply attaching some random information to an array, I would probably want device to return CheckParent; but if the wrapper transposes a matrix, that wrapper is critical to how the data is accessed.

@chriselrod
Collaborator

Also, I want to make sure I'm using CheckParent correctly. Is it intended as an indicator that data is stored in a parent type and that information about retrieving said data (indexing, pointers, etc.) is entirely dependent on the parent type?

Yes.
Here is where I'm using CheckParent currently.

For example, if I were simply attaching some random information to an array, I would probably want device to return CheckParent

Yes, you'd probably want it to return CheckParent, because presumably the means of accessing the data are dependent on the parent, e.g. whether it's on the GPU, you can get a pointer, or w/e.

I'm not entirely sure what the point of CheckParent is at the moment, vs the fallback definition of device, which already uses parent_type...

I think we should probably remove CheckParent. Probably not worth going to 3.0 over, but we could deprecate it. I'd be surprised if anyone other than me was using it.

but if the wrapper transposes a matrix, that wrapper is critical to how the data is accessed.

device is about the means of accessing data, which isn't changed by adjoint/transpose wrappers.
stride_rank would be about access pattern/order, and it is changed by adjoint/transpose wrappers.
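
Roughly (a quick sketch; the exact printed values depend on the ArrayInterface version):

using ArrayInterface, LinearAlgebra

A = rand(3, 4)
ArrayInterface.device(A)        # CPUPointer() for a plain Array
ArrayInterface.device(A')       # same device; the adjoint wrapper doesn't change it
ArrayInterface.stride_rank(A)   # first dimension is the contiguous one, e.g. (1, 2)
ArrayInterface.stride_rank(A')  # order reversed by the wrapper, e.g. (2, 1)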

@Tokazama
Member Author

Tokazama commented Dec 5, 2020

device is about the means of accessing data, which isn't changed by adjoint/transpose wrappers. stride_rank would be about access pattern/order, and it is changed by adjoint/transpose wrappers.

The distinction I was trying to draw was that if it is CheckParent, then we don't want to act directly on that structure but instead unwrap it. So if I had something like

struct FriendlyArray{T,N,P} <: AbstractArray{T,N}
    greeting::String
    parent::P
end
# outer constructor so T and N can be recovered from the parent
FriendlyArray(greeting::String, parent::AbstractArray{T,N}) where {T,N} =
    FriendlyArray{T,N,typeof(parent)}(greeting, parent)

unsafe_reconstruct(x::FriendlyArray, data) = FriendlyArray(x.greeting, data)

this would preserve the non-array-specific info, as opposed to Transpose, where we don't care to preserve the wrapper because the new array isn't transposed anymore.

But now that I've typed that all out I think it would be more straightforward to just let people define their own unsafe_get_collection(::FriendlyArray, inds)?
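
Something like this, say (just a sketch; the two-argument signature here mirrors the form mentioned above):

# Hypothetical opt-in: FriendlyArray handles collection itself, delegating to
# its parent and re-wrapping the result so the greeting survives indexing.
function ArrayInterface.unsafe_get_collection(x::FriendlyArray, inds)
    data = ArrayInterface.unsafe_get_collection(x.parent, inds)
    return FriendlyArray(x.greeting, data)
end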

@chriselrod
Collaborator

as opposed to Transpose, where we don't care to preserve the wrapper because the new array isn't transposed anymore.

I don't think I follow here.
The new array being allocated won't be transposed, but we need to keep the wrapper on the array being transposed, because the wrapper changes behavior, meaning our results would otherwise be wrong?

@Tokazama
Member Author

Tokazama commented Dec 5, 2020

Yes. Because if we simply went to the parent of Transpose then strides would be in the incorrect order.

I was just trying to reason about CheckParent when there is a parent array but we don't want to simply jump to the parent array for indexing; but I think the traits in "stridelayout.jl" take care of this, and if an array does something completely out there it should just define its own unsafe_get_collection.

@chriselrod
Collaborator

chriselrod commented Dec 5, 2020

CheckParent was about checking the parent for device info, not for any other information.
You'd look to the transposed (child) array for stride information.

…l array internal

This allows using the internal `_contiguous_batch_size` method for indexing without actually creating a SubArray
The goal here was to create a minimally optimized representation of array
indexing that is agnostic to the hardware it's on. Once we get to
`unsafe_get_collection(::CPUPointer,...)` we
1. Create an `IndexMap` for each indexing argument
2. Identify slice indexing args that map contiguous strides and combine
   them (eliminating unnecessary iterations across loops and stride
   calculations)
3. Combine indexing arguments that result in dropped dimensions (e.g.,
   the last argument of `A[:,1]` drops a dimension). These are computed
   at the head of each indexing call so that `((index - offset) *
   stride)` isn't recomputed on each iteration.
4. Parse into `Expr` and construct the loop

Right now this is slower than Base, requires a bunch of extra code for
parsing into `Expr`, and is buggy. So it's more of a concept than a
full-featured implementation right now.
@Tokazama
Member Author

My latest commit message describes what I'm trying to do here but doesn't provide any examples. So here are some examples of code generated with this:

  • ArrayInterface.getindex(A, :, :) - fuse contiguous loops
quote
    var"##inds_(1, 2)#266" = 0:length(inds[1]) * length(inds[1]) - 1
    for var"##267" = var"##inds_(1, 2)#266"
        var"##strided_1#268" = var"##267"
        dst_i === nothing && break
        unsafe_copyto!(dst + first(dst_i) * 8, src + var"##strided_1#268" * 8, 1)
        dst_i = iterate(dst_itr, last(dst_i))
    end
end
  • ArrayInterface.getindex(A, :, 1) - compute stride from last index outside of loop
quote
    var"##inds_(1,)#269" = 0:length(inds[1]) - 1
    var"##strided_2#273" = (inds[2] - ArrayInterface.StaticInt{1}()) * s[2]
    for var"##270" = var"##inds_(1,)#269"
        var"##strided_1#272" = var"##270" + var"##strided_2#273"
        dst_i === nothing && break
        unsafe_copyto!(dst + first(dst_i) * 8, src + var"##strided_1#272" * 8, 1)
        dst_i = iterate(dst_itr, last(dst_i))
    end
end

I lurked around the "vectorization" package family quite a bit over the past week and I'm fully aware this isn't anywhere near fully optimized. I mostly thought it was interesting to see how far I could take IndexMap and the other traits here toward generating code. That being said, I'm not sure it makes sense to fully implement code gen here given how involved it will be. I really wanted to get this fully working with arrays in the standard library, but it seems like having something like Array work with this completely requires enough work that you may as well just create a new array type and a dedicated package (which is what you're doing elsewhere anyway).

@Tokazama
Member Author

Closing in favor of #141

@Tokazama Tokazama closed this Apr 11, 2021