
Incorporate device/strides/offsets into indexing pipeline #94

Closed
wants to merge 4 commits into from

Conversation


@Tokazama Tokazama commented Dec 5, 2020

The goal of this is to finally integrate the indexing pipeline with a lot of the contributions from @chriselrod to this package. This is far from ready but I could use some input before moving forward with some things. I'm mainly concerned about how to optimize iterating over linear indices and utilize CPUPointer in a generic way. I'll clarify by commenting on the code in this PR.

src/indexing.jl Outdated
Comment on lines 543 to 551
function unsafe_get_collection(::CPUPointer, src, inds; kwargs...)
sz = static_length.(inds)
return unsafe_get_collection_by_pointer( # FIXME not an actual function
allocate_memory(src, sz), # FIXME not an actual function
OptionallyStaticUnitRange(One(), prod(sz)),
pointer(src),
@inbounds(view(LinearIndices(src), inds...)) # FIXME not ideal iterator
)
end
Member Author

Should we simply call unsafe_get_collection_by_pointer(pointer(src), src, inds) here? I'd like to get as much of the indexing defined generically here as possible, and then maybe LoopVectorization could specialize based on pointer type. It would be nice to at least get this working with Array, but perhaps optimizing that is too much for this package?

Collaborator

You mean should the yet-to-be-implemented unsafe_get_collection_by_pointer function just take those as arguments? That's fair, it could call static_length and allocate the memory itself. If sz isa StaticInt, it'd be convenient to know the size statically, so we may still want to pass it as an argument at some point (i.e., define another function to call) to get the parameters of the OptionallyStaticUnitRange.

Also, I'd always match a pointer(src) with a GC.@preserve src.
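
Concretely, something like this (just a sketch; unsafe_get_collection_by_pointer is still a name to be defined):

function unsafe_get_collection(::CPUPointer, src, inds; kwargs...)
    # Keep `src` rooted for as long as its raw pointer is in use.
    GC.@preserve src begin
        return unsafe_get_collection_by_pointer(pointer(src), src, inds)
    end
end

That way the callee can call static_length and allocate the memory itself, and the pointer never outlives the GC.@preserve block.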

My plan is to have LoopVectorization automate the optimization on this stuff, so PaddedMatrices (for example) can use simpler optimized definitions using LoopVectorization.

Member Author

Also, I'd always match a pointer(src) with a GC.@preserve src.

Good point

My plan is to have LoopVectorization automate the optimization on this stuff, so PaddedMatrices (for example) can use simpler optimized definitions using LoopVectorization.

That sounds great! I'm just trying to tie together these different components of the interface into a cohesive product. Hopefully that will make it easier to get other packages integrated with this interface.

Perhaps a better approach would be something like:

function unsafe_get_collection(::CPUPointer, A, inds)
    axs = to_axes(A, inds)
    src = pointer(A)
    len = prod(map(static_length, inds))
    dst = similar_pointer(src, len)
    # Keep `A` rooted while we work through raw pointers.
    GC.@preserve A unsafe_copyto_index!(dst, One():len, src, tdot(A, inds))
    return unsafe_reconstruct(A, dst; axes=axs)
end

@inline function unsafe_copyto_index!(dst, dst_itr, src, src_itr)
    dst_i = iterate(dst_itr)
    src_i = iterate(src_itr)
    while dst_i !== nothing
        unsafe_store!(dst, unsafe_load(src, first(src_i)), first(dst_i))
        dst_i = iterate(dst_itr, last(dst_i))
        src_i = iterate(src_itr, last(src_i))
    end
    return nothing
end

function similar_pointer(x, sz)
    # sizeof(eltype(x)) is already in bytes, so no extra factor is needed.
    return Base.unsafe_convert(Ptr{eltype(x)}, Libc.malloc(sizeof(eltype(x)) * Int(sz)))
end

This uses VectorizationBase.tdot to tie this all together. It looks like there's a good amount of code necessary to support it, but it seems like the best way to provide support for array types in Base that return CPUPointer. I'd also need to figure out how to support arrays that have non-bits-type elements (e.g., Array{Any}), but then LoopVectorization could use LLVM magic in its own unsafe_copyto_index! method for strided pointers.
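
For the non-bits case, one option (sketch only; pointer_get_collection is a placeholder for the pointer-based body above) would be to guard the raw-pointer path and fall back to the CPUIndex-style method otherwise:

function unsafe_get_collection(::CPUPointer, A, inds)
    # unsafe_load/unsafe_store! only make sense for bits-type elements, so
    # anything else (e.g. Array{Any}) falls back to the generic index path.
    if isbitstype(eltype(A))
        return pointer_get_collection(A, inds)              # placeholder for the pointer-based body above
    else
        return unsafe_get_collection(CPUIndex(), A, inds)
    end
end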

Alternatively, we could push off supporting any CPUPointer indexing to the vectorization packages. I think there's a huge advantage to having something that ties these traits together right out of the gate. However, if this minor convenience makes it harder for you to maintain LoopVectorization and friends then I'm totally fine with focusing on generic CPUPointer support elsewhere.

Collaborator

VectorizationBase.tdot is much more complicated than you need here.
I'm sure with some work it'd be possible to simplify that implementation while keeping its functionality intact, but we don't need most of what it's meant to handle here.
That complexity stems from trying to make sure LLVM generates good pointer addressing code.
Basically, this is an AVX512 matmul kernel LoopVectorization generates:

L928:
        vmovupd zmm29, zmmword ptr [rbp]
        vmovupd zmm28, zmmword ptr [rbp + 64]
        vmovupd zmm27, zmmword ptr [rbp + 128]
        prefetcht0      byte ptr [rbp + r12 - 128]
        vbroadcastsd    zmm30, qword ptr [rcx - 8]
        prefetcht0      byte ptr [rbp + r12 - 64]
        prefetcht0      byte ptr [rbp + r12]
        vfmadd231pd     zmm0, zmm29, zmm30      # zmm0 = (zmm29 * zmm30) + zmm0
        vfmadd231pd     zmm1, zmm28, zmm30      # zmm1 = (zmm28 * zmm30) + zmm1
        vbroadcastsd    zmm31, qword ptr [rbx + rsi]
        vfmadd231pd     zmm2, zmm27, zmm30      # zmm2 = (zmm27 * zmm30) + zmm2
        vfmadd231pd     zmm3, zmm29, zmm31      # zmm3 = (zmm29 * zmm31) + zmm3
        vfmadd231pd     zmm4, zmm28, zmm31      # zmm4 = (zmm28 * zmm31) + zmm4
        vfmadd231pd     zmm5, zmm27, zmm31      # zmm5 = (zmm27 * zmm31) + zmm5
        vbroadcastsd    zmm30, qword ptr [rbx + 2*rsi]
        vfmadd231pd     zmm6, zmm29, zmm30      # zmm6 = (zmm29 * zmm30) + zmm6
        vfmadd231pd     zmm7, zmm28, zmm30      # zmm7 = (zmm28 * zmm30) + zmm7
        vfmadd231pd     zmm8, zmm27, zmm30      # zmm8 = (zmm27 * zmm30) + zmm8
        vbroadcastsd    zmm30, qword ptr [rbx + r15]
        vfmadd231pd     zmm9, zmm29, zmm30      # zmm9 = (zmm29 * zmm30) + zmm9
        vfmadd231pd     zmm10, zmm28, zmm30     # zmm10 = (zmm28 * zmm30) + zmm10
        vfmadd231pd     zmm11, zmm27, zmm30     # zmm11 = (zmm27 * zmm30) + zmm11
        vbroadcastsd    zmm30, qword ptr [rbx + 4*rsi]
        vfmadd231pd     zmm12, zmm29, zmm30     # zmm12 = (zmm29 * zmm30) + zmm12
        vfmadd231pd     zmm13, zmm28, zmm30     # zmm13 = (zmm28 * zmm30) + zmm13
        vfmadd231pd     zmm14, zmm27, zmm30     # zmm14 = (zmm27 * zmm30) + zmm14
        vbroadcastsd    zmm30, qword ptr [rbx + r14]
        vfmadd231pd     zmm15, zmm29, zmm30     # zmm15 = (zmm29 * zmm30) + zmm15
        vfmadd231pd     zmm16, zmm28, zmm30     # zmm16 = (zmm28 * zmm30) + zmm16
        vfmadd231pd     zmm17, zmm27, zmm30     # zmm17 = (zmm27 * zmm30) + zmm17
        vbroadcastsd    zmm30, qword ptr [rbx + 2*r15]
        vfmadd231pd     zmm18, zmm29, zmm30     # zmm18 = (zmm29 * zmm30) + zmm18
        vfmadd231pd     zmm19, zmm28, zmm30     # zmm19 = (zmm28 * zmm30) + zmm19
        vbroadcastsd    zmm31, qword ptr [rbx + rax]
        vfmadd231pd     zmm20, zmm27, zmm30     # zmm20 = (zmm27 * zmm30) + zmm20
        vfmadd231pd     zmm21, zmm29, zmm31     # zmm21 = (zmm29 * zmm31) + zmm21
        vfmadd231pd     zmm22, zmm28, zmm31     # zmm22 = (zmm28 * zmm31) + zmm22
        vfmadd231pd     zmm23, zmm27, zmm31     # zmm23 = (zmm27 * zmm31) + zmm23
        vbroadcastsd    zmm30, qword ptr [rcx + 8*rsi - 8]
        vfmadd231pd     zmm24, zmm30, zmm29     # zmm24 = (zmm30 * zmm29) + zmm24
        vfmadd231pd     zmm25, zmm30, zmm28     # zmm25 = (zmm30 * zmm28) + zmm25
        vfmadd231pd     zmm26, zmm27, zmm30     # zmm26 = (zmm27 * zmm30) + zmm26
        add     rbp, r13
        mov     rbx, rcx
        add     rcx, 8
        cmp     rbp, rdx
        jbe     L928

It is not perfect, but much better than it was half a year ago. Focusing on just the broadcasts:

        vbroadcastsd    zmm30, qword ptr [rcx - 8]
        vbroadcastsd    zmm31, qword ptr [rbx + rsi]
        vbroadcastsd    zmm30, qword ptr [rbx + 2*rsi]
        vbroadcastsd    zmm30, qword ptr [rbx + r15]
        vbroadcastsd    zmm30, qword ptr [rbx + 4*rsi]
        vbroadcastsd    zmm30, qword ptr [rbx + r14]
        vbroadcastsd    zmm30, qword ptr [rbx + 2*r15]
        vbroadcastsd    zmm31, qword ptr [rbx + rax]
        vbroadcastsd    zmm30, qword ptr [rcx + 8*rsi - 8]

These load 1 scalar from the memory location in brackets, e.g. from the pointer rcx minus 8 ([rcx - 8]), and copy it across a SIMD register.
The memory addressing lets us do a fairly complicated expression:
a * x + c + y, where a and c have to be constants (literal integers), while x and y can be integers/pointers. a is only allowed to be 0, 1, 2, 4, or 8.
These loads are across strides of an array, i.e.
B[i,j], B[i,j+1], B[i,j+2], B[i,j+3], ...

So basically, y is a pointer to B[i,j], and x is the stride in bytes.
This lets us load B[i,j+1] (a = 1), B[i,j+2] (a = 2), B[i,j+4] (a = 4), and B[i,j+8] (a = 8).
Then I also have to precalculate values equal to 3x, 5x, and 7x the stride, getting
B[i,j+3] (3x stride, a = 1), B[i,j+6] (3x stride, a = 2), and then the +5 and +7 with a = 1.
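
To make that concrete (illustration only, with a made-up byte stride):

# With a base pointer for B[i,j] and byte stride s between columns, every
# offset from column j to j+8 can be written as scale*index with scale in
# (1, 2, 4, 8), as long as 3s, 5s, and 7s are precomputed into registers.
s = 800                          # hypothetical: 100 Float64 rows per column
s3, s5, s7 = 3s, 5s, 7s          # the precomputed multiples
offsets = (0, 1*s, 2*s, 1*s3, 4*s, 1*s5, 2*s3, 1*s7, 8*s)  # columns j .. j+8
@assert collect(offsets) == [(k - 1) * s for k in 1:9]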

LLVM will not automatically do anything like this, so the code gen needs to be heavy-handed in coercing it into this form.
Also, the code is imperfect, because for some reason there's both rcx and rbx, where rcx == rbx + 8, so it subtracts 8 every time it uses rcx instead of rbx. I have no idea why the compiler would think that's a good idea, but it means we end up with

        mov     rbx, rcx
        add     rcx, 8

when all we should have is

        add     rbx, 8

and replace all the rcx - 8s with rbx. But until I figure out how to convince LLVM to do that, I'll live with it.

Also, something else it has to do is avoid promoting the MM type (which corresponds to vector loads/stores) into a Vec type (which corresponds to slow gather/scatters), unless that is what the indexing requires.

But all this stuff, which is where the complexity comes from, isn't really necessary here.

I think there's a huge advantage to having something that ties these traits together right out of the gate.

I agree. I don't think people should have to depend on LoopVectorization for support (my goal is just to make people want to ;) ).
I think it'd be good to provide an efficient default, but make it possible for people to opt into an alternative method.

# TODO
function unsafe_get_collection_by_index(::IndexLinear, src, inds; kwargs...)
dst = similar(src, Base.index_shape(inds...))
src_iter = @inbounds(view(LinearIndices(src), inds...)) # FIXME not ideal iterator
Member Author
@Tokazama Tokazama Dec 5, 2020

I assume that the default Base._generate_unsafe_getindex!_body isn't as efficient as it could be because it uses Cartesian, APL-style indexing. I figure we could take advantage of arrays where we know the indexing style is linear. However, I'm not aware of an iterator that is ideal for this right now.

Collaborator

It's probably pretty good.
Normally, loads are a lot more important for performance than stores. That's because lots of subsequent operations tend to be dependent on loads, but that wouldn't be the case for a store unless it's immediately reloaded. Meaning an out-of-order CPU normally won't have to wait on a store before executing future instructions.

But here, basically all the CPU is doing is loading and storing, so there isn't any other work the CPU can do instead.
And most CPUs can perform twice as many loads per cycle as stores.
Meaning the stores are probably more important here.

And presumably dst is getting allocated column-major, so this order is pretty good for it. And if we have basically anything other than UnitRanges in I::Vararg{Union{Real, AbstractArray}, N}, the access pattern on src is going to require slow gather instructions anyway.

Member Author

So the base implementation is probably as good as we can do for CPUIndex?

Collaborator

Yeah, it's definitely most of the way there.

@chriselrod
Collaborator

It'd also be good to get the batch integrated as well.

@Tokazama
Member Author

Tokazama commented Dec 5, 2020

Also, I want to make sure I'm using CheckParent correctly. Is it intended as an indicator that data is stored in a parent type and that information about retrieving said data (indexing, pointers, etc.) is entirely dependent on the parent type?

For example, if I were simply attaching some random information to an array, I would probably want device to return CheckParent; but if the wrapper transposes a matrix, that wrapper is critical to how the data is accessed.

@chriselrod
Collaborator

Also, I want to make sure I'm using CheckParent correctly. Is it intended as an indicator that data is stored in a parent type and that information about retrieving said data (indexing, pointers, etc.) is entirely dependent on the parent type?

Yes.
Here is where I'm using CheckParent currently.

For example, if I were simply attaching some random information to an array, I would probably want device to return CheckParent

Yes, you'd probably want it to return CheckParent, because presumably the means of accessing the data are dependent on the parent, e.g. whether it's on the GPU, you can get a pointer, or w/e.

I'm not entirely sure what the point of CheckParent is at the moment, vs the fallback definition of device, which already uses parent_type...

I think we should probably remove CheckParent. Probably not worth going to 3.0 over, but we could deprecate it. I'd be surprised if anyone other than me was using it.

but if the wrapper transposes a matrix, that wrapper is critical to how the data is accessed.

device is about the means of accessing data, which isn't changed by adjoint/transpose wrappers.
stride_rank would be about access pattern/order, and it is changed by adjoint/transpose wrappers.
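
Roughly (a quick sketch; the exact printed values depend on the ArrayInterface version):

using ArrayInterface, LinearAlgebra

A = rand(3, 4)
ArrayInterface.device(A)        # CPUPointer() for a plain Array
ArrayInterface.device(A')       # same device; the adjoint wrapper doesn't change it
ArrayInterface.stride_rank(A)   # first dimension is the contiguous one, e.g. (1, 2)
ArrayInterface.stride_rank(A')  # order reversed by the wrapper, e.g. (2, 1)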

@Tokazama
Member Author

Tokazama commented Dec 5, 2020

device is about the means of accessing data, which isn't changed by adjoint/transpose wrappers. stride_rank would be about access pattern/order, and it is changed by adjoint/transpose wrappers.

The distinction I was trying to draw was that if it is CheckParent, then we don't want to act directly on that structure but instead unwrap it. So if I had something like

struct FriendlyArray{T,N,P} <: AbstractArray{T,N}
    greeting::String
    parent::P
end
# outer constructor so T and N can be recovered from the parent
FriendlyArray(greeting::String, parent::AbstractArray{T,N}) where {T,N} =
    FriendlyArray{T,N,typeof(parent)}(greeting, parent)

unsafe_reconstruct(x::FriendlyArray, data) = FriendlyArray(x.greeting, data)

this would preserve the non-array-specific info, as opposed to Transpose, where we don't care to preserve the wrapper because the new array isn't transposed anymore.

But now that I've typed that all out I think it would be more straightforward to just let people define their own unsafe_get_collection(::FriendlyArray, inds)?
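
Something like this, say (just a sketch; the two-argument signature here mirrors the form mentioned above):

# Hypothetical opt-in: FriendlyArray handles collection itself, delegating to
# its parent and re-wrapping the result so the greeting survives indexing.
function ArrayInterface.unsafe_get_collection(x::FriendlyArray, inds)
    data = ArrayInterface.unsafe_get_collection(x.parent, inds)
    return FriendlyArray(x.greeting, data)
end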

@chriselrod
Collaborator

as opposed to Transpose, where we don't care to preserve the wrapper because the new array isn't transposed anymore.

I don't think I follow here.
The new array being allocated won't be transposed, but we need to keep the wrapper on the array being transposed, because the wrapper changes behavior, meaning our results would otherwise be wrong?

@Tokazama
Member Author

Tokazama commented Dec 5, 2020

Yes. Because if we simply went to the parent of Transpose then strides would be in the incorrect order.

I was just trying to reason about CheckParent when there is a parent array but we don't want to simply jump to the parent array for indexing; but I think the traits in "stridelayout.jl" take care of this, and if an array does something completely out there it should just define its own unsafe_get_collection.

@chriselrod
Collaborator

chriselrod commented Dec 5, 2020

CheckParent was about checking the parent for device info, not for any other information.
You'd look to the transposed (child) array for stride information.

…l array internal

This allows using the internal `_contiguous_batch_size` method for indexing without actually creating a SubArray
The goal here was to create a minimally optimized representation of array
indexing that is agnostic to the hardware it's on. Once we get to
`unsafe_get_collection(::CPUPointer,...)` we
1. Create an `IndexMap` for each indexing argument
2. Identify slice indexing args that map contiguous strides and combine
   them (eliminating unnecessary iterations across loops and stride
   calculations)
3. Combine indexing arguments that result in dropped dimensions (e.g.,
   the last argument of `A[:,1]` drops a dimension). These are computed
   at the head of each indexing call so that `((index - offset) *
   stride)` isn't recomputed on each iteration.
4. Parse into `Expr` and construct the loop

Right now this is slower than Base, requires a bunch of extra code for
parsing into `Expr`, and is buggy. So it's more of a concept than a
full-featured implementation right now.
@Tokazama
Member Author

My latest commit message describes what I'm trying to do here but doesn't provide any examples. So here are some examples of code generated with this:

  • ArrayInterface.getindex(A, :, :) - fuse contiguous loops
quote
    var"##inds_(1, 2)#266" = 0:length(inds[1]) * length(inds[1]) - 1
    for var"##267" = var"##inds_(1, 2)#266"
        var"##strided_1#268" = var"##267"
        dst_i === nothing && break
        unsafe_copyto!(dst + first(dst_i) * 8, src + var"##strided_1#268" * 8, 1)
        dst_i = iterate(dst_itr, last(dst_i))
    end
end
  • ArrayInterface.getindex(A, :, 1) - compute stride from last index outside of loop
quote
    var"##inds_(1,)#269" = 0:length(inds[1]) - 1
    var"##strided_2#273" = (inds[2] - ArrayInterface.StaticInt{1}()) * s[2]
    for var"##270" = var"##inds_(1,)#269"
        var"##strided_1#272" = var"##270" + var"##strided_2#273"
        dst_i === nothing && break
        unsafe_copyto!(dst + first(dst_i) * 8, src + var"##strided_1#272" * 8, 1)
        dst_i = iterate(dst_itr, last(dst_i))
    end
end

I lurked around the "vectorization" package family quite a bit over the past week and I'm fully aware this isn't anywhere near fully optimized. I mostly thought it was interesting to see how far I could take IndexMap and the other traits here toward generating code. That being said, I'm not sure it makes sense to fully implement code gen here given how involved it will be. I really wanted to get this fully working with arrays in the standard library, but it seems like having something like Array work with this completely requires enough work that you may as well just create a new array type and a dedicated package (which is what you're doing elsewhere anyway).

@Tokazama
Member Author

Closing in favor of #141

@Tokazama Tokazama closed this Apr 11, 2021